Taxon sampling to address an ancient rapid radiation: a supermatrix phylogeny of early brachyceran flies (Diptera)

Early diverging brachyceran fly lineages underwent a rapid radiation approximately 180 Ma, coincident in part with the origin of flowering plants. This region of the fly tree includes 25 000 described extant species with diverse ecological roles such as blood‐feeding (haematophagy), parasitoidism, predation, pollination and wood‐feeding (xylophagy). Early diverging brachyceran lineages were once considered a monophyletic group of families called Orthorrhapha, based on the shared character of a longitudinal break in the pupal skin made during the emergence of the adult. Yet other morphological and molecular evidence generally supports a paraphyletic arrangement of ‘Orthorrhapha’, with strong support for one orthorrhaphan lineage – dance flies and relatives – as the closest relative to all higher flies (Cyclorrhapha), together called Eremoneura. In order to establish a comprehensive estimate of the relationships among orthorrhaphan lineages using a thorough sample of publicly available data, we compiled and analysed a dataset including 1217 taxa representing major lineages and 20 molecular markers. Our analyses suggest that ‘Orthorrhapha’ excluding Eremoneura is not monophyletic; instead, we recover two main lineages of early brachyceran flies: Homeodactyla and Heterodactyla. Homeodactyla includes Nemestrinoidea (uniting two parasitic families Acroceridae + Nemestrinidae) as the closest relatives to the large SXT clade, comprising Stratiomyomorpha, Xylophagidae and Tabanomorpha. Heterodactyla includes Bombyliidae with a monophyletic Asiloidea (exclusive of Bombyliidae) as the closest relatives to Eremoneura. Reducing missing data, modifying the distribution of genes across taxa, and, in particular, removing rogue taxa significantly improved tree resolution and statistical support. Although our analyses rely on dense taxonomic sampling and substantial gene coverage, our results pinpoint the limited resolving power of Sanger sequencing‐era molecular phylogenetic datasets with respect to ancient, hyperdiverse radiations.


Introduction
With revolutionary advances in bioinformatics and molecular genetics, ever larger phylogenomic datasets have become possible, and, in fact, necessary, for resolving major evolutionary questions across the tree of life (Smith et al., 2009;Trautwein et al., 2012;Misof et al., 2014;Yeates et al., 2016;Chesters, 2016). The supermatrix approach, a compilation of all available sequence data into a new, large, single analysis, can be a powerful method to build large-scale phylogenetic trees (de Queiroz & Gatesy, 2007). Because they are opportunistically compiled, supermatrices are often incomplete; in some cases, as much as 70-95% of the data are coded as missing (McMahon & Sanderson, 2006;Smith et al., 2009). Despite this incompleteness, the potential pitfalls of inconsistent data distribution and data overlap (see Dell'Ampio et al., 2014) and the ensuing methodological difficulties, supermatrix methods allow for simultaneous analysis of data from a diversity of sources. Supermatrices have been used to reconstruct phylogenetic relationships across diverse lineages in the tree of life, including kingdoms of life (Ciccarelli et al., 2006), mammals (Meredith et al., 2011), reptiles (Thomson & Shaffer, 2010), birds (Burleigh et al., 2015;Jønsson et al., 2016), various groups of plants (Pirie et al., 2008;Soltis et al., 2011;Hinchliff & Roalson, 2013) and diverse insect clades (van der Linde et al., 2010;Peters et al., 2011;Hedtke et al., 2013;Bocak et al., 2014;Kergoat et al., 2014;Piwczyński et al., 2014). Here we apply supermatrix methods to further resolve the phylogeny of flies by taking advantage of existing molecular data.
The evolution of Brachycera brought about fly lineages with stout bodies and strong aerobatic skills -key characteristics of flies for many people -in contrast to earlier diverging mosquito-like or 'nematocerous' lineages. Brachycera and their sister Bibionomorpha also represent an evolutionary shift from a strong dependence on aquatic environments to terrestriality. The majority of larvae of the earliest 'nematocerous' fly lineages live in aquatic or inundated environments and are saprophagous or phytophagous, whereas the larvae of early brachyceran families are largely land-dwellers that have undergone a major transition to predatory diets, in soil or rotting wood. Larvae of Vermileonidae are notable for specialized predatory behaviour in constructing pit traps. Furthermore, Acroceridae, Nemestrinidae and Bombyliidae larvae are parasitoids of other arthropods. Among parasitoid lineages, there are species that exhibit hypermetamorphosis (at least one larval stage differs from the rest, generally mobile while finding a host, then immobile afterwards) and hyperparasitism (where a parasitoid lives in another parasite). Another distinguishing feature of early brachyceran evolution is less mobile pupae with a more heavily reinforced pupal cuticle that includes strong, posteriorly directed spines and setae that aid in emergence from restrictive substrates such as soil and decaying wood; notwithstanding this, the pupae of early brachyceran lineages are still capable of movement. This contrasts with the immobile and unadorned pupae of cyclorrhaphan flies that are encased in the hardened and capsule-like last larval cuticle -the puparium (Woodley et al., 2009). The phylogenetic relationships of early diverging brachyceran fly families have been a challenge to resolve well into the era of molecular phylogenetics (Yeates & Wiegmann, 1999;Wiegmann et al., 2003;Wiegmann et al., 2011). Non-cyclorrhaphan brachyceran families (Table 1) have traditionally been grouped together as Orthorrhapha. The name Orthorrhapha derives from the straight line of breakage in the pupal skin made during the emergence of the adult, in contrast to the circular exit hole in the puparium that characterizes Cyclorrhapha (Brauer, 1880;Lameere, 1906). Despite this characteristic, Orthorrhapha has long been suspected to be paraphyletic, lacking both robust morphological and molecular support (Woodley, 1989;Yeates and Wiegmann, 1999;Yeates, 2002;Wiegmann et al., 2003); in fact, a monophyletic Orthorrhapha in the traditional sense (all non-cyclorrhaphan brachycerans, including Empidoidea) has not been advanced by any recent study.
In contrast to the concept of a monophyletic Orthorrhapha, morphological and molecular studies generally show evidence in favour of joining the orthorrhaphan infraorder Empidoidea (dance flies and relatives) as the sister group to Cyclorrhapha (e.g. Woodley, 1989), together called Eremoneura. In addition to Eremoneura, there is morphological and molecular evidence for Heterodactyla, a group uniting Asiloidea (robber flies and relatives; see Table 1) and Eremoneura (Woodley, 1989). The name Heterodactyla refers to all brachyceran flies with the empodium, the medial tarsal lobe, reduced to be bristle-like or lost (Lambkin et al., 2013). The division of Brachycera based on a bristle-like empodium led to a reciprocal uniting of all groups with a pulvilliform empodium (specifically narrow mediolobus) under the term Homeodactyla (Stuckenberg, 2001) -Stratiomyomorpha, Tabanomorpha, Nemestrinoidea, and Xylophagidae -yet this designation was made without any consideration as to its monophyly. Contrary to Homeodactyla is the more widely accepted Muscomorpha -a clade composed of Heterodactyla + Nemestrinoidea (Nemestrinidae + Acroceridae) (Woodley, 1989) (Table 1), that is supported by multiple apparent morphological synapomorphies, including the antennal flagellum reduced to four or fewer articles, loss of tibial spurs, female cerci 1-segmented, and the structure of the larval antennae (Lambkin et al., 2013).
The first large-scale molecular analysis to address higher-level fly phylogeny returned unexpected results regarding non-cyclorrhaphan brachycerans, with support for a monophyletic Orthorrhapha excluding Empidoidea, as the sister group to Eremoneura (Empidoidea + Cyclorrhapha) . Orthorrhapha excluding Empidoidea were termed Platygenya by Brauer (1883, Table 1), but this term was not widely adopted, so throughout we define Orthorrhapha as Orthorrhapha sensu Wiegmann et al. (2011), excluding Empidoidea. In Wiegmann et al. (2011), a monophyletic Orthorrhapha was supported only in analyses that maximized taxon sampling, whereas analyses of more genes but fewer taxa   Figure S1), or analyses of morphological data alone (Lambkin et al., 2013) recovered a paraphyletic arrangement of the early brachyceran lineages.
Aside from the monophyly of Orthorrhapha, questions about the relationships between major families of non-cyclorrhaphan Brachycera (reviewed in Woodley, 1989;Yeates, 2002;Wiegmann et al., 2003Wiegmann et al., , 2011Sinclair et al., 2013) (Fig. 1) remain to be resolved. Not settled yet, for instance, is the question of whether the two parasitic families Nemestrinidae (parasitoids of insects) and Acroceridae (parasitoids of spiders) are a monophyletic group (= Nemestrinoidea, Woodley, 1989). Weak morphological evidence supports Nemestrinoidea (Woodley, 1989), whereas molecular analyses place the three parasitic families (including Bombyliidae) as distantly related (Winterton et al., 2007;Wiegmann et al., 2011). Another questionable higher-level grouping is the relationship between three infraorders Stratiomyomorpha (soldier flies and relatives), Xylophagomorpha (only including Xylophagidae) and Tabanomorpha; these are sometimes gathered together in what is known as the SXT clade (Yeates, 2002;Wiegmann et al., 2003), although this clade is often weakly supported by both molecules and morphology. The monophyly of Asiloidea including Bombyliidae is another outstanding hypothesis that has weak support based on morphology and has been challenged by molecular data (Trautwein et al., 2010;Wiegmann et al., 2011).
Other key questions regarding early brachyceran phylogeny involve the placement of two unusual and rarely-collected fly genera -Apystomyia Melander and Hilarimorpha Schiner. On the one hand, Apystomyia was formerly considered a putative member of Asiloidea and grouped in Bombyliidae or together with Hilarimorpha in the family Hilarimorphidae, but has recently and consistently been placed as the closest relative to the large radiation of Cyclorrhapha, based on molecular data (Trautwein et al., 2010. Morphological interpretations of the male genitalia of Apystomyia place it instead as the sister group to Eremoneura (Empidoidea + Cyclorrhapha; see Yeates, 2002;Sinclair et al., 2013). Hilarimorpha, on the other hand, remains completely uncertain in its phylogenetic placement, with multiple ambiguous alternatives found in nearly all molecular (Trautwein et al., 2010;Wiegmann et al., 2011) and morphological (Yeates, 1994(Yeates, , 2002 analyses. The aim of this study is to recover the phylogenetic relationships of early diverging brachyceran fly lineages by compiling and analysing existing nucleotide data to determine whether increased taxon sampling (even in a patchy supermatrix) can confirm or refute existing phylogenetic hypotheses. We constructed a supermatrix using data from previous studies (see Materials and Methods) and added taxa to the ingroup and outgroup using methods for supermatrix data mining from GenBank (outlined in Peters et al., 2011). We evaluated the impact of minimizing missing data, differential taxon coverage, and excluding rogue taxa in various analyses. Our findings are based on the largest molecular dataset that has been compiled to date for resolving the phylogenetic relationships of flies.

Sampling design
We included a total of 1217 species of outgroup and ingroup taxa in our supermatrix analyses (Table S1). Our dataset included 399 species of Asiloidea, 102 species of Stratiomyomorpha, 224 species of Tabanomorpha, 43 species of Acroceridae, seven species of Nemestrinidae, eight species of Xylophagidae, one species of Apystomyiidae and one species of Hilarimorphidae from previous phylogenetic studies. We additionally downloaded and parsed sequences up to May 2014 using scripts published in Peters et al. (2011) to include further ingroup species of Eremoneura (368 Empidoidea and 50 Cyclorrhapha) and included in total 14 species belonging to Bibionomorpha, representing nine distinct families of Neodiptera (uniting Brachycera + Bibionomorpha, see Wiegmann et al., 2011) as outgroup taxa ( Figure S7).

Alignments
We selected 20 genes previously used for phylogenetic estimates addressing non-cyclorrhaphan brachyceran fly lineages that are well represented in GenBank. These sequences include 14 nuclear protein-coding genes (AATS1, AATS2, ACE-1, CAD, EF1A, G6PD, PEPCK, PER, PGD, PUG, Rhodopsin, SINA, SNF and TPI), the small and large subunit of nuclear ribosomal RNA genes 18S rRNA and 28S rRNA, two mitochondrial protein-coding genes (COI and COII), and the small and large subunits of the mitochondrial ribosomal RNA genes 12S rRNA and 16S rRNA. Sequences of previously published datasets were collected from the following studies: Yang et al., 2000;Winterton et al., 2001Winterton et al., , 2007Wiegmann et al., 2003;Brammer & von Dohlen, 2007;Dikow, 2009a;Trautwein et al., 2010Trautwein et al., , 2011Wiegmann et al., 2011;Lessard et al., 2013;Winterton & Ware, 2015;Morita et al., 2016. All sequences for each study were re-aligned using MAFFT (Katoh &  Standley, 2013) with the Q-INS-i algorithm for the mitochondrial and nuclear rRNA genes, and with the FFT-NS-i algorithm for all protein-coding genes. Alignments were checked by eye using mega 6.06 (Tamura et al., 2013) and Seaview 4 (Gouy et al., 2010). Obvious errors or frame shifts were corrected manually. We subsequently added sequences from ingroup species belonging to Eremoneura and outgroup species belonging to Bibionomorpha using the profile alignment strategy 'MAFFT --add' (Katoh & Standley, 2013) individually for each gene via the web server (v7; http://mafft.cbrc.jp/alignment/server/add_ sequences.html) with L-INS-1 for rRNA with settings enabled to maintain secondary structure inferences for the previously curated rRNA datasets and using FFT-NS-1 for protein-coding genes. The third codon positions were removed from protein coding genes using the Perl script 'selectSites.pl' (http://raven .iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/ perl-scripts.html) to reduce the high level of noise included in these sites (see Trautwein et al., 2010;Wiegmann et al., 2011 for third position heterogeneity). SequenceMatrix OSX 1.7.8 (Vaidya et al., 2011) was used to concatenate all genes. Duplicates were checked by eye in a text editor and duplicate sequences/taxa were removed based on high sequence identity over a defined sequence length in SequenceMatrix OSX 1.7.8 (Vaidya et al., 2011). We used only the longest sequence for each gene available for a particular taxon. Our largest supermatrix included 1217 taxa and 20 genes comprising an alignment length of 21 369 sites of which 83% was coded as missing (Table 2).

Data filtering and tree searching
We analysed six datasets to evaluate the robustness of inferred phylogenies under different treatments to address missing data, taxonomic coverage of genes and rogue taxa removal. Our datasets are as follows ( From each dataset we constructed an additional subsequent dataset by removing the top 10% of instability scored taxa (or rogue taxa) (Table S1). We measured leaf stability of included species using 'instability_multicore.py' (http://dx.doi .org/10.5061/dryad.6p76c3pb) to rapidly calculate instability scores for large sets of very large trees (Hinchliff & Roalson, 2013). Our datasets with rogue taxa removed are as follows (  (Stamatakis, 2014) was employed to infer maximum-likelihood (ML) trees from each concatenated supermatrix dataset (i-vi) on the North Carolina State University Bioinformatics Research Center cluster (NCSU BRC cluster http://brccluster.statgen.ncsu.edu/). First, we performed ten single ML tree searches with a random start tree, with the dataset partitioned by gene, with linked branch lengths to find the best likelihood tree for each dataset. We tested both approaches, GAMMA and CAT with the substitution model GTR on the largest dataset (ALL) and using only GTR + GAMMA (with default four rate categories) for remaining datasets because parameter estimation is ostensibly more accurate. Next, we applied 100 thorough nonparametric bootstrap replicates to estimate branch support with the GTR + GAMMA model. After analysis, we also performed 1000 partitioned thorough nonparametric bootstrap replicates for a representative tree from dataset MISSING_ROGUE (v) (Fig. 3, Figure S5, tree was selected by topology assessments of majority rule consensus trees in Fig. 2). The bootstrap convergence was tested (-I autoMRE option by RAxML) with 1000 BT (trees converged after 300 bootstraps). Resulting bootstrap support was mapped onto the best ML trees ( Figures S1-S6). Finally, partial tree-wise decisiveness (PDC) was calculated using the program 'decisivator' (https://github.com/josephwb/Decisivator) for our six data sets. To compare branches showing >50% ML Bootstrap Support (MLBS) from each tree, we assessed topologies with majority rule consensus trees from the 100 bootstrap replicates for each dataset using RAxML (Fig. 2). ML trees and concatenated Phylip files are provided as Supplementary materials (Dryad accession #; https://doi.org/10.5061/dryad.cq63m).

Results and Discussion
Here we present the most densely sampled phylogeny to date, in terms of species (1217) and genes (20) addressing relationships of non-cyclorrhaphan Brachycera. Our topologies from all six ML analyses recover strikingly similar arrangements of families across non-cyclorrhaphan brachyceran flies. A monophyletic Orthorrhapha is never recovered (see Wiegmann et al., 2011)  Although the topologies that we recovered are consistent across all datasets, the bootstrap support for deeper relationships in the brachyceran tree is weak, despite extensive taxon sampling. We find that reducing the amount of missing data and therefore increasing data overlap (either by total percentage, or by maximizing the inclusion of genes that have broad taxonomic coverage) has mixed effects on bootstrap support for major clades. In general, an increase in data overlap, in terms of an increasing amount of data present, does not substantially increase node support (Table 3). In contrast, removing rogue taxa alone, or removing rogues along with reducing the amount of missing data, appears to substantially increase branch support across the tree, as well as decisiveness (for further discussion on the term 'decisiveness' in the context of phylogenomic analyses, see Sanderson et al., 2010) (Table 2). Note that we did not address the distribution of missing data that also can have a major impact if distributed unevenly (see Misof et al., 2013).
Our findings both confirm and contest previous hypotheses of non-cyclorrhaphan brachyceran relationships based on morphological and molecular data. We consistently recover two primary brachyceran lineages: Homeodactyla and Heterodactyla (Table 1). Homeodactyla includes a monophyletic SXT clade (including monophyletic Stratiomyomorpha, monophyletic Xylophagidae and monophyletic Tabanomorpha) as the closest relative to a monophyletic parasitoid clade, Nemestrinoidea (Nemestrinidae + Acroceridae) (Figures S1-S6). Within Heterodactyla, we consistently recover a monophyletic Eremoneura and a monophyletic Asiloidea, excluding monophyletic Bombyliidae [datasets (i) and (v) only; Evocoa chilensis is spuriously sister taxon to Eremoneura in all other analyses]. Bombyliidae, a notoriously rogue lineage, evades stable placement, yet is always placed close to either Asiloidea or Eremoneura, congruent with earlier studies (e.g. Wiegmann et al., 2003Wiegmann et al., , 2011. Also, we confirm a sister-group relationship between the small, rare family Apystomyiidae and Cyclorrhapha. Lastly, the enigmatic family Hilarimorphidae remains somewhat phylogenetically ambiguous as it is placed as the closest relative to Homeodactyla across all analyses but with very low branch support (Table 3, Figures S1-S6).
SXT clade. All of our analyses recover a monophyletic SXT clade joining Stratiomyomorpha, Xylophagidae and Tabanomorpha, although the highest bootstrap support is 48% (Table 3). Previous molecular analyses showed weak or no support for an SXT clade (Wiegmann et al., 2003, and likewise strong morphological synapomorphies for the group are lacking (Woodley, 1989;Yeates & Wiegmann, 1999).
The monophyly and placement of infraorder Stratiomyomorpha found by our analyses corresponds with the morphological hypothesis of Yeates (2002) and molecular analyses of Wiegmann et al. (2003) (Fig. 1). An artefactual placement of the species Solva marginata (Meigen) (Xylomyidae) within Stratiomyidae is likely due to the limited data available for this species. Solva marginata is represented only by 12S and 16S mitochondrial ribosomal RNA; another representative Solva sp. CB0434 (28S nuclear ribosomal RNA and nuclear protein coding EF1A) grouped with Xylomyidae as the sister of Xylomya   Rondani ( Figure S5), and this position is more plausible (Brammer & von Dohlen, 2010). The relationship of Tabanomorpha and Xylophagidae is largely congruent with recent morphological and molecular phylogenetic results from Yeates (2002) and Wiegmann et al. (2003). Rhagionidae are paraphyletic and the superfamily Rhagionoidea may include Vermileonidae, Austroleptidae and Bolbomyiidae.

Nemestrinoidea.
A consistent result is the recovery of a monophyletic, moderately supported Nemestrinoidea, with Nemestrinidae and Acroceridae as sister groups (bootstrap support 59-76%, Table 3). This finding is in agreement with the concept of Nemestrinoidea from Woodley (1989) but rejects Hennig (1973) who also included Bombyliidae, which is the other non-cyclorrhaphan brachyceran family of parasitoids (Yeates & Greathead, 1997). Therefore, hypermetamorphic parasitoidism in Nemestrinoidea and Bombyliidae is likely to have evolved independently.
The recovery of a monophyletic Nemestrinoidea is surprising in the sense that this result is in contrast to all previous morphological and molecular phylogenetic analyses that have addressed the relationship between Nemestrinidae and Acroceridae (Yeates, 2002;Wiegmann et al., 2003;Winterton et al., 2007;Wiegmann et al., 2011). Previous findings based on molecular data place Nemestrinidae separate from Acroceridae in various arrangements, with both families paraphyletic with respect to Asiloidea (Winterton et al., 2007), or with Nemestrinidae as the closest relative to Xylophagidae (Trautwein et al., 2010;Wiegmann et al., 2011). Also, from a morphological perspective, Woodley (1989) noted that Nemestrinoidea are only weakly supported by the presence of parasitic, hypermetamorphic larvae (shared by Bombyliidae).
A putative splitting of Brachycera into Homeodactyla and Heterodactyla upends hypotheses about 'Orthorrhapha' being a comb of lineages, from the saprophagous Stratiomyomorpha to Tabanomorpha with predatory larvae and many blood-feeding adults, to Asiloidea with predatory or parasitoid larvae and some predatory adults. Instead, our findings imply that two lineages of Brachycera followed independent evolutionary trajectories. Considering the limited bootstrap support, however, this hypothesis requires more investigation.

Progress on the placement of small, phylogenetically ambiguous families
Apystomyiidae, a monotypic family with only one known extant species, Apystomyia elinguis Melander, but several extinct species (Grimaldi, 2016), was re-collected in 2005 after decades of evading capture since its original discovery in the 1940s and description by Melander in 1950(Melander, 1950. All subsequent molecular analyses have found Apystomyia to be the sister group to the large radiation of Cyclorrhapha with high support (Trautwein et al., 2010;Wiegmann et al., 2011). We also find Apystomyia as closest relative to Cyclorrhapha in the majority of our analyses, although lacking strong bootstrap support. Morphological interpretations of the male genitalia of Apystomyia place it instead as sister group to Eremoneura (Yeates, 2002;Sinclair et al., 2013). In two analyses (ALL and COVERAGE_ROGUE Figures S1 and S6, MLBS <17%), the cyclorrhaphan Lonchoptera Meigen spuriously grouped with Apystomyia. The genus Lonchoptera was represented only by a single fragment of 28S rRNA (Table S1) and all representatives were removed as rogue taxa in further analyses of ALL and MISSING (Table S1). Further phylogenomic analyses are needed to confirm the placement of the rarely-collected Apystomyia fly.
In congruence with previous studies, we find that the phylogenetic placement of Hilarimorphidae cannot be elucidated confidently. Hilarimorphidae is the sister group to Homeodactyla in our analysis and had similar placement in Wiegmann et al. (2011), but in all cases with negligible bootstrap support (<19% ,  Table 3). We do not recover (Hilarimorpha + Apystomyia) as a clade close to Bombyliidae as suggested by Yeates (1994). Although there are some morphological characters that corroborate a relationship between Hilarimorphidae and Eremoneura (abdominal tergite 9 is absent in females of both Hilarimorphidae and Eremoneura, and both lineages lack the lateral ejaculatory processes (Lambkin et al., 2013)), phylogenetic inferences using morphological data have not clearly solved the phylogenetic affinities of Hilarimorphidae (Sinclair & Cumming, 2006;Sinclair et al., 2013). The larvae of Hilarimorphidae, another potential source of phylogenetically informative morphological characters, are unknown. Hilarimorphidae have not been consistently placed with strong support in any molecular phylogeny to date (Trautwein et al., 2010;Wiegmann et al., 2011) and therefore remain the most phylogenetically ambiguous non-cyclorrhaphan brachyceran fly family.

Challenges for supermatrix analyses
Lineage-and gene-specific differences in phylogenetic information, evolutionary rate and undetected paralogy are critical problems for constructing massive, taxon-rich supermatrices.
We sourced data in two different ways for our supermatrix compilation: (i) we used previously curated datasets from published studies of non-cyclorrhaphan brachyceran phylogeny, and (ii) downloaded additional sequences from NCBI using criteria and methods recommended by Peters et al. (2011) in their supermatrix construction pipeline. Sequences for Eremoneura (Empidoidea + Cyclorrhapha) and nematocerous outgroups were obtained through NCBI. Similar to the mega-phylogeny approach (Smith et al., 2009), we chose to include only genes from NCBI that had previously been used in phylogenetic analyses.
The previously curated datasets were carefully compiled by expert systematists in each group. In contrast, the sequences acquired by data mining GenBank were mostly algorithmically trimmed. In compiling our previously curated dataset, MAFFT (Katoh & Standley, 2013) was especially valuable due to its incorporation of secondary structure-considered alignment option for ribosomal genes (Q-INS-i). In an initial set of alignment estimates that treated the fully concatenated dataset as a single alignment, the ribosomal gene sections yielded spurious alignments with large numbers of gaps that would have to be corrected by manual alignment. We then used MAFFT --add, a profile alignment option, which allowed us to constrain previously curated data blocks so that we could subsequently add large numbers of variable length sequences available in NCBI. Alignment complexity due to the availability of highly variable gene fragments may be the most important challenge for the construction of meaningful large datasets from existing data. All applications of automated alignment programs for our datasets resulted in misaligned sequences that required modification by translation and manual editing.
The quality differences in our previously curated data as compared to GenBank-sourced data are evident in our analyses and results. One of our most notable results is that rogue taxa have a strong impact on topology and support values in comparison to the impact of missing data (Fig. 2, Tables 2, 3); as it turns out, the majority (97%) of rogue taxa were mined from GenBank, whereas the rogue taxa from previously curated datasets make up only 3% of all filtered rogues (Table S1). Also, 50% of filtered rogue taxa from dataset (i) are represented by only one gene (Table S1) (see Brower, 2017). Furthermore, the phylogenetic resolution of the GenBank-sourced data is poor, resulting in a polytomy for Eremoneura in all majority-rule consensus trees (Fig. 2). In contrast, the previously curated data for other brachyceran lineages provide a largely resolved tree supported in some cases with moderately high (>50) bootstraps. We suspect that the poor performance of the GenBank-sourced data (e.g. Chesters, 2016) in large part results from poor alignment due to variable sequence quality, length or amplicons originating from different gene fragments, and/or possibly misidentified chimeric species based on different genes.
Missing data and rogue taxa. A high percentage of missing data and rogue taxa can be attributes of supermatrix analyses that pull together data from published studies using Sanger data. The most resolved, well-supported tree in our study inferred from the dataset (v) MISSING_ROGUE included the lowest number of genes (10 of 20), and the lowest percentage of missing data (66%) after removal of rogue taxa (Figs 2, 3). Comparing six datasets, only (v) and (iv) support a monophyletic Tabanomorpha in 50% Majority-rule consensus trees (Fig. 2). Dataset (v) has 664 recovered nodes, and dataset (iv) has 640 recovered nodes while both have same number of taxa. Therefore, we used dataset (v) MISSING_ROGUE as the representative tree for our study.
Aside from reducing overall missing data, we also examined the effects of removing genes that lacked broad taxonomic coverage (inclusion required coverage across at least three infraorders). The COVERAGE dataset (iii) included 14 out of 20 genes, and yielded similarly low bootstrap support for major clades as those that had less missing data [dataset (ii), MISSING]. Yet analyses subsequent to rogue taxa removal showed that minimizing missing data improved bootstrap support substantially more than maximizing taxonomic coverage (Table S1).
Overall, removing rogue taxa from all datasets had the highest impact on improving resolution, support values and phylogenetic decisiveness across the brachyceran tree (Tables 2, 3).

The use of supermatrices for phylogenetic resolution of ancient, hyperdiverse radiations
This study shows both the potential and the limits of traditional Sanger-based molecular systematic datasets that include relatively few genes (∼20 loci) to resolve deep phylogeny within a hyperdiverse taxon. With few genes and a high percentage of missing data, the most challenging radiations may resist resolution no matter how dense the taxon sample (see Dell'Ampio et al., 2014). Like previous analyses of such relationships, we find plausible suggestions of phylogenetic history, but many questions remain due to the complexity of character systems, high diversity and ancient rapid radiations. However, the compilation of these independently derived datasets from multiple sources into a comprehensive study provides an opportunity to explore the impacts of taxon sampling, missing data, data distribution and rogue taxa on tree topologies and statistical support. We improved robustness of the inferred phylogenetic relationships by removing rogues and poorly sampled genes from the all-inclusive supermatrix.
Molecular phylogenetic studies have reached a turning point where Sanger sequencing-based methods may often no longer be a cost-effective alternative to next-generation sequencing methods (Trautwein et al., 2012;Yeates et al., 2016). Parallel developments in bioinformatics, computational speed and high-performance computing have changed large-scale phylogenetic analysis. Most recently, analyses based on large-scale genomic data have been successful in resolving controversial problems within insect evolution (e.g. Misof et al., 2014;Peters et al., 2017). Our supermatrix analysis is a benchmark for non-cyclorrhaphan brachyceran phylogeny and a precedent for phylogenomic studies that rely on transcriptome sequencing and genome reduction methods such as anchored hybrid enrichment (e.g. Young et al., 2016) that are now underway.

Supporting Information
Additional Supporting Information may be found in the online version of this article under the DOI reference: 10.1111/syen.12275 Figure S1. Best ML tree with branch support from 100 thorough bootstrap replicates for dataset (i), ALL.       Table S1. Gene and taxon coverage for dataset (i) with species name, rogue taxa selection list for all six datasets.