A practical dilemma in employing species tree approaches is determining which loci to use. In addition to the evolutionary processes of incomplete lineage sorting (ILS) and gene flow, factors that can play a crucial role in the accuracy of species tree estimation include the number of independent loci sampled (Edwards et al. 2007; Edwards 2009), the variability and lengths of those loci (Kuhner et al. 2000; Knowles 2009; Camargo et al. 2012), and perhaps even the information content of the gene trees themselves (e.g., proportion of resolved nodes). A number of recent studies have explored the relationship between the number of loci sampled and species tree accuracy and found that only a modest number of loci may be necessary (Edwards et al. 2007; Camargo et al. 2012). Coalescent-based species tree estimation methods allow for the estimation of population genetic parameters, such as the population size (θ = 2Neμ) and divergence time (τ, generation length in years), which can be yet another gauge of phylogenetic accuracy. Theoretical and simulation work shows that accurate estimates of θ and τ require multiple independent loci, because the addition of more loci is analogous to gaining independent replicates of the evolutionary processes (Kuhner et al. 2000; Edwards and Beerli 2000). Therefore, using fewer loci for species tree inference might have negative impacts on coalescent-based estimates of population genetic parameters (Felsenstein 2006). This leads to the question of whether there is an optimal range of loci that minimizes the number of loci needed to estimate the correct species tree while still providing an accurate estimate of population genetic parameters.
New methods of data acquisition coupled with next-generation sequencing are providing a wealth of data (Hudson 2008; Lerner and Fleischer 2010), but Bayesian species tree estimation is presently incapable of dealing with large numbers of loci. Researchers hence need to decide which subset of loci from the hundreds or thousands of available loci should be included in species tree analyses. Here we investigate how many loci are needed to infer accurate species tree relationships, and whether loci with particular attributes (e.g., parsimony informativeness, variability, gene tree resolution) outperform others. We use subsampling strategies to evaluate the optimal sampling approach when loci are chosen randomly or based on certain characteristics, including sequence variation, the number of nodes resolved, and total information content. First, we examine how many loci are needed to estimate the reference tree (i.e., probable “true” tree) when chosen under different sampling strategies. Second, we determine how the number and information content of loci influence population genetic parameter estimates (θ and τ). Third, we compare the relative effectiveness and consistency of four species tree inference methods.
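The subsampling logic described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Locus` record, its fields, the strategy names, and the example values are all hypothetical, and the sketch assumes that per-locus summaries (number of variable sites, number of resolved gene tree nodes) have already been computed. Loci are drawn either at random or by taking the top-ranked loci for a chosen attribute.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """Hypothetical per-locus summary; field names are illustrative only."""
    name: str
    variable_sites: int   # assumed measure of sequence variation
    resolved_nodes: int   # assumed measure of gene tree resolution


# Ranking keys for the attribute-based strategies (names are assumptions).
RANKING_KEYS = {
    "variation": lambda locus: locus.variable_sites,
    "resolution": lambda locus: locus.resolved_nodes,
}


def subsample_loci(loci, k, strategy="random", seed=None):
    """Return k loci chosen at random or as the top k for one attribute."""
    loci = list(loci)
    if not 0 < k <= len(loci):
        raise ValueError("k must be between 1 and the number of loci")
    if strategy == "random":
        # Sampling without replacement; a seed makes a replicate repeatable.
        return random.Random(seed).sample(loci, k)
    if strategy not in RANKING_KEYS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # sorted() is stable, so ties keep their original input order.
    return sorted(loci, key=RANKING_KEYS[strategy], reverse=True)[:k]


if __name__ == "__main__":
    # Made-up example values, not data from this study.
    demo = [
        Locus("locus_A", 5, 2),
        Locus("locus_B", 12, 1),
        Locus("locus_C", 8, 4),
        Locus("locus_D", 1, 3),
    ]
    for strategy in ("random", "variation", "resolution"):
        chosen = subsample_loci(demo, 2, strategy=strategy, seed=1)
        print(strategy, [locus.name for locus in chosen])
```

Repeating the random draw with different seeds would give replicate subsamples of a given size, whereas the ranked strategies are deterministic for a fixed set of loci.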
We address these questions with an empirical data set focused on the New World chickadees (Aves: Paridae) composed of 40 anonymous nuclear sequence loci and mitochondrial DNA (mtDNA) sequences. The seven congeneric species of chickadees (Poecile atricapillus, Poecile rufescens, Poecile cinctus, Poecile hudsonicus, Poecile carolinensis, Poecile sclateri, and Poecile gambeli) are an analytically tractable group with which to explore species tree inference owing to their modest species diversity, relatively recent divergence, and large population sizes. The phylogenetic relationships of Poecile are of general interest because the clade is used widely in ecological and behavioral research (e.g., Chaplin 1974; Healy and Krebs 1996; Gould et al. 2001; Hill et al. 1980; Olson et al. 2010), making an understanding of its evolutionary history important for comparative analyses.
The most robust previous phylogenetic hypotheses for the chickadees used mtDNA to determine that Poecile includes two well-defined groups: ([hudsonicus, rufescens], cinctus) and (atricapillus, gambeli) (Gill et al. 2005; see also Johansson et al. 2013). However, the relationships of these groups to each other and to the two remaining species, P. carolinensis and P. sclateri, were poorly resolved (Gill et al. 2005). One surprising result was that the morphologically similar species P. atricapillus and P. carolinensis were not sister taxa. Rather, the more morphologically and behaviorally divergent species P. atricapillus and P. gambeli were sister in the mtDNA tree, a relationship consistent with previous reconstructions based on mtDNA restriction site character matrices (Gill et al. 1993).
In this study, we also test for evidence of gene flow between three Poecile species known to hybridize: P. atricapillus, P. gambeli, and P. carolinensis. Poecile atricapillus and P. carolinensis have long been known to hybridize along an extensive zone of contact in eastern North America (Robbins et al. 1986; Bronson et al. 2001; Reudink et al. 2007). Poecile atricapillus and P. gambeli occasionally hybridize in the mountains of western North America (Curry 2005). Gene flow among species is a potential source of gene tree discordance (Slatkin and Maddison 1989; Maddison 1997). The difficulty of distinguishing incongruence stemming from ILS from that stemming from gene flow has impeded the development of phylogenetic methods that can accommodate both processes simultaneously (but see Kubatko 2009; Yu et al. 2012), yet failing to account for gene flow during species tree estimation affects parameter estimation (Leaché et al. 2013). Here, we test whether gene flow influences species tree estimation even when species sampling is specifically designed to decrease the signal of hybridization. Sampling species from distant populations may help mitigate potentially confounding effects of gene flow. However, it is especially important to measure gene flow in this group, as it is a potential source of conflict between the nuclear and mitochondrial gene trees.