High levels of interspecific gene flow in an endemic cichlid fish adaptive radiation from an extreme lake environment

Abstract Studying recent adaptive radiations in isolated insular systems avoids complicating causal events and thus may offer clearer insight into mechanisms generating biological diversity. Here, we investigate evolutionary relationships and genomic differentiation within the recent radiation of Alcolapia cichlid fish that exhibit extensive phenotypic diversification, and which are confined to the extreme soda lakes Magadi and Natron in East Africa. We generated an extensive RAD data set of 96 individuals from multiple sampling sites and found evidence for genetic admixture between species within Lake Natron, with the highest levels of admixture between sympatric populations of the most recently diverged species. Despite considerable environmental separation, populations within Lake Natron do not exhibit isolation by distance, indicating panmixia within the lake, although individuals within lineages clustered by population in phylogenomic analysis. Our results indicate exceptionally low genetic differentiation across the radiation despite considerable phenotypic trophic variation, supporting previous findings from smaller data sets; however, with the increased power of densely sampled SNPs, we identify genomic peaks of differentiation (F ST outliers) between Alcolapia species. While evidence of ongoing gene flow and interspecies hybridization in certain populations suggests that Alcolapia species are incompletely reproductively isolated, the identification of outlier SNPs under diversifying selection indicates the radiation is undergoing adaptive divergence.


Estimation of the extent of linkage disequilibrium
We estimated LD for each species using the R package snpstats (Clayton & Leung 2007), (with R scripts modified from (Martin et al. 2013) ) using the reference-aligned dataset. We estimated LD for pairs of SNPs within each linkage group, and then averaged the correlation coefficient r 2 across all linkage groups within pre-specified distance bins. We did not include in our calculations any scaffolds that were not assigned to specific linkage groups of the reference genome (i.e., those positions within the genome designated as unknown scaffolds, which accounted for 22% of our full alignment; dataset C, table S2). Varying the values of the MAF threshold and the allowed limit of missing data, as well as the number of individuals included per species, had large effects on the dropoff rate of LD, so we used conservative thresholds in our final analysis. LD was estimated for each species separately using only SNPs from the aligned dataset with a minimum allele frequency (MAF) of 0.2 and for sites at which there was data for a minimum of 80% of individuals. We present LD dropoff plots for analysis using 12 individuals per species for all the Alcolapia species. Additionally, for A. grahami we include a plot for n=8 which includes only the Lake Magadi samples, and excludes those from the introduced population in Lake Nakuru. Given the apparent divergence between lakes for O. amphimelas, we present only the analysis of LD dropoff for O. amphimelas samples from Lake Eyasi (n=4); although including the additional samples from Lake Manyara had little impact on the result (data not shown).

Additional STRUCTURE analysis
In addition to the STRUCTURE analysis of the entire Alcolapia dataset described in the main manuscript, we also conducted analyses on subsets of the data. As the full-dataset STRUCTURE runs indicated differences in cluster membership between allopatric and sympatric sites, we also conducted STRUCTURE analysis on a dataset containing only individuals occurring sympatrically in Lake Natron (sites 005 and 011) and ran these analyses separately and combined, with a minimum of 500kb between SNPs and the following parameters: site 05: 12 individuals; 2,180 SNPs; λ=0.8; K=1-8; site 11: 12 individuals; 2,160 SNPs; λ=0.8; K=1-8; site 05 and 11 combined: 24 individuals; 2,227 SNPs; λ=1.1; K=1-16. As the Alcolapia dataset indicated separation within the A. alcalica samples (see Results section), we also repeated the analysis on a separate dataset containing only A. alcalica samples (38 samples, 2,222 SNPs), with an obtained value of λ=0.7184, for five independent runs at each K value of K=1-8.

Generation of a genome-wide SNP dataset using RAD Sequencing
A total of 83.6 Gb of sequence was produced, of which 89% successfully mapped to the O. niloticus reference genome in the alignment stage. Mapping rates were generally consistent across species, although slightly lower for A. grahami than the other species (mean mapping rates: O. amphimelas: 89.9%; A. alcalica: 88.8%; A. latilabris: 88.7%; A. ndalalani: 88.8%, A. grahami: 86.7%). Individual mapping rates per sample are given in Table S1. Following removal of duplicates and poorly paired reads, the dataset comprised 272,360,470 paired reads. From these filtered reads, a total of 1,229,734,617 calls were generated using the GATK UnifiedGenotyper (mean: 12,809,736;range 313,316,544 calls). Five samples with fewer than 2.5 million calls were removed from further analysis, resulting in a dataset of 91 samples (84 Alcolapia species and 7 O. amphimelas). Quality filtering (see Methods for specific parameters) resulted in a final dataset of 28,560,699 retained sites, representing ~3% of the reference genome. The final dataset comprised 544,916 sites that were variable across the whole alignment including O. niloticus (1.91%), of which 238,203 sites were variable across Alcolapia samples and 164,014 sites were variable across A. alcalica samples alone.
A separate dataset was compiled from reads that did not map to the O. niloticus reference genome, as such reads are likely to lie in more divergent regions of the genome. After quality filtering as for the mapped reads, the dataset contained a total of 436,839 sites, of which 5,832 (1.34%) were variable across the 91-sample dataset. The combined dataset of mapped and unmapped reads resulted in a total alignment of 550,748 variable sites (fewer than that from the mapped dataset alone as the O. niloticus reference sites were removed before combining).

Additional STRUCTURE analysis
The STRUCUTRE analysis for the full dataset (presented in the main paper) gave optima for the K value of K=3 and K=4. When the Lake Natron sympatric sites were analysed separately ( Figure S7), both site 05 and site 011 had an lnP(K) optimum of K=1 and an Evanno method optimum at K=2, but the cluster membership differed between sites, with the site 05 analysis indicating most likely membership to a single cluster across all species, while site 011 indicated different cluster membership proportions by species. When analysed jointly, optima were seen at K=2 and K=4, with the latter K value indicating differing cluster membership between species, and between populations with A. alcalica ( Figure S7). Given the separation of northern and southern A. alcalica populations indicated by both the phylogenetic inference and STRUCTURE analyses, we also conducted STRUCTURE analysis on a dataset including only the A. alcalica samples (dataset Q) in order to disentangle population clustering within this species. Replicate runs gave an optimum of K=1, suggesting that populations within A. alcalica are not sufficiently diverged to detect population structure within the species at this scale ( Figure S8).

Effect of data quality filtering on phylogenomic signal
As in previous analyses of very young cichlid radiations (Wagner et al. 2013), we found that inclusion of maximal data from RAD sequencing gave the best phylogenomic resolution, and that removal of additional data in filtering and QC steps (e.g., imposing thresholds for maximum amount of missing data, using higher SNP quality thresholds or higher threshold for individual inclusion) resulted in poorer phylogenomic resolution in ML analysis. This reflects phylogenetic analysis of simulated RAD data, showing that missing data do not appear to negatively impact phylogenetic accuracy, but low levels of informative data may do so (Rubin et al. 2012). Thus, we finally used a dataset imposing minimal threshold for missing data (including all sites with data for >2 individuals), a moderate SNP quality threshold (20) and removed only those individuals with considerably lower sequencing quality than the rest of the dataset (<2.5 million sequenced reads; a total of five individuals). While it may seem intuitively preferable to use maximal stringency in filtering to ensure optimally clean data, several recent analyses have revealed that such stringency may obscure phylogenomic signal in noisy datasets, removing important information in the form of outlier data (Huang & Knowles 2014), and that increased stringency in filtering thresholds may reduce accuracy of downstream analyses e.g., (Hoffman et al. 2014)

Phylogenomic inference using next generation sequencing (NGS) datasets
Despite the considerable increase in phylogenetically informative data from NGS, difficulties remain in finding the best way to analyse such large datasets. Of note, the RAD methodology was originally developed for population-level studies (Baird et al. 2008;Hohenlohe et al. 2010) and have only recently begun to be used in phylogenetic analysis. The more recent studies that investigate phylogeny using RAD data have typically employed phylogenomic inference using matrices of concatenated sequence data from resultant RAD loci rather than using SNP datasets (e.g., (Wagner et al. 2013). Phylogenetic algorithms have typically been developed for sequence data and do not handle concatenated SNP data well as it contains no invariant sites (Bertels et al. 2014), and may violate assumptions of rate heterogeneity, especially as large datasets often preclude the use of partitioning programs to assign different substitution models to different regions. Using the entire sequence data from RAD sampling (rather than only the variable sites) becomes computationally intensive with increase in taxon number, and it seems that taxa number rather than alignment length is the limiting factor in such analysis. For example, our reduced-taxon dataset (n=25; 26 million bp) took 3 days to complete analysis, whereas the full-taxon dataset (n=92; 28 million bp) would have taken >6 months to complete.
Thus, using only variable sites that are phylogenetically informative seems an intuitive way by which to reduce the dataset to feasible size. However we found that using a SNP-only dataset reduced the confidence of ML branch partitions, and at least one species-level node (A. grahami) was unable to be placed with any confidence, even when using the RAxML ASC_ model (ascertainment bias correction) suggested for SNP datasets, while this node was robustly resolved in the analysis of the full sequence data ( Figure 2 in the main text).
As our reduced-taxon dataset was assembled from the individuals with highest sequence quality in each population, we also checked that levels of missing data in the full-taxon SNP-only dataset were not impacting results. However, imposing maximum thresholds of missing data in SNP datasets (maximum 50% and 0% missing data) did not increase bootstrap support or resolve species relationships more robustly (results not shown).
It has been suggested that ascertainment bias correction may not be appropriate for SNP alignments (Pettengill et al. 2014), although employing GTRGAMMA with and without the ASC model in our dataset produced very similar results ( Figures 2C and S3). Though it should be noted that the results are not directly comparable as these models were run on different datasets given the removal of ambiguous bases from the alignment for use of the ASC_ model, as the ASC_ model will not run on alignments containing invariant sites, and sites that contain ambiguous bases are considered invariant if the ambiguous base could be determined to be the same as nonambiguous bases at that site, and there are no other polymorphic bases at that site (RAxML 8 manual). While ambiguous bases in Sanger-produced sequences typically indicate lack of confidence in base-calling rather than genuine heterozygosity, those in NGS datasets are usually called only if reaching a minimum coverage in both alternative bases and so are more likely represent a heterozygous locus, and so certain assumptions from existing phylogenetic models may well be unsuited to such SNP datasets.
While we highlight some of the difficulties here in the use of NGS data for phylogenetic purposes, it is undoubtedly the case that these datasets have already provided resolution to previously unanswered questions, and the continued updating of existing phylogenetic models with SNP-only datasets, along with the development of SNP-specific protocols (e.g., SNAPP program used in our analysis) will continue to refine and enhance the inferences that can be made from such datasets. Figure S1. Catchment area of the Natron-Magadi basin.

SI Figures
Hydrological system and extent of palaeolake and modern lake boundaries. Modified from (Roberts et al. 1993).
K e n y a T a n z a n i a Catchment area Palaeolake basin Existing lake basins Volcano     Dataset: Lake Natron Alcolapia sites 5 and 11 only (n=24), 500kb between sites, 2,227 SNPs Models: Admixture / allele frequencies correlated; No prior population information; λ=0.    Figure S11. Mean inter-specimen uncorrected p-distance.
Comparison of filtered RAD data from aligned and de novo-assembled reads. A) Comparisons within Alcolapia. B) For comparison, the distance to the outgroup (O. amphimelas). Different scale axes are used for each graph (the first entry in each graph is for the same comparison, but at different scales). Error bars are +/-SEM.

18
SI Tables. All tables are also provided in editable (Excel) format, available in a separate file in the online version of this article.