The puzzling phylogeography of the haplochromine cichlid fish Astatotilapia burtoni

Abstract Astatotilapia burtoni is a member of the “modern haplochromines,” the most species‐rich lineage within the family of cichlid fishes. Although the species has been in use as research model in various fields of research since almost seven decades, including developmental biology, neurobiology, genetics and genomics, and behavioral biology, little is known about its spatial distribution and phylogeography. Here, we examine the population structure and phylogeographic history of A. burtoni throughout its entire distribution range in the Lake Tanganyika basin. In addition, we include several A. burtoni laboratory strains to trace back their origin from wild populations. To this end, we reconstruct phylogenetic relationships based on sequences of the mitochondrial DNA (mtDNA) control region (d‐loop) as well as thousands of genomewide single nucleotide polymorphisms (SNPs) derived from restriction‐associated DNA sequencing. Our analyses reveal high population structure and deep divergence among several lineages, however, with discordant nuclear and mtDNA phylogenetic inferences. Whereas the SNP‐based phylogenetic hypothesis uncovers an unexpectedly deep split in A. burtoni, separating the populations in the southern part of the Lake Tanganyika basin from those in the northern part, analyses of the mtDNA control region suggest deep divergence between populations from the southwestern shoreline and populations from the northern and southeastern shorelines of Lake Tanganyika. This phylogeographic pattern and mitochondrial haplotype sharing between populations from the very North and the very South of Lake Tanganyika can only partly be explained by introgression linked to lake‐level fluctuations leading to past contact zones between otherwise isolated populations and large‐scale migration events.


| INTRODUC TI ON
With an estimated number of 3,000-5,000 species, the Cichlidae represent what is perhaps the most species-rich family of teleost fishes (Turner, Seehausen, Knight, Allender, & Robinson, 2001). Throughout their range, but particularly in the East African Great Lakes, cichlid fishes have repeatedly undergone adaptive radiation and explosive speciation and are thus well-known model systems to study these processes (see, e.g., Kocher, 2004;Salzburger, 2009;Santos & Salzburger, 2012). Within the Cichlidae, the "modern haplochromines" (sensu Salzburger, Mack, Verheyen, & Meyer, 2005) represent the most species-rich lineage. They supposedly originated in the area of Lake Tanganyika and subsequently colonized other water bodies in Africa, thereby seeding the adaptive radiations of lakes Malawi and Victoria, among others (Koblmuller, Sefc, & Sturmbauer, 2008;Salzburger et al., 2005;Verheyen, Salzburger, Snoeks, & Meyer, 2003). It is believed that habitat generalist species were the ones who colonized lakes via a series of temporal river connections, thus transporting genetic polymorphisms across large areas in East Africa (Loh et al., 2013;Malinsky et al., 2015;Salzburger et al., 2005).
Despite the species' application as research model since almost seven decades (e.g., Leong, 1969;Wickler, 1962), little is known about the ecology and behavior of this species in nature, and there is a lack of knowledge on its spatial distribution and phylogeography. Such information is crucial, however, to understand the biology of a species and to interpret laboratory-based experimental results.
Moreover, the geographic origin and genetic relationships of A. burtoni laboratory strains used in different studies are in many cases not reported or unknown.
Previous work, focussing on the adaptive divergence of A. burtoni from lake and stream habitats, already reported high levels of genetic diversity in mitochondrial DNA (mtDNA) and microsatellite markers among populations examined from the southern part of Lake Tanganyika, as well as a deep split between populations from the eastern shoreline, the western shoreline and the headwaters of the Lufubu River (Theis, Ronco, Indermaur, Salzburger, & Egger, 2014). The observed distribution of the main mtDNA haplotype lineages was interpreted to reflect past lake-level oscillations (Theis et al., 2014). Such fluctuations in the lake level, caused by variation in hydrology through time (Cohen, Lezzar, Tiercelin, & Soreghan, 1997;McGlue et al., 2010;Scholz et al., 2007), have previously been documented to affect population dynamics in rock-dwelling, littoral cichlid species from lakes Tanganyika (Baric, Salzburger, & Sturmbauer, 2003;Koblmüller et al., 2011;Sturmbauer, Baric, Salzburger, Rüber, & Verheyen, 2001) and Malawi (Genner, Knight, Haesler, & Turner, 2010). In a follow-up study based on single nucleotide polymorphisms (SNPs) derived from genomic DNA (via restriction site-associated DNA sequencing; RADseq), we confirmed a deep divergence in A. burtoni populations in the South of Lake Tanganyika, in this case, however, between the Lufubu River and all remaining populations including the fish sampled at the estuary of the Lufubu River . Taken together, previous studies not only cover a small fraction of the distribution range of A. burtoni, but revealed somewhat conflicting results with respect to population structure in this species.
In this study, we examine the population structure and phylogeographic history of A. burtoni throughout its entire distribution range in the Lake Tanganyika basin. To this end, we extend our population sample to now include specimens collected within the lake and in inflowing rivers along the entire shoreline of Lake Tanganyika and reconstruct phylogenetic relationships based on sequences of the mtDNA control region (d-loop) as well as thousands of genomewide SNPs derived from RADseq. We then explore the population structure via nearest neighbor haplotype co-ancestry analyses. Finally, by including samples from different laboratory strains in phylogenetic and population genetic analyses, we trace back their origins from wild populations.

| Study sites, sampling, and DNA extraction
Sampling was carried out between February 2010 and November 2015 in the Zambian, Tanzanian, and Burundian parts of Lake Tanganyika and inflowing rivers, as well as in Lake Cohoha F I G U R E 1 Photograph of a male Astatotilapia burtoni from Lake Cohoha, Burundi (Burundi) and Lake Chila (Zambia) (Figures 2 and 3 Burundi (collected in 1975;Fernald & Hirata, 1977b), and one bred at our own laboratory and derived from a laboratory stock established by O. Seehausen. H. Hofman/S. Renn provided an additional set of five wild-caught samples. In total, we gathered samples from 33 locations and two laboratory strains (see Table   S2 for details). All fish collected by the authors of this study were anesthetized with clove oil prior to handling; all specimens were photographed, sized, weighted, and sexed, and a fin clip was taken as DNA sample and stored in 96% ethanol. DNA extraction was performed with the E.Z.N.A. ® Tissue DNA Kit (Omega Biotek ® ) according to the manufacturer's protocol.

| Mitochondrial control region sequencing and analysis
Amplification of a 374-bp fragment of the mitochondrial control region (d-loop) was conducted using published primers (L-Pro-F and TDK-D; Kocher et al., 1989;Salzburger, Meyer, Baric, Verheyen, & Sturmbauer, 2002) and following a published protocol (Theis et al., 2014). PCR products were purified with Exo-SAP-IT (USB) and Sanger-sequenced on an ABI 3130xl genetic analyzer using the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems). Sequences obtained in this study (n = 62; available at GenBank under the accession numbers MG987216-MG987279) were supplemented with available data from previous work (Salzburger et al., 2005;Theis et al., 2014;Verheyen et al., 2003), leading to a data set containing mtDNA sequence information of 428 specimens. DNA sequences were aligned using CODONCODE ALIGNER (v.3.5; CodonCode Corporation) and MAFFT (http://www.ebi.ac.uk/Tools/msa/mafft/). FaBox (Villesen, 2007) was applied to collapse sequences into haplotypes. We then used FITCHI (Matschiner, 2016) to construct an unrooted haplotype genealogy following the method described in Salzburger, Ewing, and von Haeseler (2011) with increased node sizes relative to the branches and population-specific haplotypes (-m 2 and -p option).

| RAD library preparation and sequencing
For RAD sequencing, we selected one to five individuals per sampling location and obtained a total of 150 individuals from 29 locations and including both laboratory strains. Libraries were prepared according to the protocol described in Roesti, Hendry, Salzburger, and Berner (2012). In short, a DNA concentration of 20 ng/μl was used for library F I G U R E 3 Map of LT showing sampling locations and nuclear phylogeny based on RADseq. Populations sampled at the shorelines of Lake Tanganyika (n = 31), Lake Cohoha (LCB, n = 1), and Lake Chila (LCZ, n = 1; full names of localities are given in Table S1). The unrooted maximum-likelihood tree based on 19,037 SNPs and 117 individuals shows a deep split between northern and southern lineages. Colors in the phylogeny correspond to the colors on the map; bootstrap support of nodes is given in per cent. Note that samples from locations 17, 20, 21 and 29 were included for mtDNA analysis only Kalambo preparation allowing for a deviation of ±1 ng/μl. Genomic DNA was digested using the restriction enzyme Sbf1 and 5-mer barcoded followed by subsequent P1 adapter ligation. After barcoding, 38-40 individuals were pooled per RAD library. The DNA was sheared to an average size of approx. 500 bp using a Bioruptor UCD-300 and cleanup was performed using MinElute™PCR purification kit (Qiagen).

| RAD data processing
The obtained RADseq reads were quality filtered, sorted according to

| Phylogenetic analyses
After consensus genotype calling, SNP matrices were generated and converted to FASTA file format applying quality filtering. Only a single SNP with the highest minor allele frequency was allowed per RAD tag. SNPs with more than 20% missing data across all individuals were eliminated, and all individuals with more than 75% missing data dropped out likewise. We generated two different SNP matrices for phylogenetic analyses. The first dataset comprises A. burtoni samples from wild populations only ("SNP matrix wild"; 19,037 SNPs and 117 individuals). In the second SNP dataset, we included one specimen each of Haplochromis paludinosus (Greenwood, 1980), Haplochromis falvijosephi (Loret, 1883), and Astatotilapia calliptera (Günther, 1893) as outgroup taxa, plus the two laboratory strains and additional "wild" samples provided by the University of Texas ("SNP matrix lab_OG" comprising 20,892 SNPs and 132 Individuals; MAF = 0.01). We chose multiple riverine haplochromine species as outgroup taxa because of the uncertain sister-group relationships among riverine haplochromines (see, e.g., Meyer et al., 2015;Salzburger et al., 2005). Note that in both matrices the samples from Kigoma, Tanzania (KIG (5)), dropped out due to poor quality.
Maximum-likelihood trees were generated in R (version 3.2.2) using the phaNgorN package (Schliep, 2011). The appropriate phylogenetic model (GTR + G) was selected via jmodelteSt (Posada, 2008), and a bootstrap analysis with 200 replicates was performed.
The R package ape (Paradis, Claude, & Strimmer, 2004) was then used to visualize the phylogenetic tree.

| D-loop haplotype genealogy
The

| Phylogenetic reconstruction based on RAD data
The maximum-likelihood analyses for each of two datasets resulted in well-resolved and congruent topologies (see Figure 3 for the topology with wild samples only and Figure S1 for the topology including laboratory strains and outgroups).

| RAD co-ancestry matrix
The clustered co-ancestry matrix with FiNeRAdStructure (Figure 4) confirmed the deep split between the northern and southern lin-

| D ISCUSS I ON
In this study, we surveyed the phylogeographic history of the haplochromine cichlid species A. burtoni, a habitat generalist occurring within Lake Tanganyika as well as in inflowing rivers, and tested for genetic substructuring in the natural populations of this widely used model species.
Our phylogenetic reconstructions based on roughly 20,000 SNP markers derived from RADseq provide an unprecedented resolution of the phylogenetic relationships among different A. burtoni populations across the entire distribution range of this species. The SNPbased phylogenetic hypothesis uncovers an unexpectedly deep split in A. burtoni, separating the populations in the southern part of the Lake Tanganyika basin from those in the northern part (Figure 3). This deep divergence is in line with the observed high levels of shared ancestry among individuals within both the southern and the northern lineages and the very low levels of shared ancestry between these two clades ( Figure 4).

Interestingly, in both the southern and the northern clades of
A. burtoni, representatives of riverine populations occupy the most ancestral positions in the phylogeny. In a recent study examining the patterns of genome divergence between lake and river populations of A. burtoni in four river systems in the South of Lake Tanganyika , we found that the Lufubu River fish (LF2) are distinct from the remaining populations examined in that study.
Moreover, it was shown that individuals from the Lufubu lake population (LFL) share similar levels of co-ancestry with individuals from their own population as with specimens collected at LF2; however, whereas LFL individuals also share co-ancestry with the other lake and stream populations in the area, this is not the case for individuals from LF2 (see fig. 2 in Egger et al., 2017; Figure 3 of this study). The inclusion of specimens from 15 additional sampling localities in the southern part of Lake Tanganyika did not change these findings (Figures 2 and 3), corroborating that the A. burtoni populations in the South of Lake Tanganyika were originally colonized from Lufubu River stocks. In the northern clade, the specimen from Kalemie (which is associated with the Lukuga River) occupies the most ancestral branches, suggesting that A. burtoni have colonized the northern part of Lake Tanganyika starting from the Lukuga River. The Lukuga River is the only intermittent outflow of Lake Tanganyika connecting the lake to the Congo drainage via the Lualaba River at periods of high lake-level stands (Cohen et al., 1997;Coulter, 1991;Lezzar et al., 1996). Astatotilapia burtoni is known to occur in the Lukuga River as far as 100 km downstream of its outlet at Kalemie (Kullander & Roberts, 2011;Poll, 1956), but has not been found downstream of the Niemba Falls. At present times, there is no connection between the Lufubu River and the Congo drainage.
However, a past connection enabling faunal exchange between the Lufubu headwaters and the Congo system during extreme flooding or river capture events has previously been proposed (Koblmuller, Katongo, Phiri, & Sturmbauer, 2012;Koch et al., 2007). It thus seems plausible that A. burtoni originated in the upper Congo/Lufubu area and spread from there via the Lukuga toward the central and northern part of the Lake Tanganyika basin and via the Lufubu toward the lakes' southern end. Although we refrain from performing a molecular clock analysis for A. burtoni here due to the lack of reliable external calibration points, previous demographic analyses provide a hint toward the temporal framework for the evolution of A. burtoni. Our previous analyses revealed that the A. burtoni populations from Lufubu River (LF2) and from the lake site near the estuary of the Lufubu River (LFL) diverged between 161-213 ka . That the here reported split between the southern and northern clade of A. burtoni is much deeper than the split between LF2 and LFL (Figure 2) suggests that the two main clades in A. burtoni diverged much earlier than ~200 ka.  1-6, 9). Thus, there is one haplotype lineage with a quite restricted geographic distribution (1), whereas another one shows a more or less lakewide distribution (3), whereby its southern range of occurrence is flanked-at both the eastern and the western shores of Lake Tanganyika-by populations belonging to a third lineage (2). In the area of the Lufubu River, representatives of all three haplotype lineages meet in close geographic proximity. It is of note that there is not a single A. burtoni population in our sample in which we found mtDNA sequences belonging to two different major haplotype lineages.
That some of the southern populations show quite distinct mtDNA haplotypes has already been reported in a previous study (Theis et al., 2014) and interpreted as being due to an underwater ridge around Wonzye Point (WON)/Crocodile Island (CRO) that might have acted as migration barrier at lake-level lowstands between the southeastern and southwestern populations. Surprisingly, the lakewide sampling of the present study revealed mtDNA haplotype sharing between populations from the very North and the very South of Lake Tanganyika, which are more than 600 km apart from each other. For example, the most common haplotype in the South (haplotype 4) has also been found in specimens from Bujumbura (RUL) and Lake Cohoha (LCB), suggesting a rather recent connection between these populations, at least of their females. Given the deep nuclear DNA (ncDNA) divergence between the northern and southern lineages, this pattern in mtDNA is difficult to explain. On the other hand, evidence is accumulating that the replacement of mtDNA across large geographic distances, without apparent signatures of nuclear genomic admixis is more common than previously thought (e.g., Good, Vanderpool, Keeble, & Bi, 2015;Melo-Ferreira, Seixas, Cheng, Mills, & Alves, 2014;Nevado, Fazalova, Backeljau, Hanssens, & Verheyen, 2011;Tang, Liu, Yu, Liu, & Danley, 2012).
More general, discordance between nuclear and mtDNA phylogenetic inferences is known from many freshwater fish taxa and attributed to their high propensity to hybridize (see Wallis et al., 2017).
In particular, in stenotopic, littoral cichlids from Lake Tanganyikasuch as Eretmodus cyanosticus, Tropheus moorii and Variabilichromis moorii-such mtDNA/ncDNA discordance patterns due to introgression/hybridization have been linked to lake-level fluctuations leading to past contact zones between otherwise isolated populations and large-scale migration events (Koblmüller et al., 2011;Nevado, Mautner, Sturmbauer, & Verheyen, 2013;Sefc, Baric, Salzburger, & Sturmbauer, 2007;Sturmbauer et al., 2001). In the genus Tropheus, for example, populations from opposite shorelines in the central and southern basin of Lake Tanganyika have been shown to share identical mtDNA haplotypes (Sturmbauer, Koblmuller, Sefc, & Duftner, 2005;Sturmbauer et al., 2001). It is thus possible that severe lake-level drops in the past could also have enabled migration of A. burtoni across the western and eastern shorelines as well as across the Central/Northern basin at times when Lake Tanganyika was either split into three separate basins or these basins were only Our analyses revealed other puzzling results regarding the phylogeography of A. burtoni. For example, we had previously noticed that the populations in the Kalambo River are not monophyletic, as the specimens collected from a population upstream the ~220 m Kalambo Falls (KA3) turned out to cluster with the specimens from Lunzua River . The inclusion of an additional population sample from further upstream the Kalambo Falls (KA4) confirms this finding (Figure 2), suggesting past migration between the upper Kalambo and the Lunzua River via a past river connection, probably triggered by tectonic movements leading to river capture events (see Cohen et al., 2013;Delvaux, Kervyn, Vittori, Kajara, & Kilembe, 1998). Our previous work revealed that fish collected from the Kalambo River downstream the Kalambo Falls (KA1, KA2) and at a lake side near the river mouth (KAL) form a clade Theis et al., 2014), which led us to suggest that the more downstream populations were seeded by lake fish and that the Kalambo Falls form a barrier to gene flow. The present study, however, contains a specimen collected from the pool just below the Kalambo Falls (KBF), which clusters with the upstream populations KA3 and KA4 in the SNP-based phylogeny (Figure 3). This implies that at least one individual must have survived a drop of more than 220 m (alternatively, a mouthbrooding female might have fallen down and the incubated eggs or larvae survived the plunge).
This study is also the first to report a pure lake population of A. burtoni in Lake Tanganyika that has no direct access to a nearby river via a stretch of shoreline. Astatotilapia burtoni has previously been reported to occur in habitats such as marshy marginal ponds or lagoons, always in association with inflowing rivers (Fernald & Hirata, 1977a). Our own previous work has challenged this view in that we investigated many lake populations and showed that lake fish are phenotypically and ecologically distinct from river fish Theis et al., 2014Theis et al., , 2017. At Crocodile Island (CRO), which is situated about 1.2 km away from the closest (southeastern) shoreline, A. burtoni are found in a water depth of 5-8 m, indicating that A. burtoni can survive and maintain populations in a proper lake habitat.
The SNP-based phylogeny further indicates two likely cases of human-mediated translocation of A. burtoni from Lake Tanganyika into other water bodies. The close genetic relationship between the Lake Chila population (LCZ) and the populations around Mpulungu (KLU, FID), as already discussed in Theis et al. (2014), is most likely due to recent translocation. Lake Chila, a small and shallow lake at Mbala, approximately 30 km south of Mpulungu, Zambia, has regularly been stocked in the past (see Theis et al., 2014). Similarly, the sister clade relationship between samples from Lake Cohoha (LCB) and the Ruzizi estuary (RUL) indicates human-mediated translocation of A. burtoni from Lake Tanganyika into the Lake Cohoha system, about 135 km away from Lake Tanganyika. Note that Lake Cohoha is not connected to the Lake Tanganyika drainage but belongs to the Nile system, and native haplochromine cichlids in that area have previously been associated with the fauna of the Lake Victoria region (Verheyen et al., 2003  sample). Since several decades, A. burtoni is a laboratory model for various research fields such as developmental biology, neurobiology, genetics and genomics, and behavioral biology (see, e.g., Baldo et al., 2011;Diepeveen et al., 2013;Dijkstra et al., 2017;Egger et al., 2017;Hofmann, 2003;Juntti et al., 2016;Lang et al., 2006;Robison et al., 2001;Salzburger et al., 2008;Santos et al., 2014;Theis et al., 2012;Wickler, 1962). Given the high population structure and deep divergence among several clades in A. burtoni, different populations and/or laboratory strains might also vary with regard to the trait(s) under study. In two recent studies dealing with the genomics of sex determination in A. burtoni, Böhne et al. (2016) inferred a XX/XY system located on LG5 for the laboratory strain of the University of Basel (LAB), and a XX/XY system at LG18 for a wild population from the southern lineage (KAL). Roberts et al. (2016), using a laboratory strain that is very likely from the same source population as the one from the University of Texas (HHL), also identified a XX/XY system on LG5 but an additional ZZ/ZW on LG13. Behavioral differences between the HH laboratory strain and southern populations (LZL and KA3) were observed in a study on maternal care (Renn et al., 2009). Hence, we deem it highly relevant to report which natural population or laboratory strain was used in publications in the future.

ACK N OWLED G M ENTS
We thank Daniel Berner, Marius Rösti, and Geoffrey Fucile for help with the RADseq analyses. We further thank Yves Fermon, Russ

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
W.S. and B.E. designed the research; G.P. and B.E. performed the wet laboratory work and analyzed the data; G.P., W.S., and B.E. wrote the manuscript.

DATA ACCE SS I B I LIT Y
Raw RAD sequencing reads generated for this study are available from the Sequence Read Archive (SRA) at NCBI under the accession numbers SRX3733973-SRX3734072 (SRA Study Number: SRP 133290). Mitochondrial D-loop sequences obtained in this study are available at GenBank under the accession numbers MG987216-MG987279.