Exploring the phylogeography of a hexaploid freshwater fish by RAD sequencing

Abstract The KwaZulu‐Natal yellowfish (Labeobarbus natalensis) is an abundant cyprinid, endemic to KwaZulu‐Natal Province, South Africa. In this study, we developed a single‐nucleotide polymorphism (SNP) dataset from double‐digest restriction site‐associated DNA (ddRAD) sequencing of samples across the distribution. We addressed several hidden challenges, primarily focusing on proper filtering of RAD data and selecting optimal parameters for data processing in polyploid lineages. We used the resulting high‐quality SNP dataset to investigate the population genetic structure of L. natalensis. A small number of mitochondrial markers present in these data had disproportionate influence on the recovered genetic structure. The presence of singleton SNPs also confounded genetic structure. We found a well‐supported division into northern and southern lineages, with further subdivision into five populations, one of which reflects north–south admixture. Approximate Bayesian Computation scenario testing supported a scenario where an ancestral population diverged into northern and southern lineages, which then diverged to yield the current five populations. All river systems showed similar levels of genetic diversity, which appears unrelated to drainage system size. Nucleotide diversity was highest in the smallest river system, the Mbokodweni, which, together with adjacent small coastal systems, should be considered as a key catchment for conservation.

Quality control is critical for RAD sequencing analyses and is conducted at various stages via an analytical pipeline prior to interpreting results for meaningful biological relationships .
Double-digest RAD (ddRAD) sequencing (Peterson, Weber, Kay, Fisher, & Hoekstra, 2012) addresses several coverage issues in the original RAD protocol by replacing random shearing of fragments with a second restriction enzyme. Targeted fragments are defined on one end by a common restriction site as in standard RAD sequencing, but differ in being flanked by a less common restriction site at the other end (Peterson et al., 2012). This approach results in higher repeatability, better control over genome coverage, greater sharing of sequenced fragments, and similar sequence read proportions across individuals (Peterson et al., 2012). The additional restriction digestion may also introduce artifacts; however, as mutations in restriction sites may result in underestimation of diversity due to allele dropout (Arnold, Corbett-Detig, Hartl, & Bomblies, 2013). Indels, combined with stringent size selection, may also result in loci being dropped or included in particular individuals or populations during ddRAD sequencing (DaCosta & Sorenson, 2014).

Polyploidy complicates most genetic analyses of Labeobarbus.
Many studies of polyploids advocate analysis of non-nuclear markers or the transcriptome (Everett, Grau, & Seeb, 2011). However, a number of studies using RAD sequencing have recently tackled the challenge, particularly in tetraploid fish. Several strategies have emerged to circumvent the complicating issue of paralogy (reviewed in McKinney, Waples, Seeb, & Seeb, 2017). These include removing diallelic markers yielding more than two alleles or haplotypes per individual and excluding loci where more than half the individuals genotyped appear heterozygous . Recently, McKinney et al. (2017) suggested the HDplot approach, which compares heterozygosity at each diallelic locus across a population with read depth for each allele.
In this study, we used ddRAD sequencing of samples from across the distribution of L. natalensis to identify phylogeographic patterns and processes affecting this species. To do this, we developed a pipeline to filter sequencing artifacts and paralogs from diallelic SNP data, resulting in a high-quality genomic resource for use in Labeobarbus.

| Sample collection and DNA extraction
Samples were collected from 13 localities across the KwaZulu-Natal Province, South Africa ( Figure 1; Table S1) between March 2003 and November 2007. The localities selected represented most major drainage systems across the species distribution. Muscle and fin samples were obtained and stored in 96% ethanol at 4°C. DNA was extracted using the phenol-chloroform method (Sambrook, Fritsch, & Maniatis, 1989).
A GeneQuant™ pro RNA/DNA calculator spectrophotometer (Amersham Biosciences, Freiberg, Germany), Qubit ® 2.0 Fluorometer (Invitrogen, USA), and agarose gel electrophoresis were used to assess DNA quality and concentration. High-quality samples were sent to Beijing Genomics Institute Hong Kong Co., Limited (BGI, Hong Kong) to undergo the ddRAD sequencing protocol as per Peterson et al. (2012). In total, 23 high-quality samples were selected for analysis (Table S1).

| Library preparation and sequencing
Samples were divided into two libraries, which were digested with the restriction enzymes NlaIII (CATG/) and MluCI (/AATT) following the double-digest paired-end protocol of Peterson et al. (2012). Each individual was tagged with a unique 4-8 base pair barcode, with two replicate individuals barcoded and sequenced in separate libraries, as controls. Each library was size selected for fragments of 200-400 bp.
Final libraries were sequenced using 90-bp paired-end sequencing in a single lane of an Illumina HiSeq 2000 (Illumina Inc., USA). The resulting reads were screened for poor quality (reads with more than 50% F I G U R E 1 Distribution of samples in this study across KwaZulu-Natal with reference to a map of South Africa (top left). River names are indicated at each sampling site. The color of each sampling site corresponds to the putative population identified in this study. The shape of the symbol indicates an association to either the northern or the southern lineage. This map was produced using QGIS (QGIS Development Team, 2016. QGIS Geographic Information System. Open Source Geospatial Foundation Project. http://www.qgis.org/) and the National Freshwater Ecosystem Priority Areas (NFEPA) project (Nel et al., 2011) low-quality bases i.e., quality value ≤ 5 (E)), trimmed to remove adapters and barcodes, and demultiplexed prior to analysis.

| Bioinformatics pipeline, quality control and read mapping
Data were cleaned both with custom bioinformatic expressions and using the program process_raDtags in stacks 1.44 (Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011;Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Reads were first trimmed to a uniform read length of 80 bp to reduce the effect of sequencing error, after examination of SNP density spectra generated from the untrimmed data ( Figure S2). Quality of the bases was assessed for each sample both before and after trimming using the program FastQC (Andrews, 2010;Figures S3 and S4). After trimming, the parameters −r (rescue RAD tags), −c (clean data, remove reads with an uncalled base), and −q (remove low-quality reads) were specified in process_raDtags. Only reads that remained paired after processing were retained. Adapter pollution, remnants of adapter sequences that were not removed by earlier trimming, was filtered using custom bioinformatic expressions. The degree of overlapping reads and incomplete restriction digestion in the dataset was estimated using custom bioinformatic expressions and CLC genomics Workbench 7.0.4 (CLC Inc., Aarhus, Denmark). The latter program was also used to evaluate read quality and to map filtered reads to the closest cyprinid reference genome, the common carp (Cyprinus carpio, GCF_000951615.1). Filtered reads were similarly mapped to the set of carp coding DNA sequences (CDS) from the same reference after removing mitochondrial genes. We used a mismatch cost of 2, insertion and deletion costs of 3, and length fractions and similarity fractions of 0.9 to retain only highly plausible mappings. Nonspecific matches were ignored.

| Single-nucleotide polymorphism discovery
Due to the lack of a close reference genome, the wrapper program Denovo_map.pl was used to identify diallelic SNP loci by executing ustacks, cstacks, and sstacks (Catchen et al., , 2013. We chose to use only our Read 1 files (NlaIII cut-site) for the assembly to avoid duplication of SNPs due to overlapping reads from fragment ends and false SNPs caused by adapter pollution. Optimal parameters were identified using the method outlined in Paris, Stevens, and Catchen (2017). Briefly, we constructed plots of the average  Table S2). The --max_locus_stacks default value was set to 7 to ensure adequate binning and avoid paralogs. The −t flag was specified while running Denovo_map.pl to remove or split highly repetitive tags during ustacks. From this procedure, we identified optimal parameters for this dataset to be −m 5 −M 1 −n 0.
Mitochondrial reads were detected using nucleotide BLAST (Altschul et al., 1997)

| Population genetic parameters and structure
Output from populations was exported in STRUCTURE and GENEPOP formats and converted to other formats, as needed, using PGD spiDer (Lischer & Excoffier, 2012). STRUCTURE 2.3.4 (Pritchard, Stephens, & Donnelly, 2000) was used to infer population structure with 100,000 chains as burn-in and 500,000 MCMC chains with 20 iterations for K = 1-8. Location was specified as a prior. We followed the same protocol for further hierarchical STRUCTURE runs for the two lineages identified from the primary run. The result files were run through STRUCTURE harvester (Earl & VonHoldt, 2011), and the optimal Kvalue was determined by the method of Evanno, Regnaut, and Goudet (2005). CLUMPP 1.1.2 (Jakobsson & Rosenberg, 2007) was used to visualize the data.
In addition to STRUCTURE, RADpainter and FineRADstructure Pairwise population F ST values were estimated among populations inferred from these analyses using arlequin 3.5.1.2 (Excoffier & Lischer, 2010). Also, the contribution of loci under selection on the observed population structure was assessed by identification of F ST outlier loci using BAYESCAN 2.1 (Foll & Gaggiotti, 2008) and comparing analyses of structure based on all loci, and excluding outlier loci. Finally, we identified a prevalent HindIII satellite using nucleotide BLAST (E ≤ 1 × 10 −10 ) against sequenced monomers available in cyprinids Acrossocheilus paradoxus (AJ241977.1) and C. carpio (M19418.1) and similarly assessed its contribution to population structure.

| Population history
DIYABC 2.1.0 (Cornuet et al., 2014) was used to test a number of simple evolutionary scenarios. The dataset was reduced to 661 SNPs by excluding SNPs with missing data for entire populations. One million simulated datasets were generated per scenario at each stage of the scenario testing. We first tested six basic scenarios of a basal split separating one population from all others or a basal polytomy ( Figure S7A). We then tested 24 ladder-like scenarios of successive divergence ( Figure S7B). We also tested ten scenarios where the ancestral population split into two major lineages which then diverged into five regional populations ( Figure S7C). Finally, we compared the best supported scenarios from the above tests as well as two variants that model the Umgeni population as a product of admixture ( Figure   S7D). Scenarios were assessed using logistic regression analysis (1% of total datasets) comparing observed versus simulated values of standard summary statistics of genic diversity, population structure, and Nei's distances with all other settings at the default values for SNP datasets.
The average nucleotide diversity across all loci (π = 0.0035) was used to determine the long-term effective population size (N e ) using the equation from Tajima (1983): π = 4*N e *μ. The mutation rate per site per generation (μ) is calculated using a rate of between 1 × 10 −8 and 1 × 10 −9 per site per year in SNPs (Brumfield, Beerli, Nickerson, & Edwards, 2003).

| Library preparation and sequencing
Illumina 90-bp paired-end sequencing produced over 137 million reads (Table 1)

| Bioinformatics, mapping, and SNP discovery
Initial quality control of the data identified over 52 million high-quality paired reads without adapter pollution. Mapping reads against the common carp reference genome resulted in an average of 830,241 reads mapped per individual ( Table 2). The average coverage of mapped reads was 2.6% of the C. carpio genome. In contrast, mapping against C. carpio nuclear coding sequences yielded an average number of 55,844 mapped reads per individual with an average coverage of 3.2% of the C. carpio CDS reference.
Over 66% of the paired reads were overlapping, indicating inefficient size selection. Additionally, some of the consensus fragments were so short (<90 bp) that sequencing had extended into the adapters and barcode at the 5′ end. Reads contaminated by adapter pollution were filtered with custom scripts using regular expressions based on restriction enzyme recognition sequences. We observed a high degree of incomplete enzyme digestion, with 9% of reads containing more than one of the same restriction site.
Selecting only the first read from each pair gave the current list of 50,740 loci and 17,256 SNPs identified through the stacks pipeline (  Table 3 and distribution of SNPs across individuals in Figure S8.

| Population structure and variation
Analysis of 723 loci with STRUCTURE revealed a well-supported split at K = 2 between the northern and southern populations (Figure 3; F I G U R E 3 STRUCTURE analysis with K = 2-5 using 723 high-quality filtered loci. Each individual (indicated as columns along the X-axis) is probabilistically assigned (probability of assignment q on the Y-axis) to one of the inferred genetic clusters. Location was specified as prior. K = 2 was recovered as having the most support with the Evanno method (Evanno et al., 2005). CLUMPP was used to produce this representation from 20 replicates Figures S10 and S11). Individuals from the lower Umgeni (geographically between the Mbokodweni and Tugela drainage systems) appeared as potentially admixed individuals between these northern and southern lineages. Some evidence of further structure at K = 3-5 is present, but this is not as well supported. However, hierarchical STRUCTURE ( Figure 4) showed further subdivision into five populations: Umfolozi, Tugela, Umgeni, Mbokodweni, and Mkomaas, from north to south. These results were not well supported (Figures S12-S15). The two samples from the Mzimkhulu system, near the southern distribution limit, were not distinguished from those in the adjacent Mkomaas, and we treat these here as a single population. Similarly, the single sample from Lions River, a tributary of the upper Umgeni, clustered closely with those from the Tugela system, rather than the lower Umgeni, and was considered part of the Tugela population in subsequent analyses (see Discussion below).
FineRADstructure was used to generate a co-ancestry matrix and tree ( Figure 5) showing five populations-the Umfolozi, Tugela, Mkomaas, Mbokodweni, and Umgeni. Further support for these populations was shown in PCA and FCA plots from FineRADstructure ( Figure 6) and genetix (Figure 7). The plots also display variance within populations. The Umgeni was shown to be similar to both neighboring populations (Tugela and Mbokodweni) and was plotted between these two groups.
Positive selection may potentially affect phylogeographic analyses.
Consequently, loci in this dataset were filtered for F ST outliers using Sample number Sample number Sample number

| Population history
The best supported scenario from DIYABC analysis involved a split into northern and southern lineages followed by subdivision into three northern and two southern populations (Figure 9). Other scenarios received similar high support, such as a latitudinal series of splits, either from north to south or vice versa, or a split into two northern and two southern populations with admixture in the Umgeni system ( Figure   S7D). The effective population size is estimated to lie between 87,500 and 875,000 individuals.

| Sequencing and mapping
The average GC content of the reads obtained through RAD sequencing, between 38.5% and 40.8%, was similar to that reported for the zebrafish, Danio rerio, genome (38.6%) (Zhou, Bizzaro, & Marx, 2004) giving confidence that these data were not particularly biased by our . Initial examination of the SNP data, prior to filtering, showed many SNPs in the last few base pairs of reads ( Figure S2) that may reflect sequencing errors rather than true variants (Pujolar et al., 2013).
Trimming removed most of these errors, as is apparent from the SNP density spectrum ( Figure S2), with further errors removed by filtering for singleton allele SNPs and excess heterozygosity.
A general reason for the unexpectedly high genome coverage observed is the paralogous origins of most loci in L. natalensis when mapped against lower-ploidy species such as C. carpio. As the hexaploid lineage L. natalensis has 150 chromosomes (Oellermann & Skelton, 1990) versus 100 in the tetraploid lineage C. carpio (Ráb, Pokorný, & Roth, 1989), a larger amount of the genomic information in the former species originates from shared ancestral sequence duplications. Therefore, the high percentage of mapped reads seen here undoubtedly includes some error from merging of paralogs. Marginally greater coverage was observed in the C. carpio nuclear CDS than in the entire nuclear genome, which suggests a bias in the presence of cut sites toward coding sequences, in agreement with the observed transition/transversion ratio. However, this may also reflect a greater likelihood of mapping reads in coding regions, which are more conserved across these distantly related taxa.  Although the same fragment size range was targeted as in Peterson et al. (2012), the high proportion of overlap between paired reads indicated imperfect size selection. This may be a common issue with genomic methods involving size selection, as low concentrations of nontarget fragments may be favored by biased amplification and sequencing of short fragments. One consequence of poor enforcement of the size threshold is that a considerably larger portion of the genome was sampled.

| Bioinformatics and SNP discovery
A paired-end ddRAD sequencing approach was used in this study to improve the fragment read depth by avoiding the random shearing step in single-digest RAD. This method offers the added benefits of requiring less genomic DNA, high repeatability within and among individuals, reduced library construction costs, and allowing highly multiplexed libraries (Peterson et al., 2012). A number of studies have also shown the method to require small sample sizes to produce highly informative population-level results (Boehm et al., 2015;Macher et al., 2015;Willing, Dreyer, & Van Oosterhout, 2012).
Some predicted drawbacks of this method are the inability of the process to combine stacks of reads as longer contigs, with consequent reduced downstream applications, and potential for bias in estimates of population parameters (Arnold et al., 2013;DaCosta & Sorenson, 2014). In practice, we did not observe these limitations, our mapping showed more extensive coverage than anticipated, and this was reflected in the number of stacks retrieved.
Overall, the method was found to produce relatively low read depths at lower −m thresholds (minimum read depth per stack) across the large number of sites identified, probably due to the high frequency of restriction sites for both enzymes, combined with incomplete digestion and size selection. These enzymes were selected to access many loci across the genome. Although efforts were made to restrict fragment size and to achieve complete restriction enzyme digestion, the retention of low levels of nontarget fragments is common to these methods and may have disproportionate influence on the sequence data. An unanticipated by-product of these technical imperfections, combined with polyploidy of this lineage, is high genome coverage at low read depth. As is typical of many genomic studies, this resulted in a large loss of data through stringent filtering for quality control. In future ddRAD studies, this could be remedied using one rare-cutting enzyme to reduce the number of fragments sequenced or to increase overall sequencing coverage to improve read depth per site. However, by specifying a minimum depth of five reads per stack, we were able to obtain loci of sufficient coverage (>25×) for population genomic analyses (Paris et al., 2017).
Because coverage is so important for identifying errors from sequencing variation, the substantial variance in loci observed when varying −m may indicate that high read depth cutoff should be favored in analyses (Mastretta-Yanes et al., 2015). However, setting the cutoff value too high would result in allele dropout, leading to further errors (Mastretta-Yanes et al., 2015). Here, we used the approach of Paris et al. (2017) to determine the optimal stacks parameters by comparing the number of assembled loci, polymorphic loci, and SNPs obtained.
This allowed us to generate intuitive graphical representations of how parameters in stacks influenced our dataset (Figure 2; Figures S5-S7; Table S2). From these plots, we chose an optimal set of parameter values for the current dataset.   Wagner et al., 2013), but this has been argued against in other studies (Henning et al., 2014). Initially, we opted for a conservative approach to minimize missing data and thereby reduce uncertainty in population analyses. However, we found that selecting low levels of missing data, at a cost of a smaller dataset, resulted in a loss of power to detect phylogeographic structure (results not shown). As the level of missing data is allowed to increase, so does  F I G U R E 8 Venn diagram illustrating the association of alleles between four populations (Mkomaas, Mbokodweni, Tugela, and Umfolozi) after samples from the potentially admixed Umgeni population are removed, resulting in 720 diallelic loci. Values in italics indicate private alleles for each of the four populations, whereas the value in bold indicates alleles found in all populations the signal of phylogeographic structure. This phenomenon has been briefly described in the literature (Campagna et al., 2015;Huang & Knowles, 2014;Takahashi, Nagata, & Sota, 2014;Wagner et al., 2013) and should be viewed as a motivation to include more missing data to reduce potential biases from only examining highly conserved regions.
However, our final datasets retained relatively low levels of missing data-19.7% within our 723 SNP dataset and 0.63% within the DIYABC reduced SNP dataset of 661 markers. The difference in missing data observed between these datasets must be due to removing SNPs which contain missing data for entire populations in the latter dataset. This suggests that most missing data we observe are due to mutations within one of the restriction sites leading to locus dropout at the population level (Arnold et al., 2013). These missing data are therefore not due to technical errors or low coverage, but real biological signal from private population mutations.
We found that some signal of finer-scale structure was being driven by mitochondrial SNPs. STRUCTURE plots generated prior to mitochondrial SNP filtering showed support for higher levels of K, depending on the level of missing data allowed (results not shown).
This was surprising given that there were only 121 mtDNA markers of 50,740 loci prior to other filters. This highlights the need for effective mitochondrial-marker filtering of RAD datasets, as these few SNPs influenced signal from all other SNPs. Additionally, we found that the presence of large numbers of singleton SNPs drowned out the signal of genetic differentiation, as observed previously by Rodríguez-Ezpeleta et al. (2016). This resulted in STRUCTURE plots with no differentiation between populations (data not shown). However, this was resolved by filtering for a minimal MAF set to remove singleton alleles.
Cyprinid genomes include extensive repetitive regions (Henkel et al., 2012) such as the HindIII satellite (Datta, Dutta, & Mandal, 1988;Tseng, Chiang, & Wang, 2008), which is prevalent in our dataset at a frequency of 4.26% prior to filtering, with 7.47% of the loci in the final dataset potentially affiliated with this satellite. The prevalence of this satellite in our data exceeds that of all satellites across the C. carpio genome (2.46%) (Xu et al., 2011). A HindIII satellite has been shown to exhibit intraspecific concerted evolution in Cyprinodon variegatus, whereby it shows low levels of variability within populations and individuals but is distinct between local populations (Elder & Turner, 1994). This is thought to occur either by genetic isolation between populations or by the propagation of new mutational variants across neighboring populations through molecular biological processes such as biased gene conversion (Elder & Turner, 1994). Similar results were recorded in A. paradoxus (Tseng et al., 2008)

| Population genetic parameters and structure
Differentiation between groups using STRUCTURE at K = 2 identified a divide between the northern and southern populations. The split between lineages appears to have occurred around Durban. Although this does not coincide with any well-recognized biogeographic or climatic boundary, it is consistent with a general transition from a speciose tropical fauna, to a highly endemic warm temperate fauna in aquatic organisms (Alexander, Harrison, Fairbanks, & Navarro, 2004;Perera et al., 2011;Seymour, De Klerk, Channing, & Crowe, 2001). being more unexpected. Similar divisions within lineages to the biogeographic regions have been observed in freshwater crabs (Gouws, Peer, & Perissinotto, 2015) and in vertebrate fauna overall (Perera et al., 2011).
The division into two broad lineages dominated our primary analysis in STRUCTURE, and subdivision into the five populations is not as well supported. However, further investigation using hierarchical STRUCTURE and FineRADstructure revealed additional structure.
Despite variation among individuals, PCA and independently generated FCA plots also clearly grouped these into five populations, mainly delimited by river systems.
The two individuals from the lower Umgeni drainage system appeared as potentially admixed samples, which may indicate  (Figures 6 and 7) which again suggests admixture.
The single sample from the upper Umgeni system (Lions River) consistently grouped closely with those from the neighboring Tugela system (Figures 3-7). This is likely a translocated individual from the  (Soltis & Soltis, 2000) due to the additional paralogs present.
The identification of five different populations in this study contrasts against the six haplogroups identified using mitochondrial data (Bloomer et al. Unpublished data). This could be due to retention of ancestral polymorphisms at nuclear loci whereas the lower effective population size of the mitochondrial genome would allow population variation fixation at a more rapid rate. Alternatively, this level of fine-scale differentiation may be beyond the current approach where major historical events could be masking signal from more recent or smaller-scale events, such as in the case of the primary STRUCTURE analysis where K = 2 was determined to be most likely. However, the most likely explanation for this incongruence is that there is gene flow occurring between these putatively isolated populations such that the finer-scale nuclear structure within the northern and southern lineages is not substantial, but the mitochondrial locus is rapidly fixed within local populations and reflects a larger degree of difference between these populations. This may suggest a flaw with popular methods such as DNA barcoding, where the sequencing of mitochondrial genes is solely used to delimit populations or even species. Further investigation is necessary to determine whether this is the case here.

| Population history
Similar levels of support were received for three models in our sce- The N e produced here is an estimate assuming random mating and constant population size and is associated with the long-term population size, not the contemporary census size. Parameters of population history were not estimated in DIYABC due to uncertainty of the priors. Because L. natalensis is distributed throughout most rivers of KwaZulu-Natal, it was assumed that the effective population size and hence nucleotide diversity would be correlated with the geographic size of the drainage systems or with the number of associated rivers in a system. This was reflected in the nucleotide diversities for the Umfolozi, Tugela, and Mkomaas systems; however, the Mbokodweni population had nucleotide diversity even greater than the larger systems of Tugela and Mkomaas despite its restricted range of a single known river. Because this is a reflection of the long-term effective population size, this may indicate that this population historically occupied a more widespread range or could suggest that we have not yet found the full extent of the distribution of this lineage across the coastal rivers in this area.
This highlights the need for more comprehensive sampling across this area to define the current range of this population. The current restricted distribution of this population as shown in this study places it as a priority for conservation purposes.

| CONCLUSION
In this study, we used the ddRAD sequencing approach to reduce the genome complexity of a South African endemic hexaploid fish. Our SNP identification protocol was optimized using the new approach of Paris et al. (2017) and extended by comparing two different approaches for paralog identification and removal, filtering of mitochondrial loci which influenced STRUCTURE results, filtering of singleton allele SNPs which masked genetic structure, and retaining satellite loci and loci potentially under selection which both showed similar results to putatively neutral loci. We demonstrated that although a moderate level of missing data was observed, it was due to locus dropout caused by lineage-specific mutations in one of the two restriction sites. We used our final dataset of 723 SNPs to characterize that two major lineages-northern and southern-diverge into five regional populations-Umfolozi, Tugela, Umgeni, Mbokodweni, and Mkomaasacross the distribution. We found some evidence for north-south The approaches we used together with the resources established in this study will aid in combating the dearth of genetic data available for Labeobarbus and other cyprinids.

ACKNOWLEDGMENTS
This study is dedicated to the memory of our colleague, the late Rob Karssing. We would like to thank the following colleagues for contributing to the sampling in this study: M Nkosi, N Rivers-Moore, J Craigie, H Plank, HE Filter, T Wilkinson, R Arderne, J Wakelin and A Howell.
We would also like to thank Vincent Savolainen, Bengt Hansson, Ulrich

Schliewen, Emmanuel Vreven, Alexander Papadopulos, Ilkser Erdem
Kiper, and their research groups for their contributions to this project.
Thank you also to anonymous reviewers who provided invaluable suggestions and guidance. This project was funded by the South African National Research Foundation (NRF; Innovation doctoral scholarship awarded to CSS and Incentive funding to PB, grant number 77240) and the University of Pretoria's Genomics Research Institute (GRI).
Opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the NRF.