Mito‐nuclear discordance across a recent contact zone for California voles

Abstract To examine the processes that maintain genetic diversity among closely related taxa, we investigated the dynamics of introgression across a contact zone between two lineages of California voles (Microtus californicus). We tested the prediction that introgression of nuclear loci would be greater than that for mitochondrial loci, assuming ongoing gene flow across the contact zone. We also predicted that genomic markers would show a mosaic pattern of differentiation across this zone, consistent with genomes that are semi‐permeable. Using mitochondrial cytochrome b sequences and genome‐wide loci developed via ddRAD‐seq, we analyzed genetic variation for 10 vole populations distributed along the central California coast; this transect included populations from within the distributions of both parental lineages as well as the putative contact zone. Our analyses revealed that (1) the two lineages examined are relatively young, having diverged ca. 8.5–54 kya, (2) voles from the contact zone in Santa Barbara County did not include F1 or early generation backcrossed individuals, and (3) there appeared to be little to no recurrent gene flow across the contact zone. Introgression patterns for mitochondrial and nuclear markers were not concordant; only mitochondrial markers revealed evidence of introgression, putatively due to historical hybridization. These differences in genetic signatures are intriguing given that the contact zone occurs in a region of continuous vole habitat, with no evidence of past or present physical barriers. Future studies that examine specific isolating mechanisms, such as microhabitat use and mate choice, will facilitate our understanding of how genetic boundaries are maintained in this system.

We then used Picard (http://www.picard.sourceforge.net) and GATK (McKenna et al. 2010) to perform realignment on alignment files in BAM format, as generated by SAMtools (Li et al. 2009). We then used SAMTools/bcftools (Li et al. 2009) to generate a raw variant call format (VCF) file that contained all potential variable and invariable sites. Using the VCF file, we calculated the mean coverage for each individual; only individuals with > 10 X coverage were retained in the dataset. To remove low quality sites (e.g. low coverage, high percentage of missing data, putative paralogs) from our dataset, data in VCF format were filtered using SNPcleaner (v.224, https://github.com/fgvieira/ngsClean) following the protocol of Bi et al. (2013). The specific flags used in SNPcleaner were -k 97 (> 70% of our sample size), -u 3 (minimum read depth coverage of each individual at a site), -d 292 (minimum total read depth at a site), -h 0 (minimum p-value for a Hardy-Weinberg Exact test), and -H 0.0001 (minimum pvalue to test for heterozygosity excess). This bioinformatics pipeline recovered a total of 3,475,028 sites, including variant and non-variant sites.

Calling SNPs and genotypes and related analyses
The procedures employed to call SNPs and assign genotypes are described in the primary text. For the NgsAdmix analysis, all of the 3,475,028 sites identified were used to calculate the likelihood of assigning an individual to a given genetic cluster. Values of -K (number of ancestral populations) ranged from one to ten; we specified -minMaf (the minimum minor allele frequency) to be 0.0038 and -minInd (minimum number of informative individuals) to be 92. Using these settings, we completed 20 runs of NgsAdmix analysis, after which we employed the Evanno et al. (2005) method to determine the optimal value of K. For NgsAdmix analysis within each clade, the -minMaf and -minInd were adjusted according to the number of voles sampled within each clade.
From the full data set of > 3 million genomic sites, we used ANGSD to identify 56,343 SNPs. These SNPs were used to generate an unrooted neighbor-joining tree depicting relationships among individual nuclear genotypes. The same markers were used in demographic models constructed using the program, ∂a∂i. These markers were analyzed using the POPFilter module in PopGenTools (https://github.com/CGRL-QB3-UCBerkeley/PopGenTools) to exclude private alleles, with flags set as: -i 0.2 (missing data allowed per sub-population) and -s 0.2 (total missing data allowed). We then used the realSFS sub-program in ANGSD to select consensus sites that existed across all populations, after which customized scripts were employed to calculate allele frequencies for each SNP and to filter out non-diagnostic SNPs using the cut-off rationale outlined in the Materials and Methods section of the primary text. After completion of these procedures, the resulting dataset contained 4,050 diagnostic SNPs that were used for analyses of cline structure and for calculations of hybrid index (Q) values.

Phylogenetic analysis
The maximum-likelihood tree depicting relationships among cyt-b haplotypes was generated with RAxML and was based on 1,000 rapid bootstrap replicates conducted through the CIPRES supercomputing facility (Miller et al. 2010). Additionally, we used jModelTest2 (Darriba et al. 2012;Guindon & Gascuel 2003) to identify the best-fit model of nucleotide substitution (F81+I+G) for our dataset. This information was used to inform the settings employed in MrBayes 3.2.6; two runs of MrBayes were completed, with each run consisting of 4 chains of 3 million generations, with sampling conducted every 100 generations.
For the BEAST analysis used to date the putative divergence event between lineages of M. californius, a data matrix consisting of one cyt-b sequence from each locality sampled (all 10 sequences were 789 bp) as well as sequences from representatives from other clades of Microtus and the Myodes outgroup was subjected to two independent MCMC runs, with chain lengths of 160 million generations each. Burn-in was set at 10% and output parameters were recorded every 16,000 generations. We used the HKY substitution model with estimated base frequencies and an uncorrelated, relaxed molecular clock. The tree prior was set to employ the Yule process. We forced the analysis to fit four monophyletic groups (the genus Microtus, all North American Microtus, the two M. agretis lineages and the two M. californicus lineages) based on previous studies that strongly support the monophyly of each of these lineages (Conroy & Cook 2000;Fink et al. 2010;Jaarola et al. 2004). Calibrations were placed at the node for the most recent common ancestor of the two M. agrestis phylogenetic lineages in Europe (Pauperio et al. 2012). The node prior was set to a lognormal distribution with a mean value of 0.015 and a Stdev of 0.2. The ucld mean prior was set to a gamma distribution with shape = 0.01 and scale = 1,000. The prior for Yule.birthRate was also set to a gamma distribution with the same shape and scale values. Default settings were used as the priors for the rest of the model parameters. Output trees from these runs were combined (10% burn-in) to generate a maximum clade credibility tree using the LogCombiner and Tree Annotator modules in BEAST v.1.8.3. Cyt-b sequences did not provide informative markers for resolving deep nodes in Microtus (Fink et al. 2010) and thus we did not attempt to determine the divergence dates for deep nodes in the trees generated by our analyses. Instead, we used these analyses to explore the likely dates for more recent divergence events, which are likely to be similar for M. agrestis and the lineages of M. californicus that are the focus of this study.
Demographic model fitting using ∂a∂i Demographic analyses were conducted using the program ∂a∂i; the parameters used in these analyses are given in Table 2. The input SFS was projected to a sample size of N = 50 for each lineage (50,50). The maximum number of iterations for each model was 20. In general, the upper bounds for population size were set to between 10 and 25. Size change upper bound was set to 5 except for the model in which this value is a fraction of original population size. Time since a lineage split or population size change had upper bounds between 10 and 15. Upper bounds for migration were set to between 2 and 5. The lower bounds for the parameters mentioned above were set between 1x10 -2 -1x10 -100 . For the uncertainty analysis, we bootstrapped the original SNP dataset to generate 100 sets that were equal in the number of SNPs in the original dataset. We then used these 100 simulated datasets to run uncertainty analyses using the Godambe method (Coffman et al. 2016) as implemented in ∂a∂i.  Table S2. Mean estimated divergence dates, 95% highest posterior density interval (HPD) and the posterior probability between selected lineages of Microtus and between Microtus and the Myodes outgroup. Estimates are from the maximum clade credibility tree for Microtus generated using BEAST. Node numbers correspond to nodes depicted in Figure S2.   Figure S1. Maximum-likelihood tree for mitochondrial cyt-b sequences from 132 M. californicus sampled during this study; data from 2 outgroup taxa (M. mexicanus and M. ochrogaster) are also included. Analyses were conducted using RAxML; relationships among main branches were identical for a Bayesian tree reconstructed using MrBayes. Numbers above branches support values from ML bootstrap/Bayesian posterior probabilities. Sequences associated with the southern versus northern clades are indicated.  Figure S2. Maximum clade credibility tree depicting the estimated divergence time between the northern and southern lineages of M. californicus. Analyses were conducted for cyt-b sequences using BEAST. Nodes labeled with dots have Bayesian posterior probabilities greater than 0.95. The node numbers correspond to those indicated in Table S3, which provides the estimated divergence date for each node of interest.  Table 1.

Node
(a) (b) Figure S4. (a) Geographic distribution of the cline center for each diagnostic SNP. Data from a total of 4,050 SNPs are shown. Each point represents a single SNP marker; the 95% CI for each estimated cline center is indicated with the vertical bars. The y-axis to the left of the figure depicts absolute distance from sampling locality 1; the y-axis on the right indicates the relative location of each sampling locality. Sampling locality numbers correspond to those in Table 1.   Figure S8. The average genomic clines estimated using relaxed (allele frequencies ≤ 0.2 and ≥ 0.8) and strict (fixed, allele frequencies = 0.0 / 1.0) cut-offs for defining diagnostic SNPs. The cyt-b cline is shown for comparison. See also Table S7.