Maintaining their genetic distance; limited gene flow between widely hybridising species of Geum with contrasting mating systems

Mating system transition from outcrossing to selfing frequently gives rise to sister lineages with contrasting outcrossing rates. The evolutionary fate of such lineages depends on the extent to which they exchange genes. We measured gene flow between outcrossing Geum rivale and selfing G. urbanum, two sister species derived by mating system transition, which frequently hybridise. A draft genome was generated for G. urbanum and used to develop dd-RAD data scorable in both species. Coalescent analysis of RAD data from allopatric populations indicated that the two species diverged 2-3 Mya, and that long term gene flow between them has been very low (M=0.04). G. rivale showed greater genetic diversity in sympatry than allopatry, but genetic divergence between species was no lower in sympatry than allopatry, providing little evidence for recent introgression. Clustering of genotypes revealed that, apart from four early generation hybrids, individuals in sympatric populations fell into two genetically distinct groups with <1% admixture that corresponded exactly to their morphological species classification. Although our data suggest limited gene flow, we observed joint segregation of two putatively introgressed SNPs in G. urbanum populations that was associated with significant morphological variation; this provides tentative evidence for rare introduction of novel genetic diversity by interspecific gene flow. Our results indicate that despite frequent hybridisation, genetic exchange between G. rivale and G. urbanum has been very limited throughout their evolutionary history.


Introduction 18
A key factor influencing the evolutionary trajectory of lineages is their level of gene exchange with 19 related taxa (Abbott et al. 2016). A reduction in gene flow facilitates speciation (Coyne & Orr 2004). 20 Limited gene flow from a related taxon may increase the genetic variability of a species and enable 21 acquisition of novel traits that enhance its evolutionary potential, while maintaining its distinctness 22 (Paoletti et al. 2006). However, if gene flow is more extensive, species may fuse (Carney et al. 2000). 23 Thus, quantifying gene exchange among lineages is crucial for understanding the birth, continued 24 evolution and possible extinction of species. 25 Quantifying gene flow between lineages is particularly valuable across lineage pairs that share an 26 evolutionary transition, allowing patterns of genetic exchange associated with that evolutionary 27 transition to be recognised. In many plant genera speciation is associated with a mating system 28 transition from outcrossing to selfing (Stebbins 1957;Igic et al. 2006;Wright et al. 2013; Barrett et al. 29 2014). Although sister species with contrasting mating systems are typically ecologically distinct, 30 hybridisation often occurs between them where habitat isolation breaks down or the ranges of the 31 two taxa meet (e.g. Mimulus (Vickery, 1978 The transition from outcrossing to selfing affects genome evolution per se (Wright et al. 2008) and is 34 commonly accompanied by changes in floral display (Sicard & Lenhard 2011). Thus the transition can 35 potentially lead both to post--zygotic barriers to interspecific gene flow, due to interactions between 36 differently evolved outcrossed and selfed genomes, and to pre--zygotic barriers, through its effect on 37 mating opportunities (Fishman & Wyatt, 1999;Hu, 2015). Quantitative measurement of 38 introgression between sister selfing and outcrossing taxa is needed to determine the strength of 39 these effects, and to ascertain whether they prevent integration of novel genetic material into either 40 species. Introgression would be particularly important for highly inbreeding taxa whose adaptive 41 predominantly self--fertilised (t = 0.15) and bears erect flowers adapted to fly pollination (Taylor 48 1997a, b; Ruhsam et al. 2010). 49 Allopatric populations of both species occur: G. rivale in tall herb montane communities in the UK, 50 and at high latitude sites elsewhere in Europe; G. urbanum in the extreme south east of the UK, 51 where rainfall is low, and at the southern extremes of its distribution in continental Europe (Taylor, 52 1997a, b). However, over most of their distributions in the UK and continental Europe the taxa are 53 sympatric. Here they can be found as single species populations in "pure" habitats or together in 54 "mixed" habitats. Typical mixed habitats include naturally wet woodlands and artificially disturbed 55 river banks (Waldren et al. 1989). 56 In mixed habitats the highly fertile F1 hybrid G. x intermedium is widely reported, associated with 57 hybrid swarms (Taylor 1997a). Detailed analysis of a single hybrid swarm indicated the presence of 58 both F1 hybrids and early generation backcrosses to G. rivale that produce fertile seeds by both 59 outcrossing and selfing (Ruhsam et al. 2011(Ruhsam et al. , 2013. The resulting recombinant offspring express no 60 detectable fitness reduction when grown in a benign environment, but some recombinant types are 61 not apparently present in the population of established adults in the hybrid swarm (Ruhsam 2013), 62 perhaps due to selection (Ruhsam et al. 2013). Thus, the mating events occurring in mixed habitats 63 are common and favourable for introgression, particularly via backcrossing to G. rivale. However, so 64 far there has been no assessment of whether this has led to significant gene exchange between the 65 two species. 66 To quantify gene exchange between the two Geum taxa we adopt a comparative genomic approach. 67 We first generate a draft genome of the inbreeding species, G. urbanum, which we use to develop a 68 set of SNP markers common to the two species. These SNPs are then scored in allopatric populations 69 of both species likely to have been geographically isolated for the past 5ky. We analyse these data in 70 a coalescent framework to estimate the age of the two taxa and measure interspecific gene flow 71 from the time at which the species formed to the time when allopatric populations became isolated. 72 We also use the allopatric samples to identify species--specific diagnostic SNPs. 73 To measure recent gene flow between the Geum taxa, in the period since the isolation of the 74 allopatric populations, we analyse an additional set of samples taken from a broad area of sympatry. 75 We look for signals of recent genome--wide introgression by comparing nucleotide diversity and 76 genetic differentiation between the allopatric and sympatric samples. If introgression has occurred, 77 significant increase in genetic diversity within and reduction in genetic differentiation between 78 species are expected in sympatric compared to allopatric samples. We applied a clustering approach 79 to the SNP data to determine the extent of admixture of G. rivale and G. urbanum genomes within 80 individuals. Finally, we score the sympatric sample for previously identified species--specific 81 diagnostic SNPs to quantify hybridisation and introgression at the level of individual genotypes. We 82 relate these estimates of the hybridity of individuals, based on molecular data, to the morphology of 83 these individuals measured in a common environment. 84 85

MATERIALS AND METHODS 86
Sampling 87 i. Sampling of Allopatric Populations 88 Four G. rivale individuals were sampled from each of three high elevation tall herb communities 89 located on three distinct Scottish mountain ranges above the altitudinal limits of G. urbanum (Table  90 S1; Taylor 1997b). One individual was also sampled from a single population in each of northern 91 Sweden and Iceland where G. urbanum is absent (Table S1; http://linnaeus.nrm.se/flora). 92 Two G. urbanum plants were sampled from each of 10 populations in south--east England (Table S1), 93 a region where G. rivale is absent due to lack of suitable habitats (Preston et al. 2002). Single 94 individuals of G. urbanum were also sampled from two areas in Europe where G. rivale does not 95 occur, namely Portugal and south--west France (Table S1;  The bioinformatics and analysis pipeline for all ddRAD data is summarized in Figure S1. To match the 136 read lengths produced by MiSeq and HiSeq Illumina technologies for ddRAD analyses, we used fastx 137 trimmer (http://hannonlab.cshl.edu/fastx_toolkit/) to trim MiSeq reads to 125bp (STACKS requires 138 reads of equal length). We then de--multiplexed and filtered reads for quality using process_radtags 139 (STACKS v 1.21; Catchen et al. 2011Catchen et al. , 2013. This process removed reads with an uncalled base, and 140 those with an average quality score below 20 over a sliding window comprising 15% of a read. 141 Finally, we again trimmed reverse reads to 117bp with fastX trimmer to produce forward and reverse 142 reads of equivalent length after the 8bp barcode was removed from the forward reads.  We characterized polymorphism within and between the Geum species using the 12 allopatric UK G. 188 rivale samples plus one from each of Iceland and Sweden, and 10 UK G. urbanum samples (one from 189 each allopatric UK population) plus one each from France and Portugal (Table S1). We used STACKS' 190 populations module to identify SNPs that are polymorphic within at least one species or show a fixed 191 difference between species. This analysis only considered radtags present in all 26 individuals 192 described above. 193

ii. Inbreeding coefficients and population differentiation 194
We applied the STACKS' populations module to estimate the inbreeding coefficients within each 195 population (F IS ) and measure differentiation among populations (F ST ) for each species using data from 196 the three allopatric G. rivale populations in Scotland (4 individuals/pop) and the ten allopatric G. urbanum populations in England (2 individuals/pop). Estimates of F IS and F ST only considered radtags 198 present in both species, and that were present in all individuals of a given analysis. 199 iii. Identification of species--specific SNPs 200 We used alternately fixed SNPs in allopatric populations of the two taxa (identified in (i.) above) to 201 provide an initial list of species--specific SNPs. To minimize linkage between diagnostic SNPs and 202 obtain an estimate of introgression across the whole genome, we selected one (the first) alternately 203 fixed SNP per scaffold for the introgression analysis. However, we note that by using the first SNP per 204 scaffold, we bias our data against (larger) well--assembled scaffolds. Therefore, we repeated our 205 analyses of introgression using all available species--diagnostics SNPs to test for an effect of such bias, 206 and obtained qualitatively identical results (not shown). 207

iv. Coalescent analysis of gene flow during lineage divergence 208
We used an analytic likelihood framework to assess the support for alternative models of divergence 209 between G. rivale and G. urbanum with and without gene flow. The method is described in Lohse et 210 al. (2016). Briefly, the analysis is based on a single diploid individual for each species and considers 211 the blockwise site frequency spectrum, i.e. the joint frequencies of four polymorphism types (as in 212 "Patterns of Polymorphism within and between species", above) in short blocks of sequence: i ) 213 heterozygous sites exclusive to G. rivale, ii ) heterozygous sites exclusive to G. urbanum, iii) 214 heterozygous sites shared by both species and iv) fixed differences between species. We counted 215 these site types within 117bp radtags (block), and treated each radtag as an independent block. For 216 randomly mating populations, the polymorphisms at each block represent an independent outcome 217 of the coalescent process, which is a function of the species' history. Since STACKs ignores RAD tags 218 that are monomorphic, we conditioned the likelihood on only observing variable blocks. To do this, 219 we normalized the probability of each blockwise mutational configuration by 1--p_IBS, where p_IBS is 220 the probability of identity in state for a block. calculate genetic diversity (π), described above. We estimated d xy for each scaffold, which we assume to be independent with respect to linkage and determined the mean d xy and its S.E. across 271 scaffolds. All analyzed radtags were present in at least 12 individuals for each species/population 272 type combination, and the same scaffolds were analyzed in allopatric and sympatric estimates of d xy 273 (n=418 scaffolds). 274

Genetic Analysis of Hybrids and Introgression in Sympatric Populations 275
i. Cluster analysis using fastSTRUCTURE 276 As a first approach to analyzing introgression between the two Geum taxa in sympatry, we used

ii. Identifying hybrids and introgressed individuals using species--specific SNPs 287
In our second approach to analyzing recent introgression we used custom scripts and the species--288 specific SNPs identified in (iii.) (above) to estimate the fraction of alleles in each individual within the 289 sympatric Berwickshire sample that is G. rivale in origin (Hybrid Index). SNPs with either un--callable 290 genotypes (based on STACKS' likelihood algorithm) or third alleles (e.g., due to sequencing or 291 alignment error) were excluded from the calculations. We simultaneously tabulated the frequency 292 of SNPs that were heterozygous for the species--diagnostic alleles in each individual.
In general, our analyses did not specify minimum coverage because STACKS accounts for coverage 294 when calling genotypes (or leaving a genotype uncalled) (

Development of SNP markers via ddRAD 319
i. Draft Genome 320 The genome assembly contained a total of 170,030 scaffolds, with an N50 of 24.6Kb and total 321 assembly size of 1.2Gb. In order to identify signs that scaffolds represent multiple copies of the 322 genome, the distributions of coverage per scaffold and percent variant bases per scaffold was 323 assessed. Scaffolds of length less than 10Kb were excluded, leaving 32,182 scaffolds (with a total 324 length of 909 Mb). The distribution of percent variant bases per scaffold is shown in Figure S2, and 325 the distribution of coverage per scaffold is shown in Figure S3. Coverage per scaffold follows a 326 roughly normal distribution, which would be expected if the scaffolds mostly represented the same 327 number of copies of the genome. 328 Core genic regions were well assembled, and appeared to be present in approximately three 329 (haploid) copies as expected for an ancient hexaploid. We searched our assembled genome for 248 330 ultra--conserved eukaryotic genes (CEGs), listed by Parra et al. (2007), using their Core Eukaryotic 331 Genes Mapping Approach. We identified 93% and 97% of the core genes that were complete or 332 partially complete, respectively. On average, complete and partially complete CEGs were 333 represented by 3.39 and 3.82 orthologs per CEG, respectively, with at least 90% of CEGs represented 334 by more than one ortholog. 335 ii. ddRAD tags 336 Following quality filtering, 2.7 * 10 7 reads remained in the MiSeq data, and between 6.0 * 10 7 and 7.3 337 * 10 7 reads remained that derived from the five HiSeq libraries. Following alignments, assembling 338 radtags with STACKS, and applying corrections with rxstacks, our dataset included 230,356 radtags for G. rivale and G. urbanum, collectively. However, coverage was highly stochastic. For example, 340 only 2% (4524) of radtags were represented in more than one half of our samples. 341 iii. Identifying and filtering for paralogs 342 When the STACKS' populations module was used to analyse the raw SNP data from the allopatric 343 populations of the inbreeding taxon G. urbanum, F IS estimates were low and sometimes negative, 344 which is unexpected for a highly inbreeding species. This suggests that paralogous reads have 345 mapped to identical locations and thereby increased individual heterozygosity (see Table S4). We 346 therefore applied our filter for paralogy (rejecting 1344 scaffolds with SNPs that exhibit either excess 347 heterozygosity (>0.5) or negative F IS ) and, unless specifically noted, we henceforth only present 348 results from paralogy--filtered SNPs. 349

Population Genetic Analysis of Allopatric Populations 350 i. Patterns of polymorphism within and between species 351
The majority of SNPs in the dataset were polymorphic in G. rivale but invariant in G. urbanum. This 352 type of polymorphic site occurred approximately four times more frequently than the reverse case 353 (Table 1). 22% of SNPs were alternately fixed between the species (Table 1) and only 1.5% of SNPs 354 were shared polymorphism (Table 1). 355 356

ii. Inbreeding coefficients and population differentiation 357
For the allopatric populations sampled in Britain we obtained F IS estimates close to 0.25 for G. rivale, 358 consistent with leaky self--incompatibility (Ruhsam et al. 2010), and F IS greater than 0.9, consistent 359 with very high selfing rates, for eight of the ten UK 'pure' G. urbanum populations (Table 2). Among 360 the two G. urbanum populations with low F IS , the first population (Mill Wood F IS = 0.0845), included 361 one (of the two) sample(s) that was heterozygous at 92% of the 71 polymorphic SNPs analysed, suggesting that this sample was derived from a recent outcrossing event between two diverged 363 inbred lines. In the second population, Selwyn Wood, only one SNP was recorded as polymorphic, 364 suggesting that this population may have been founded by few (possibly a single) highly selfing 365 individuals. 366 Allopatric populations of G. rivale in Britain exhibited less population differentiation than G. urbanum 367 populations (mean pairwise F ST 0.13 and 0.38 respectively), as expected from their contrasting 368 breeding systems. 369 iii. Coalescent analysis of gene flow during lineage divergence 370 Table S5 summarizes the numbers of each polymorphism type and blocks analysed for the G. 371 urbanum --G.rivale sample pairs. All three pairs had approximately 1400 SNPs distributed among ca. 372 660 blocks (i.e., radtags). Model comparisons for all three sample pairs reject a model of strict 373 divergence, and suggest that introgression has occurred between the Geum species (Table 3), but at 374 a very low rate (see below). For two sample pairs (involving G. rivale samples Ben Lui 1 and Ben 375 Lawers 5), the model of gene flow from G. rivale to G. urbanum (i.e., IM r è u ) fit the data significantly 376 better than the model of strict divergence (i.e., div 2 ), whereas the model IM u è r does not fit 377 significantly better than div 2 ( Table 3). Models that include gene flow (IM r è u and IM u è r ) also fit the 378 data significantly better than the div 2 model for the third pair (involving sample Ben Lui 4), but IM r è u 379 and IM u è r have effectively equal support (Table 3). Results from this latter pair likely differ from the 380 former pairs because it includes a single block with a shared heterozygous site, while the other 381 sample pairs lack shared heterozygous sites (Tables S5, S6). As Ben Lui 4 is likely the most 382 heterozygous G. rivale sample (Table S5), we focus on this pair, but note that parameter estimates 383 (Table S6) and general conclusions (i.e., support for very low introgression rates) are similar for all 384 three sample pairs. 385 386 The three models (div 2 , IM r è u, IM u è r ) yield similar estimates of Ne and divergence time (Ben Lui 4 pair: 387 Table 4; all three sample pairs: Table S6). In general, Ne of G. urbanum is a half to a quarter of that 388 of G. rivale (and their common ancestor), and all three models suggest that the species diverged 389 approximately 2 to 3 million years ago (Table 4; Table S6). Models IM u è r and IM r è u both suggest a 390 low but significant long--term rate of effective gene flow (M ≈ 0.04), of approximately one migrant 391 every 25 generations (Table 4; see Table S6 for M for additional sample pairs). 392 393

Population Genetic Comparison of Allopatric and Sympatric Populations 394
i. Genetic diversity in allopatric and sympatric populations 395 Estimates of genetic diversity (π) were based on 826 radtags, each 117 base pairs in length, totaling 396 96 642 base--pairs; as these radtags were the same as those used to develop species--diagnostic SNPs, 397 they occur in both species. Genetic diversity was greater in G. rivale (mean π ± SE: 'allopatric' 0.0044 398 ± 0.0001; Berwickshire 0.0048 ± 0.0001) than G. urbanum ('allopatric' 0.0011 ± 0.0001; Berwickshire 399 0.0011 ± 0.0001), and was similar between population types for G. urbanum. However G. rivale 400 genetic diversity was slightly greater for sympatric than allopatric samples. Since STACKS ignores 401 monomorphic radtags, we note that these estimates of π should be interpreted in terms of relative, 402 but not absolute diversity (see also Arnold et al. 2013). 403 404

ii. Genetic differentiation between taxa in allopatry and sympatry 405
Genetic differentiation between G. rivale and G. urbanum is similar for UK allopatric and sympatric 406 samples (mean d xy (+/--SE) equals 0.0115 +/--0.0005 and 0.0112 +/--0.0005, respectively). Again, we 407 note that, due to the fact that STACKS' ignores monomorphic radtags, these estimates should be 408 viewed as relative measures (and upper limits) of d xy , and not as absolute estimates.  (Table 3). Instead they are homozygous at a 442 substantial fraction of the species--specific SNPs (15% and 18%), suggesting that they may have been 443 derived by selfing of F1 hybrids. Two further individuals contain alleles derived predominantly from 444 G. rivale, but additionally possess a substantial complement of G. urbanum alleles (29% and 12% 445 respectively) ( Table 5). On the basis of their complement of alleles alone, these plants are most likely 446 to represent first and second generation backcrosses to G. rivale respectively. However, their origin 447 must again be more complex, possibly involving selfing, because a substantial fraction of the G. The development of sequence based markers, using next generation sequencing, poses technical 508 problems in our Geum species. Their genomes are large (1.2 and 1.6 Gb in G. urbanum and G. rivale 509 respectively) and although inheritance is disomic the species are ancient hexaploids. This means that 510 there will generally be three closely related copies of every sequence, creating significant difficulties 511 in distinguishing homologs from paralogs (Smedmark et al. 2003). We adopted the ddRAD technique 512 that allows modulation of the number of RAD tags generated per genome to ensure adequate depth 513 of coverage . To maximize the chance of distinguishing homologs and paralogs 514 we used paired end sequencing yielding long reads, and generated a draft genome of the largely 515 homozygous inbreeding species against which to map our ddRADtag sequences. During the data 516 analysis phase we employed very rigorous filtering of scaffolds and SNPs to exclude paralogs, which 517 removed about three--quarters of scaffolds from our dataset. This filtering contributed to the 518 development of a modest number of SNPs compared to comparable studies in other genera (e.g. 519

Hybrids and Introgression in Sympatric Populations
Populus, Christe et al. 2016; Betula, Zohren et al. 2016), but that was nevertheless orders of 520 magnitude higher than for previous microsatellite based studies; moreover, we developed sufficient 521 SNPs to yield high precision estimates of introgression (e.g., 200 SNPs can estimate introgression of 0.5%). Furthermore our RAD data gave estimates of both inbreeding coefficients and population 523 genetic differentiation that are congruent with results from microsatellite markers, providing 524 confidence in the conclusions drawn from their analysis (Ruhsam et al. 2010(Ruhsam et al. , 2013. 525 Coalescent analyses of RAD data from allopatric populations, which contrast the fit of simple models 526 of the species' histories to polymorphism data, gave species' age estimates of between 2 and 3 527 million years. This is considerably older than some estimates of time of divergence for other sister 528 taxa that possess contrasting outcrossing and selfing mating systems. These range from 50 --100kya 529 for Capsella grandiflora and C. rubella (Brandvain et  Our coalescent analyses necessarily rely on simplifying assumptions. First, the blockwise site--550 frequency spectrum approach we adopted assumes a constant N e and µ across blocks and so ignores 551 effects of heterogeneity in mutation rate and the effects of background selection on linked sites, 552 which could lead to a spurious signal of introgression (Ewing & Jensen 2016). Second, our models are 553 drastic simplifications of the history of these species and, in particular, assume constant N e in G. 554 rivale and the ancestral population, which is unrealistic given the climatic fluctuations and the 555 associated range shifts of European taxa during the Pleistocene. Following species divergence, either 556 a change in N e or introgression can alter the joint site--frequency spectrum (Chen 2012); by assuming 557 N anc = N riv , we potentially conflate changes in N e with introgression, which, again, could lead to a 558 spurious signal of introgression. Given these issues and the low estimates of introgression, we do not 559 interpret the support of IM models over the div 2 model as strong evidence for ancient long--term 560 gene flow between the Geum species. 561 562 As a first test of more recent introgression in Geum we conducted comparisons of genetic diversity 563 and differentiation between species in allopatry vs. sympatry. Introgression, which is only possible in 564 sympatric populations, can potentially increase genetic diversity of the recipient species, and erode 565 genetic differences between species. Hence, our comparisons of π and dxy between sympatric and 566 allopatric populations can provide evidence for or against introgression. Relative genetic diversity 567 was greater for G. rivale than G. urbanum, as expected based on their different mating systems 568 (Charlesworth 2003). In addition, genetic diversity was approximately 9% higher for G. rivale in 569 sympatry than in allopatry, but was equal between population types for G. urbanum. While this is 570 consistent with introgression into G. rivale, but not G. urbanum, other factors could account for this result, including a greater effective population size of G. rivale in sympatry than in allopatry. 572 Furthermore, genetic differentiation between species (dxy) was similar in sympatry and allopatry, 573 providing no evidence for introgression. 574 575 As a second test of recent introgression we analysed the genomic composition of individuals from a 576 region of sympatry in Berwickshire and compared them with samples from isolated, reference 577 populations. The cluster analysis performed by fastSTRUCTURE (which did not rely exclusively on 578 species--diagnostic SNPs), and the analysis based on a limited subset of species--diagnostic SNPs 579 identified the same four individuals as early hybrids and backcrosses, and yielded similar genomic 580 composition estimates that were congruent with morphological classifications. Critically, they 581 suggest that the upper limit for the proportion of introgressed genome present in Berwickshire 582 plants is 3%. However, the real value for introgression is likely to be lower because the assumption 583 that all our species diagnostic SNPs are alternately fixed between the species is unlikely to be true. Selection may be particularly effective in the self--fertilising recombinants that are required to allow 618 introgression from G. rivale into G. urbanum. Here reduced effective recombination rates will lead to 619 correlated selective effects across the genome, and deleterious recessive mutations present as 620 genetic load in genome segments derived from outcrossing G. rivale will be exposed (Hu 2015). Given our ability readily to artificially cross the Geum species, and produce large numbers of later 622 generation hybrids, there is now the potential experimentally to test the hypothesis that strong 623 ecological selection maintains species integrity using appropriate field experiments. DNA sequences: will be made available on European Nucleotide Archive before publication 636 Assembled genome: will be made available (University of Edinburgh link) before publication 637 Morphological data: will be made available on Dryad before publication 638 Scripts: will be made available on github.com before publication 639 640 intervals provided in parentheses. See Table S6 for parameter estimates for all three sample pairs. 821 822 and a generic SphI P2 adapter (Table S3)