Combining species distribution models and population genomics underlines the determinants of range limitation in an emerging parasite

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.


Introduction
Determining the factors limiting species ranges has long fascinated ecologists (Gaston 2009, Wiens et al. 2010. This issue has recently gained particular attention because of the major implications of assessing the future geographical ranges of species facing global change (Guisan and Thuiller 2005, Moor et al. 2015, Narouei-Khandan et al. 2016. This is especially true for emerging parasites that have already been observed to expand (or shift) their ranges due to human activities, which can have direct negative feedback on human health and biodiversity (Kutz et al. 2005, Morgan et al. 2009, Escobar et al. 2017. Nonetheless, parasites are involved in a triple interaction between the parasite, the host and their common environment (Wolinska and King 2009). Hence, both the ecological conditions of the surrounding environment and the factors related to host defense and/or parasite virulence are likely to determine parasite spatial distribution (Grosholz 1994, Bozick andReal 2015).
Studies on the environmental determinants of parasite distribution have mostly focused on the abiotic features of the environment that affect the demographic dynamics of parasites (Lachish et al. 2013, Pérez-Rodríguez et al. 2013, Pickles et al. 2013. For instance, in many parasites, a higher temperature has been shown to be associated with a broader seasonal window for an efficient transmission to hosts or vectors, which facilitates the colonization of new areas (Kutz et al. 2005, Paull et al. 2015, Caminade et al. 2019. Nonetheless, Lafferty (2009) argued that non-climatic environmental factors should also be considered in order to understand and predict the spatial distribution of parasites. In particular, for parasites inhabiting aquatic ecosystems a broad range of environmental factors (e.g. water temperature, precipitations, water discharge, habitat configuration etc.) that can be strongly structured spatially (e.g. along the upstream-downstream gradient in rivers) have been shown to influence parasite distribution (Cardon et al. 2011, Blasco-Costa et al. 2013, Marcogliese 2016. Given the complexity of host-parasite relationships, the most recent modeling approaches actually tend to integrate both abiotic (e.g. climate, but also land cover, or habitat fragmentation) and biotic factors (e.g. hosts distribution, hosts density, biotic interactions) as potential environmental determinants of the spatial distribution of parasites (Giannini et al. 2013, Gehman et al. 2018, Byers et al. 2019.
Alternatively, many studies have aimed to understand the distribution of parasites by focusing on the individual characteristics of hosts (physiology, behavior, immunity) or parasites (virulence factors, infectivity) that underlie infection levels (Grosholz 1994, Morand et al. 1996, Mougeot 2005. The host genomic background (e.g. the diversity of genes involved in the major histocompatibility complex) seems to be a key determinant of the individual infection status (Bernatchez and Landry 2003, Wegner et al. 2003, Aguilar et al. 2016). In geographic areas in which founder effects and/or bottlenecks have led to a heterogeneous spatial distribution of alleles, parasite distribution could be spatially limited because of the presence of specific alleles conferring resistance (Zhang et al. 2015, Charbonnel et al. 2019). Next-Generation Sequencing approaches have recently opened new research avenues by giving access to thousands of both non-coding and coding markers related to host defense at the genome-wide level (Wenzel et al. 2015, Benavides et al. 2016, Zueva et al. 2018.
Although studies that focus on the influence of host and parasite characteristics on the distribution of parasites are usually set at small spatial scales (but see Schwabl et al. 2017, Vajana et al. 2018, they provide crucial complementary information to that gathered at larger spatial scales from approaches linking environmental features to parasite distribution (Cardon et al. 2011, Fourcade et al. 2014, Vajana et al. 2018. The distribution of parasite infecting poïkilotherms hosts has been shown to be driven by multiple environmental features acting either directly on the parasite life-cycle or indirectly by altering host characteristics (Cardon et al. 2011, Blasco-Costa et al. 2015. For instance, lower water flow has been demonstrated to suppress fish immune response which subsequently altered fish parasite distribution (Bakke andHarris 1998, Marcogliese 2001). However, only a very few studies have simultaneously investigated the relative (or joint) influence of environmental determinants (sensu largo) and host individual characteristics (and notably the role of their genomic background) on parasite distribution (Barrett et al. 2013, Fourcade et al. 2014, Schwabl et al. 2017. To provide such an integrative assessment of the determinants of parasite distribution, we focused on the fish ectoparasite, Tracheliastes polycolpus which has rapidly emerged in French watersheds from northeastern Europe (Rey et al. 2015). Tracheliastes polycolpus was accidentally introduced in France in the Loire River Basin in the 1920s, most probably by the introduction of its main fish host, Leuciscus idus, for aquaculture (Rey et al. 2015). In 15-20 yr T. polycolpus has expanded its range from the Loire River Basin to almost all French river basins (Rey et al. 2015). However, repeated surveys by local managers never recorded infection by T. polycolpus in the northeastern river basins of France such as the Seine, the Rhin and the Meuse River Basins (Fig. 1). These three river basins thus constitute an intriguing geographically delineated unparasitized area. This is highly surprising because 1) the parasite has been able to rapidly colonize all other river basins in France (which suggests a high dispersal ability), and 2) it is found in sites directly adjacent to these unparasitized river basins (see the inset in Fig. 1) which are connected through man-made channels (that impede host dispersal but not the dispersal of the parasite, Mazé-Guilmo 2016). Examining the environmental determinants and the host individual characteristics that potentially limit the parasite's range expansion is an important, yet challenging, objective that could be fulfilled using integrative approaches.
Here, we aimed to identify the environmental determinants and the individual host characteristics that underlie the spatial distribution of T. polycolpus, and that hence explain why its range expansion is currently limited. Specifically, we tested two non-exclusive hypotheses. First, the 'environmental suitability hypothesis' states that environmental features (sensu largo) are driving the distribution of T. polycolpus in France, and it can be predicted that environmental features of the unparasitized area ( Fig. 1) are unsuitable for the parasite life-cycle. Second, the 'genomic background hypothesis' states that the spatial distribution of T. polycolpus is driven by individual host characteristics, and it can be predicted that the genomic background of hosts from the unparasitized area is different from that of hosts from parasitized areas and this genomic background likely confers a resistance to T. polycolpus. To test these two non-mutually exclusive hypotheses, we combined species distribution models (SDMs) and population genomic approaches. The first approach aimed to statistically test whether the environment was suitable for parasites in the unparasitized area. In contrast, the second approach aimed to test whether the genomic structure of host populations underlined the difference in the host infection status (parasitized versus unparasitized) among geographic areas, and whether some genomic regions were statistically related to the infection status of individual hosts.

Biological models
Tracheliastes polycolpus is a crustacean copepod in which females are ectoparasites of freshwater fish (males are dwarf and free-living, Piasecki 1989). Females anchor to the fins of fish where they are fecundated by a single male and rapidly develop two egg sacs carrying on average one hundred eggs each (Loot et al. 2011). After hatching, free-living copepodids are released into the water; this represents the infective stage of the parasite (Mazé-Guilmo, 2016). Fin damage and inflammation are directly caused by the feeding activities of this parasite, which indirectly affects host fitness (Blanchet et al. 2009b).
In French watersheds, Leuciscus burdigalensis and Leuciscus leuciscus (rostrum dace and common dace, respectively) are the main host species of T. polycolpus. Leuciscus leuciscus inhabits rivers from the northern and eastern parts of France and extends its range far east in Eurasia, whereas L. burdigalensis is endemic to rivers of southwestern France (Fig. 1). Previous studies that were based on microsatellite and mitochondrial markers demonstrated that these two species are structured into five lineages in France (i.e. three and two lineages for L. burdigalensis and L. leuciscus respectively, Costedoat et al. 2014), which are geographically delineated by French river basins (Fig. 1). The unparasitized area encompasses three river basins (Rhin, Seine and Meuse, Fig. 1, 2) that are inhabited by a single lineage of L. leuciscus (hereafter referred to as 'lineage III-unparasitized', Fig. 1), and this lineage also inhabits the Normandy River Basin where the parasite is present (hereafter referred to as 'lineage III-parasitized', Fig. 1). Despite strong genetic differentiation, the two Leuciscus species have very similar ecological requirements. They both live in running water with moderate water temperature (from 14°C to 24°C in warmer months), and they are gregarious species which usually feed on benthic invertebrates. Tracheliastes polycolpus also has several alternative host species including Parachondrostoma toxostoma, Gobio sp., Phoxinus phoxinus, Rutilus rutilus and Squalius cephalus (Lootvoet et al. 2013). Average prevalence on alternative hosts are lower than that on the main host species, as it generally ranges from 1% to 10% in alternative host species and from 10% to 90% in the main hosts species (Lootvoet et al. 2013).

Testing the environmental suitability hypothesis
We aimed to test whether environmental conditions could limit T. polycolpus distribution in the host lineage IIIunparasitized. We built a SDM at the French spatial extent linking environmental features to parasite prevalence estimated at the sampling site level. Prevalence here refers to the number of parasitized hosts over the total number of hosts sampled in a given site. We used the SDM to test which environmental variables affect the distribution of T. polycolpus in France, and then used it to predict parasite prevalence as expected based on the actual environmental features. If the environment suitability hypothesis is verified, we would expect the SDM to predict low parasite prevalence in the lineage III-unparasitized, which would indicate that the environment is unsuitable for the parasite in this part of France.

Field data
158 sites were sampled in French rivers from 2009 to 2011 ( Fig. 1). Sites were sampled using electrofishing by the the Agence Française de la Biodiversité (AFB), the Fédérations Départementales de Pêche et de Protection des Milieux Aquatiques (FDPPMA) or directly by our own team. Sampling was always carried out during summer to avoid seasonal effects on parasite prevalence, and the sampling effort was kept as constant as possible. At each site, the number of dace sampled was recorded, their individual body length (mm) was measured, and their individual infection status (parasitized by T. polycolpus or unparasitized) was determined. Data on infection status was used to calculate parasite prevalence at each site (number of parasitized dace divided by the total number of dace sampled). A total of 1935 dace were sampled (12 dace per site in average), among which 908 L. burdigalensis (468 were parasitized) and 1027 L. leuciscus (257 were parasitized). In most sites, a piece of pelvic fin of each dace (5 mm 2 ) was sampled and preserved in 95% alcohol.

Environmental data
We focused on thirteen abiotic and biotic variables that may explain the spatial distribution of T. polycolpus. We focused on biotic variables measured at the individual host level (but averaged at the site level, see hereafter), and both abiotic and biotic variables measured at the site level (see Table 1 for a detailed list).
At the individual level, we considered host body length (mm), expected heterozygosity (He) computed from a set of microsatellite data (Mazé-Guilmo 2016) and identity of the main host species of T. polycolpus on which we recorded infections (i.e. either L. burdigalensis or L. leuciscus, Table 1). Each of these variables was averaged at the site level before being included in the SDM.
At the site level, we first extracted biotic variables related to the fish community composition using the AFB database (Poulet et al. 2011, Blanchet et al. 2014). This database contains fish occurrence data from yearly fish sampling in many sites evenly distributed across France. For each sampling site we drew a list of fish species occurrence over the period 2009-2011, from which we summed the total number of fish species ('fish species richness') and the total number of host species for T. polycolpus ('number of host species', Table 1) defined according to Lootvoet et al. (2013).
We then extracted abiotic variables belonging to four main categories namely climatic variables, hydrographic variables, landscape variables and habitat fragmentation variables (Table 1). Water temperature was not available across all sampling sites and was thus modeled based on air temperature variables. More specifically, we predicted the mean annual water temperature based on a combination of the mean annual air temperature, and the mean monthly air temperatures for January and July extracted for each site from the SAFRAN database (LeMoine 2002) using a random forest model (see Chevalier et al. 2018 for more details). Hydrographic variables were extracted from the 'Réseau Hydrographique Théorique' database (RHT, Pella et al. 2012) at each site: altitude (m), river slope (%), riverine distance of each site from its river source (km), area of the upstream watershed (km 2 ), Strahler index (i.e. the hierarchical level in the hydrographic network of each river), river width (m), river depth (m), average size of bed sediment (mm), minimum average monthly river flow (m 3 s −1 ) and average natural river flow (m 3 s −1 ). Landscape variables were extracted from the Corine land cover 2012 database (<www.statistiques.developpement-durable.gouv. fr/corine-land-cover-0>) as percentages of main land cover types within a 10 km radius around each site using the first level of land cover description (agricultural land, artificialized lands, forest and semi-natural lands and wetlands grouped with bodies of water). We then synthetized hydrographic In the same way, weirs affect water velocity/discharge and may also influence the survival of the infective larvae Weirs position Weirs position relates to the same hypothesis than above and landscape variables using principal component analyses (one PCA per category) in order to avoid both collinearity and over-parametrization in subsequent linear models (Supplementary material Appendix 1 Fig. A1, Table A1). For the hydrographic variables, we conserved the first two axes (58% and 15% of the variance respectively, Supplementary material Appendix 1 Fig. A1a, Table A1) that we named 'Hydrology' and 'Topology' respectively. For landscape variables we conserved the first two axes (50% and 28% of the variance respectively, Supplementary material Appendix 1 Fig. A1b, Table A1) that we arbitrarily named 'Landscape 1' and 'Landscape 2'. Finally, we extracted fragmentation variables from the 'Référentiel des Obstacles à l'Ecoulement' database (ROE, <www.onema.fr/REFERENTIEL-DES-OBSTACLES-A-L>) to quantify habitat fragmentation by weirs (number of obstacles being < 3 m height) or dams (number of obstacles being > 3 m height) within a 10 km radius around each sampling site. We input whether a weir was present directly upstream, downstream or both upstream and downstream for each sampling site (within a 2 km radius) to generate a more direct estimate of fragmentation at the sampling site level (categorical variable: upstream, downstream, both upstream and downstream or no weir at all). All abiotic characteristics were extracted using a Geographic Information System (QGIS Development Team 2008) and PCAs were run using the R package 'ade4' (Dray et al. 2007).

Statistical analyses
We developed a SDM to explain T. polycolpus prevalence at the site level by using a model accounting for spatial dependence among sites as we detected a strong signal of spatial autocorrelation for T. polycolpus prevalence (Moran test, p-value < 0.001). We used the R package 'spaMM' (Rousset and Ferdy 2014) and its function 'corrHLfit' to specify geographic coordinates of each sampling site into continuous random terms. We first built a full model fitted on the dataset covering the current range of T. polycolpus (i.e. parasitized areas, Fig. 1, n = 128 sites) to test the significance of each variable on parasite prevalence. It included all explicative variables (Table 1) as well as the quadratic term for body length because preliminary visual inspections suggested a non-linear relationship between prevalence and body length. A binomial error terms distribution with a logit link function was fitted to the prevalence data. Then, likelihood ratio tests based on PQL/L fits were performed by contrasting the full model to a model excluding the explicative variable of interest, so as to generate a χ 2 statistic and its associated p-value for each explicative variable.
Secondly, we parametrized the model on a restricted dataset including 75% of the sampling sites covering the current range of T. polycolpus (i.e. parasitized areas, Fig. 1). The parameterized model allowed predicting T. polycolpus prevalence both on the remaining 25% of the dataset covering the current range of T. polycolpus and in sites out of the range limit of T. polycolpus (Lineage III-unparasitized, Fig. 1). This procedure was reiterated 100 times to account for the uncertainty arising from the selection of the calibration dataset. The predictive power of the model was assessed at the host lineage level and at the site level within each host lineage using the mean absolute error (MAE) computed using the R package 'caret' (Kuhn 2012). We finally compared predicted prevalence between lineages using contrast tests using the R package 'lsmeans' and a binomial generalized linear mixed model (Lenth 2016).

Testing the role of the genetic background
We aimed to test whether genomic background of hosts could explain the individual infection status of T. polycolpus, and hence its range limitation in the lineage III-unparasitized (Fig. 1). In this section, models were set at the individual level, with the response variable being whether or not a given individual host is parasitized. We investigated whether some genomic structure underlined the absence of T. polycolpus in individuals from the lineage III-unparasitized. Secondly, we investigated genomic variation associated with T. polycolpus occurrence at the individual level.

Data collection and sequencing
DNA was extracted from the piece of pelvic fin as described in Aljanabi and Martinez (1997), and collected on 96 individual hosts belonging to the species L. leuciscus so as to reflect the variability of parasite prevalence and the different lineages (lineage II, lineage III-parasitized and lineage III-unparasitized). Overall, 28 individuals from the lineage II were sampled at four sampling sites belonging to the Rhone River Basin (Fig. 2) and we selected half of them with parasites and the other half without. We then sampled 28 individuals from the lineage III-parasitized at five sampling sites belonging to the Normandy River Basin (Fig. 2), and we selected half of them with parasites and the other half without. Finally, 40 individuals from to the lineage IIIunparasitized were sampled at five sampling sites localized in the Rhin, Seine and Meuse Rivers Basins (Fig. 2).
We used a paired-end restriction site-associated DNA sequencing approach (RAD-seq, Baird et al. 2008) to sequence a reduced representation of the host genome into reads of 145 bp. Briefly, DNA quality was tested using a spectrophotometer (Nanodrop ND8000), quantified using a fluorimeter and standardized to 200 ng. DNA was sheared through sonication, digested using the sbf1 enzyme for one hour at 37°C and then inactivated at 65°C. Adapters were ligated and samples were individually barcoded. Sequences were amplified through ten PCR cycles using the kit KAPA HiFi HotStart ReadyMix. Libraries were paired-end sequenced on one lane of an Illumina HISeq3000 containing 96 independent banks (one per individual) and using the chemistry V4 from kits TruSeq (2 × 125 pb). Libraries were finally quality checked using BioAnalyzer. Library preparation and sequencing were performed at the GeT-PlaGe core facility (Toulouse, France).

Data filtering and catalogue construction
We relied on the 'Stacks' pipeline (ver. 2.3.4) to both filter the raw dataset and identify SNPs genotyped at the individual level (Catchen et al. 2013). First, we removed reads with unidentified bases and/or with low quality scores (below a phred33 quality score of 20) using 'process_radtags'. PCR duplicates were also removed using 'clone_filter'. To build our catalogue of loci, alleles and ultimately SNPs from the filtered reads, we ran the 'denovo_map.pl' using as parameters value: m = 3, M = 2, n = 4 (i.e. set of parameters maximizing the number of SNPs having high likelihood according to preliminary tests). Finally, 'populations' was used to select loci with a minimum allele frequency above 3%, with a likelihood over −20, with a minimum coverage depth greater than 3, occurring in the three populations and in at least 40% of the individuals for each population. We also kept only one SNP per locus to limit linkage disequilibrium between SNPs. When applied to the whole database (i.e. 95 individuals because one individual was discarded due to its poor sequencing quality), this parameter set resulted in a catalog of 14 255 SNPs. A final filtering excluded SNPs with more than 30% of missing values from further analyzes using the library 'poppr' in R (Kamvar et al. 2014). This resulted in a final dataset of 2543 SNPs genotyped from 95 individuals.

Population genetic structure
A discriminant analysis of principal component (DAPC) was conducted using 'adegenet' (Jombart 2008) in R to first identify a potential genetic structure among populations using the 'find.cluster' function. All principal components were kept, and the best number of clusters was selected using the Bayesian information criterion (BIC). A second DAPC was run using 30 principal components (Supplementary material Appendix 1 Fig. A2) and the k number of clusters selected previously. The k − 1 discriminant functions (i.e. individuals' scores extracted along the k − 1 discriminant axis) were then extracted. We finally constructed a generalized linear model (GLM) with a quasibinomial error term linking parasite prevalence at the individual level to the k − 1 discriminant functions of the DAPC to test their significance. In this model, we also controlled for the host body length and the individual level of genomic diversity (measured as expected heterozygosity, He, calculated from allelic frequencies over all 2543 SNPs using the 'adegenet' R package). These later variables were included as continuous explicative variables (together with their respective quadratic terms) since they have been shown to be (sometimes non-linearly) related to individual parasite prevalence in this host-parasite system (Blanchet et al. 2009a).

Genome wide association study
We used a genome wide association study (GWAS) approach to test whether prevalence at the individual level was significantly related to particular SNPs. We relied on GLMs with a quasibinomial error term linking parasite prevalence at the individual level (parasitized or unparasitized, response variable) to the genotype (coded as a categorical factor: homozygote for allele a or b and heterozygote ab) of each SNP. One model per SNP marker (i.e. 2543) was run in which we controlled for population structure using the k − 1 discriminant functions described previously. We also included the host body length and He as continuous co-variables with their respective quadratic terms. Type I ANOVAs were performed to extract the p-value associated to each SNP marker. p-values were then corrected for multiple comparisons using a false discovery rate approach (FDR) and using the 'fdrtool' R package (Strimmer 2008).
The R software (ver. 3.5.3, <www.r-project.org>) was used to run statistical analyses when no other software is mentioned.

Testing the environmental suitability hypothesis
In the parasitized area, Tracheliastes polycolpus prevalence was significantly related to host body length, host species identity, temperature, hydrology, the first synthetic variable of landscape composition and the position of weirs around the sampling site (Table 2). These relationships suggested that T. polycolpus prevalence was higher in L. burdigalensis than in L. leuciscus and that T. polycolpus prevalence was higher for hosts with an intermediate body length than for hosts with a smaller or larger body length (Supplementary material  Appendix 1 Table A2). Furthermore, T. polycolpus prevalence was higher in cold sites localized in small rivers and preferentially associated with agricultural lands rather than warm sites localized in large rivers and associated with forest or seminatural lands (Supplementary material Appendix 1 Table A2, Fig. A1). Finally, fragmentation by weirs had a significant impact on T. polycolpus prevalence, with higher prevalence expected in sites in which a weir was present downstream (compared to other positions of weirs or no weir at all, Table 2 and Supplementary material Appendix 1 Table A2).
Computation of model MAE indicated an overall good predictive power of the model, at the lineage level  0.151, Fig. 3) and at the site level within most host lineages (ranging from 0.258 in lineage II to 0.367 in lineage VI). In the five host lineages localized in the current range of T. polycolpus, prevalences predicted from our model were very close to observed prevalences (except in the lineage VI, Fig. 3). However, in the unparasitized area of the lineage III (i.e. lineage III-unparasitized, Fig. 3), the predicted parasite prevalence was high (0.36) and similar to the prevalence values observed and predicted in areas and lineages in which T. polycolpus is currently present (i.e. L. burdigalensis, lineage II, lineage III-parasitized, Fig. 3). Specifically, in lineage III-unparasitized, the predicted parasite prevalence was not significantly different neither from the prevalence predicted in the lineage III-parasitized (Contrast test, p = 0.220, Supplementary material Appendix 1 Fig. A3) nor from the prevalence predicted in the lineage II (Contrast test, p = 0.102, Supplementary material Appendix 1 Fig. A3). This indicated that the environment was not unsuitable for T. polycolpus in the lineage III-unparasitized.

Testing the role of the genetic background
The clustering analysis based on 2543 SNPs revealed that the best clustering was achieved for k = 3 (Supplementary material Appendix 1 Fig. A4). The two first axes of the DAPC performed for k = 3 and based on 30 PCA components summarized 67.3% of the total variance (Fig. 4). The first axis discriminated the two dace lineages (i.e. lineage II and lineage III), as well as individuals of the lineage III belonging to the parasitized area from individuals of the same lineage but belonging to the unparasitized area (Fig. 4 and Supplementary material Appendix 1 Fig.  A5). Interestingly, the second axis discriminated individuals localized in the parasitized areas from individuals localized in the unparasitized area (lineages II and lineage III-parasitized versus lineage III-unparasitized, Fig. 4). This suggests that the genomic background of dace localized in the unparasitized area is different from the genomic background of dace belonging to the lineage II, but also from dace belonging to the lineage III but localized in the parasitized area. Furthermore, we found that both the first and second discriminant functions had a significant effect on individual infection status (F = 9.750; df = 88,1; p = 0.002 and F = 77.830; df = 88,1; p < 0.001 respectively). Finally, when controlling for population structure, the GWAS revealed 90 SNPs significantly associated to T. polycolpus prevalence measured at the individual level for L. leuciscus (at a FDR threshold of 5%, Supplementary material Appendix 1 Table A3).

Discussion
We identified factors limiting the spatial distribution of an emerging parasite in its invasive range. Combining a species distribution model and population genomics approaches we suggest that host characteristics, such as the genomic background of individuals living in the unparasitized area, are more likely than environmental features to explain the current spatial distribution of Tracheliastes polycolpus, and hence why it has not invaded some areas.
We first highlighted a complex role of environmental factors involving multifactorial drivers of T. polycolpus spatial distribution. Climate is one of the most studied drivers in parasite distribution (Kutz et al. 2005, Barrett et al. 2012, Caminade et al. 2019. We consistently found a significant effect of water temperature. In particular, cold sites were associated with high prevalence, which matches the biology of this parasite species and also the host habitat preferences , Franke et al. 2019. However, we also found significant effects of various environmental drivers such as host related factors (host body length and host species), the river topography, the surrounding landscape composition and the anthropogenic fragmentation. Overall, these later findings corroborate previous findings on this species. For instance, as in Blanchet et al. (2009a), we found that parasite attachment is favored in hosts with intermediate body length, probably because of a trade-off between the surface available to the parasite to anchor and age-related ability of hosts to resist/tolerate the parasite (Cardon et al. 2011). Similarly, we found that sites localized in small rivers with low stream velocity (negative effect of the 'Hydrology' variable, Supplementary material Appendix 1 Fig. A1) and with a weir downstream were associated to high prevalence. These effects were expected as the free-living infective stage of this parasite requires low water velocity for development (Loot et al. 2004). More generally, such a multifactorial environmental determinism of T. polycolpus distribution strongly supports the growing view that integrative distribution models accounting for multiple environmental drivers, including biological interactions, are particularly powerful for predicting the spatial distribution of parasites (Lafferty 2009, Giannini et al. 2013, Marcogliese 2016. Based on predictions from this species distribution model, we further showed that environmental suitability was unlikely to explain the range limitation of T. polycolpus. Prevalence levels predicted from environmental conditions characterizing the area in which the parasite is currently absent were similar to those characterizing areas in which the parasite is recurrently observed. This suggests that these conditions are suitable for T. polycolpus. One could argue that in unparasitized French river basins we may have missed some key environmental variables better predicting the occurrence of the parasite. Nonetheless, our species distribution model showed very good predictive performances for all other host lineages highlighting a very good predictive power. It is noteworthy that predicted parasite prevalences were underestimated for the host lineage VI. Interestingly, the lineage VI is also the most phylogenetically differentiated lineage of L. burdigalensis and is actually considered as a separate endemic species by some authors (L. bearnensis, Kottelat and Freyhof 2007 but see Costedoat et al. 2014). Discrepancy between predicted and observed prevalence in this phylogenetically divergent  (b) and (c) second axis of the DAPC. Each vertical line corresponds to the contribution of a unique SNP. SNP whose value is above the grey horizontal line are SNP contributing to more than 2‰ to the genomic differentiation between clusters. The SNPs are identified as follows: loci number_position of the SNP. lineage as well as in the host lineage III-unparasitized suggest that other factors than environmental suitability, notably host genomic background, could determine patterns of T. polycolpus prevalence, and consequently its range limitation.
Using a population genomic approach based on 2543 SNPs, we showed that host genomic background is associated with T. polycolpus distribution, both at the individual and population levels. We first revealed a marked host population structure in L. leuciscus that spatially matches the distribution of T. polycolpus. We showed that the lineage III proposed by Costedoat et al. (2014) is actually separated into a parasitized 'sub-lineage' localized in the Normandy River Basins (Fig. 2) and unparasitized 'sub-lineage' in the Rhin-Seine-Meuse River Basins. Previous studies on the genetic structure of the Leuciscus species complex were based on microsatellite, allozyme and mitochondrial markers, and covered a more extended spatial extent but with fewer sampled individuals per location (Costedoat et al. 2006(Costedoat et al. , 2014. Here, the finer genetic structure that we detected probably results from the higher resolution of the SNP panel we used, as well as from the more intensive sampling of the Leuciscus species within lineages. For technical reasons, we focused on a subset of T. polycolpus host distribution in France (Fig. 1, 2). We can thus speculate that within-lineage genomic structure of host populations could be a more general mechanism underlying patterns of T. polycolpus prevalence across France (e.g. within the lineage II, where southern sites sampled also display null prevalence, Fig. 1). We further found that 90 SNPs were associated with individual infection status, even when controlling for population structure. Both the genetic uniqueness of host populations localized in the unparasitized area and the association of SNPs with individual infection status suggest that T. polycolpus range limitation is due to local host resistance, hence providing a strong support to the 'host genomic background hypothesis'.
The emergence of resistance against T. polycolpus in northeastern river basins is an intriguing question and two main hypotheses can be formulated: one related to macro-evolutionary processes and the other one related to recent and rapid adaptation of hosts to this invasive parasite. The 'macroevolutionary' hypothesis assumes that resistance to T. polycolpus emerged during the differentiation of Leuciscus lineages over time. Costedoat et al. (2006) dated the differentiation of the lineage III back to the Pleistocene (~500 000 yr ago), with the colonization of river basins of northern France from eastern Europe. The differentiation between the lineage IIIparasitized and lineage III-unparasitized (and hence the emergence of resistance) may have consequently occurred during late Pleistocene, probably after the former Channel River had retracted, hence causing a loss of connectivity between Normandy River Basins and Rhin-Seine-Meuse River Basins (which occurred ~10-12 000 yr ago, Antoine et al. 2007, Hugueny et al. 2011, Dias et al. 2014. Alternatively, the recent and rapid adaptation scenario assumes rapid adaptive changes in hosts along with the introduction of T. polycolpus in France during the 1920s (Rey et al. 2015). Some examples of rapid evolution of resistance to disease in wild host populations have been documented (Duffy and Sivars-Becker 2007, Bonneaud et al. 2011, Epstein et al. 2016), but these examples remain rare. In the L. leuciscus-T. polycolpus system, adaptive evolution toward resistance cannot be excluded since we found significant SNP-infection status associations, but whether emergence of resistance is due to drift or to adaptive evolution, and whether this occurred far in the past or very recently are questions that would need to be fully tested.
Overall, our results demonstrated the importance of combining approaches for predicting the distribution range of co-evolving and interacting species. Species distribution modelling approaches focusing on parasite species have, so far, mainly focused on environmental factors (Ostfeld et al. 2005, Caminade et al. 2019). However, our results suggest that predictions of future parasite distribution (such as northern range expansion as often predicted in the context of climate change, Altizer et al. 2013, Carter 2018, when only based on environmental drivers should be carefully interpreted. Here, we indeed provided an empirical example in which the presence of a parasite would have been expected in an area where the parasite actually never achieved to infect local hosts. This suggests that ecological features alone are not sufficient to explain the range and potential spread of emerging parasites (Pérez-Rodríguez et al. 2013, Anderson 2017. Furthermore, the genomic-based approach allowed demonstrating that the host genomic background was likely a critical factor for explaining T. polycolpus distribution both at the population and the individual levels. Population genomic approaches allow accounting for evolutionary processes leading to parasite resistance such as local adaptation, which, from our results, appears to be a crucial process to take into account in order to improve predictions regarding the potential spread of emerging pathogens. To conclude, our study illustrates the usefulness of integrative approaches to reveal the determinants of emerging pathogens at large spatial scales as well as to predict future spread.