Phylin 2.0: Extending the phylogeographical interpolation method to include uncertainty and user‐defined distance metrics
Abstract
Estimating geographical ranges of intra‐specific evolutionary lineages is crucial to the fields of biogeography, evolution, and biodiversity conservation. Models of isolation mechanisms often consider multiple distances in order to explain genetic divergence. Yet, the available methods to estimate the geographical ranges of lineages are based on direct geographical distances, neglecting other distance metrics that can better explain the spatial genetic structure. We extended the phylogeographical interpolation method (phylin) in order to accommodate user‐defined distance metrics and to incorporate the uncertainty associated with genetic distance calculation. These new features were tested with simulated and empirical data sets. Multiple distance matrices were generated including geographical, resistance, and environmental distances to derive maps of lineage occurrence. The new additions to this method improved the ability to predict lineage occurrence, even with low sample size. We used a regression framework to quantify the relationship between the genetic divergence and competing distance matrices representing potential isolation processes that are subsequently used in the interpolation process. Including uncertainty in tree topology and the different distance matrices improved the robustness of the variograms, allowing a better fit of the theoretical model of spatial dependence. The improvements to the method increase its potential application in other fields. Accurately mapping genetic divergence can help to locate potential contact zones between lineages as well as barriers to gene flow, which has a broad interest in biogeographical and evolutionary studies. Additionally, conservation efforts could benefit from the integration of genetic variation and landscape features in a spatially explicit framework.
1 INTRODUCTION
Estimating geographical ranges of intraspecific evolutionary lineages is crucial to understanding the evolutionary history of a species, for testing biogeographical hypotheses, and aids conservation decisions by taking into account evolutionary potential (Carvalho et al., 2017; D'Amen, Zimmermann, & Pearman, 2013; Dinis et al., 2019; Storfer, Murphy, Spear, Holderegger, & Waits, 2010). Spatial predictions of genetic divergence have been a valuable tool, particularly when research is inherently spatially explicit, as is the case with biogeography, conservation prioritization, and landscape genetics. A model of spatial genetic divergence can be used to predict the most probable genetic group for an observation without molecular information available (Dinis et al., 2019) or to include an evolutionary layer in a conservation design strategy (Carvalho et al., 2017). If the model is able to incorporate spatial processes affected by different factors (e.g., landscape or climate) rather than geography alone, the potential applications expand. The model can be applied directly onto landscape genetics inferences to spatialize possible isolation mechanisms and can be useful for climate impact analyses on genetic divergence. However, genetic divergence is often summarized as pairwise genetic distances between samples, which increases the difficulty in translating the information to spatially explicit layers. Some methods developed to predict lineage occurrence include column‐wise interpolation of genetic distance matrices (Miller, Bellinger, Forsman, & Haig, 2006; Vandergast, Bohonak, Hathaway, Boys, & Fisher, 2008), ecological niche modelling (Rosauer, Catullo, Vanderwal, & Moussalli, 2015; Vasconcelos, Brito, Carvalho, Carranza, & James Harris, 2012) and phylogeographical interpolation (phylin; Tarroso, Velo‐Antón, & Carvalho, 2015).
The spatial dependence of genetic divergence is estimated in phylin through the use of an empirical variogram where all sample pairs are represented, which allows fitting a theoretical model that is used to derive the spatial interpolation. phylin thus uses genetic distances between every combination of two samples and includes long distance information into the interpolation procedure, instead of only analysing neighbour sample pairs (Vandergast et al., 2008). Additionally, the method neither needs to generate midpoints between samples for representation of the genetic divergence (Vandergast et al., 2008), nor to generate models of the species' ecological niche (Rosauer et al., 2015). The phylin method shares similarities with generalized dissimilarity models (GDMs; Ferrier, Manion, Elith, & Richardson, 2007). Both methods relate genetic dissimilarity to spatial distance but, whereas phylin uses a geostatistical approach by modelling the semivariance at different distance lags, GDM uses a regression that relates each pairwise genetic distance to the corresponding pair in the predictor matrices. Spatial interpolation with GDM is often used to represent dissimilarities after dimension reduction using an ordination analysis and it is based on the predictions by the regression plane (Ferrier et al., 2007). Conversely, phylin uses a model of covariance to derive weights for each sample at the location where a prediction will be estimated. The phylin method is not limited by the lack of independence of pairwise distances as in regression‐based methods (Clarke, Rothery, & Raybould, 2002) and produces maps of occurrence based on multiple thresholds in the tree (Tarroso et al., 2015) without additional methods. However, phylin assumes that genetic similarity between populations decreases with increasing geographical distance, as predicted by the model of isolation by distance (IBD; Wright, 1943).
Although it performs well under most circumstances (with reasonable and spatially homogeneous sampling), the method does not account for habitat suitability and geographical barriers conditioning demographic, ecological and evolutionary processes, which ultimately drive spatial structure of genetic diversity and lineage ranges (McRae, 2006; Spear, Balkenhol, Fortin, McRae, & Scribner, 2010; Wang, 2013).
Recent advances in models of genetic isolation have improved the IBD model with spatial and habitat heterogeneity, by using either resistance distances (IBR; McRae, 2006) or environmental distances (IBE; Nosil, Egan, & Funk, 2008; Wang & Bradburd, 2014). The IBR model calculates pairwise distances taking into account the resistance offered by the landscape to the migration of individuals, and it is computed with least cost paths (Vignieri, 2005) or using circuit theory algorithms (McRae, 2006; Spear et al., 2010). The IBE model is independent of geographical distances and landscape resistance, and correlates genetic differentiation with environmental differences (Wang & Bradburd, 2014).
Quantifying the contribution of isolation to shaping genetic diversity can be achieved with several methods that are mostly regression‐based. These methods can relate a pairwise genetic distance matrix (e.g., FST, cophenetic distance) to multiple dissimilarity matrices constructed from landscape and environmental variables (Balkenhol, Cushman, Storfer, & Waits, 2015; Balkenhol, Waits, & Dezzani, 2009), and are currently the dominant framework for testing landscape genetics hypotheses (Shirk, Landguth, & Cushman, 2018). The relationship between landscape and genetic differentiation is usually identified with Mantel tests or Mantel‐like regressions such as the multiple regression on distance matrices (Legendre, Lapointe, & Casgrain, 1994; Wang, 2013) or after applying an ordination technique (Balkenhol et al., 2015; Legendre & Fortin, 2010). Another family of methods includes the covariance between samples in linear mixed models (Jaffé et al., 2016; Lanes et al., 2018), or generalized least squares (GLS; Jha, 2015), in order to meet the sample independence assumption. These methods are particularly accurate when used in combination with maximum‐likelihood population effects (MLPEs) to account for the non‐independence inherent to pairwise distance matrices (Clarke et al., 2002; Shirk et al., 2018; Van Strien, Keller, & Holderegger, 2012). Using such regression with matrices related to isolation models allows testing competing hypotheses underlying the mechanism of isolation and quantifying them through the coefficients obtained in the regression equation (Balkenhol et al., 2015; Shirk et al., 2018).
Incorporating resistance and environmental distances into the phylin method adds the potential to derive more accurate predictions of the ranges of lineages, which, in turn, widens the application of the method to fields such as phylogeography, conservation and landscape genetics.
In this study we describe and test new additions to the phylin package, including the increased flexibility to incorporate user‐defined distance metrics, and accounting for the uncertainty in genetic distances when generating the empirical variograms. Specifically, we aim to (a) test the accuracy of prediction of the geographical ranges of lineages under different distance metrics (geographical, resistance and environmental) using simulated data; (b) evaluate the effect of the sampling size on the accuracy of model prediction when using the three distance metrics and (c) apply the method to an empirical data set to predict lineage ranges and to identify the major factors driving the spatial genetic structure.
For the empirical analysis, we used the Mediterranean pond turtle, Mauremys leprosa, as a case study. The biogeographical history of this species is well studied, and exemplifies the effect of landscape variables (the Atlas Mountains in Morocco and the Strait of Gibraltar) offering different degrees of resistance to gene flow and shaping the current genetic structure at intraspecific level (Veríssimo et al., 2016).
2 METHODS
2.1 New additions to phylin
(1)
(2)
(3)
We improved the method in the current version of phylin (version 2.0) to include two important new features. First, it allows use of the uncertainty associated with the genetic dissimilarity matrix to build the variogram. Uncertainty can be derived from multiple genetic distance matrices calculated from the posterior distribution of trees or from any other uncertainty metric associated with the genetic distance. The uncertainty is added when calculating the semivariance for each distance lag h (Equation 1). In this case, multiple estimated values for semivariance can be obtained for each distance lag according to the genetic distance matrices and, thus, semivariance can be estimated with a central tendency metric and a confidence interval. Building the variogram with this information facilitates the process of model fitting as it allows a better estimation of the limits of the semivariance at each distance lag. Second, phylin now allows more flexibility when defining the distance metric used to represent the spatial relationships between samples. The distance lag h is usually based on the geographical distance between samples (Isaaks & Srivastava, 1989; Tarroso et al., 2015), although other distance metrics are often used to explain genetic divergence (Balkenhol et al., 2009). Using distance metrics other than straight‐line geographical distances has implications for the way that the variogram and spatial interpolation are calculated. The distance lag h in the variogram needs to be calculated in terms of the new distance, as must the two other matrices used in the interpolation process (Equation 3): the pairwise distance between samples and the distance from sample locations to the grid coordinates where interpolation is calculated.
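The idea of summarizing semivariance across several genetic distance matrices can be sketched as follows. This is an illustration in Python, not the phylin (R) implementation. It assumes the usual geostatistical estimator applied to genetic distances, γ(h) = Σ g_ij² / (2 n_h), and uses the median with a min-max envelope as a simplified stand-in for the central tendency metric and confidence interval. All function names are hypothetical.

```python
from statistics import median

def empirical_variogram(spatial, genetic, lag_width, n_lags):
    """Semivariance per lag, assuming gamma(h) = sum(g_ij^2) / (2 * n_h)."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(spatial)
    for i in range(n):
        for j in range(i + 1, n):
            k = int(spatial[i][j] // lag_width)  # index of the lag class
            if k < n_lags:
                sums[k] += genetic[i][j] ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

def variogram_with_uncertainty(spatial, genetic_matrices, lag_width, n_lags):
    """(min, median, max) of the semivariance per lag across all matrices."""
    runs = [empirical_variogram(spatial, g, lag_width, n_lags)
            for g in genetic_matrices]
    summary = []
    for k in range(n_lags):
        values = [r[k] for r in runs if r[k] is not None]
        summary.append((min(values), median(values), max(values)) if values else None)
    return summary

# Toy data: three samples on a line and two alternative genetic distance matrices
# (standing in for distances derived from two trees of a posterior sample).
spatial = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
genetic_a = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
genetic_b = [[0.0, 2.0, 4.0], [2.0, 0.0, 2.0], [4.0, 2.0, 0.0]]

summary = variogram_with_uncertainty(spatial, [genetic_a, genetic_b], 1.0, 3)
print(summary)
```

Lags that contain no sample pairs are reported as `None`, so that an empty distance class is not mistaken for zero semivariance.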
The 'krig' function in the phylin package was rewritten and now accepts an additional function for calculating the distance between location pairs. This function can generate the default Euclidean distances or any other custom distance, including, but not limited to, least‐cost paths or resistance distances. The user‐defined distance function is first used to produce a distance matrix that, together with the genetic dissimilarity matrix, generates the variogram. It is used again in the interpolation process within the 'krig' function to derive the weight of each sample point in the prediction. To illustrate this new feature, we use resistance distances, as they are frequently used to explain genetic patterns in landscape genetics studies (Balkenhol et al., 2015; McRae, 2006; Noguerales, Cordero, & Ortego, 2016; Spear et al., 2010; Velo‐Antón, Parra, Parra‐Olea, & Zamudio, 2013). To demonstrate the flexibility of the new function we also use environmental distances and the distances resulting from fitting geographical, environmental and resistance distances to genetic distances using a GLS regression with MLPE (Clarke et al., 2002).
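The role of a pluggable distance function can be illustrated with a small sketch. It is written in Python and does not reproduce the signature of the 'krig' function. The `detour` metric is a hypothetical stand-in for a least-cost or resistance distance: any path crossing a vertical barrier must pass through a single gap. The same function produces both matrices needed for interpolation: sample to sample and sample to grid.

```python
import math

def euclidean(p, q):
    """Default straight-line distance between two (x, y) locations."""
    return math.dist(p, q)

def make_detour_distance(barrier_x, gap):
    """Hypothetical custom metric: crossing x = barrier_x requires passing through `gap`."""
    def detour(p, q):
        crosses = (p[0] - barrier_x) * (q[0] - barrier_x) < 0
        if not crosses:
            return math.dist(p, q)
        return math.dist(p, gap) + math.dist(gap, q)
    return detour

def distance_matrices(samples, grid, dist_fn=euclidean):
    """Build the sample-to-sample and sample-to-grid matrices with one metric."""
    sample_to_sample = [[dist_fn(p, q) for q in samples] for p in samples]
    sample_to_grid = [[dist_fn(p, g) for g in grid] for p in samples]
    return sample_to_sample, sample_to_grid

samples = [(-1.0, 0.0), (1.0, 0.0)]
grid = [(0.5, 0.0)]

ss_euclid, sg_euclid = distance_matrices(samples, grid)
ss_detour, sg_detour = distance_matrices(
    samples, grid, make_detour_distance(barrier_x=0.0, gap=(0.0, 1.0)))

print(ss_euclid[0][1], ss_detour[0][1])  # the barrier lengthens the effective distance
```

Passing the metric as an argument keeps the variogram and the interpolation consistent, because both are computed from distances expressed in the same units.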
The phylin package is written in R programming language and is freely available as an open‐source package in the Comprehensive R Archive Network (CRAN) repository.
2.2 Simulated data
We simulated two environmental surfaces (ESs) and a response surface. The first ES represents a permeable barrier to dispersal, which could be compared to a landscape feature similar to a river or an elevation depression; the second represents an ecological gradient (e.g., climate) with two different habitats located at each side of the barrier. The barrier effect is used to simulate a process generating IBR (McRae, 2006), whereas the ecological gradient is used as a process generating IBE (Nosil et al., 2008; Wang & Bradburd, 2014). The response surface represents the potential distribution of a virtual species, calculated as a weighted combination of the first ES plus an autocorrelated surface and a spatially independent normal error, then converted to binary presence/absence data with a threshold. The final data set has a size of 41 × 41 cells on the extent −1 to 1 in both axes (additional details on the simulated surfaces are provided in Figure S1).
(4)
(5)
(0, 0.01). The combination scales the distance matrices to the range [0,1], and each contributes equally to the simulated genetic distance matrix. Other scaling methods are commonly used, for example standardization to Z‐scores (e.g., Velo‐Antón et al., 2013), but our preferred method avoids generating negative values in the genetic distance matrix. Thus, the resulting genetic structure combines the contributions from isolation by distance, resistance and environment, as it is often assessed in landscape genetic studies (Antunes et al., 2018; Noguerales et al., 2016; Wang, 2013).
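A minimal sketch of the scaling step is given below, in Python and for illustration only. Each matrix is rescaled to [0, 1] by min-max scaling and the matrices are then averaged with equal weights. Whether the original combination is a mean or a sum is an assumption here, and the error term is left out. Function names are hypothetical.

```python
def scale01(matrix):
    """Min-max scale all entries of a distance matrix to the range [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

def combine_equally(matrices):
    """Average the scaled matrices so that each contributes equally (assumed mean)."""
    scaled = [scale01(m) for m in matrices]
    n = len(scaled[0])
    return [[sum(m[i][j] for m in scaled) / len(scaled) for j in range(n)]
            for i in range(n)]

geo = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
res = [[0.0, 4.0, 10.0], [4.0, 0.0, 10.0], [10.0, 10.0, 0.0]]

combined = combine_equally([geo, res])
print(combined)
```

Because every scaled entry is non-negative, the combined matrix cannot contain negative distances, which is the property that motivates this scaling over Z-scores.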
We converted the genetic distance matrix into an ultrametric phylogenetic tree using the UPGMA algorithm with 'hclust' function from R base version 3.4.4 (R Core Team, 2018) and then converted it to a class 'phylo' with the ape package version 5.0 (Paradis, Claude, & Strimmer, 2004). Four lineages were derived from the phylogenetic tree using the full occurrence data set. Although in this case there is no biological reason to choose any number of lineages, four lineages allow us to fully test our method due to the presence of simulated divergence and contact on each side of the barrier. Moreover, the lineages' ranges provide enough locations to test the effect of sample size. For the following analysis with simulated data, the genetic distance matrices were constructed as the square root of cophenetic distances of a pruned tree with random sampling of presence locations. As De Vienne, Aguileta, and Ollier (2011) demonstrated, cophenetic distances are equivalent to the squared Euclidean distances and thus the square root of these distances restores the Euclidean properties of the distance matrix and helps to achieve better variograms, particularly at shorter geographical distances.
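The conversion from a distance matrix to square-root cophenetic distances can be sketched as follows. This is a compact pure-Python UPGMA written for illustration, not the 'hclust' and ape workflow used in the study. The cophenetic distance of two leaves is taken as the height at which their clusters merge, and helper names are hypothetical.

```python
import math

def upgma_cophenetic(dist):
    """Cophenetic distances of the UPGMA tree built from a symmetric matrix."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaves
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps the unweighted pair-group definition.
            new_d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = merged
        next_id += 1
    return coph

genetic = [[0.0, 2.0, 6.0], [2.0, 0.0, 8.0], [6.0, 8.0, 0.0]]
coph = upgma_cophenetic(genetic)
sqrt_coph = [[math.sqrt(v) for v in row] for row in coph]
print(coph)
print(sqrt_coph)
```

In this toy example the first two samples merge at height 2 and join the third at height 7, so the ultrametric tree replaces the original distances 6 and 8 with their average.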
2.3 Empirical data
The study area comprises the entire distribution of Mauremys leprosa (12.19°W, 11.21°E, 27.70°N, 43.81°N; see Figure S2), as defined by the occurrence data from the Spanish (Pleguezuelos, Màrquez, & Lizana, 2004, http://siare.herpetologica.es/bdh/distribucion) and Portuguese atlases (Loureiro, Ferrand, Carretero, & Paulo, 2008) of amphibians and reptiles, and from previous publications (Veríssimo et al., 2016), but discarding the few populations in southern France which have been admixed with accidental or intentional introductions from Africa (Palacios et al., 2015; Veríssimo et al., 2016). The Moroccan populations of M. leprosa contain all the genetic lineages found within the species, with two of them expanding to Tunisia and the Iberian Peninsula, respectively (Veríssimo et al., 2016). Genetic diversity in this species is highly structured by the Atlas Mountains, which act as a barrier to gene flow. The Atlantic Ocean and the Mediterranean Sea (hereafter referred to as seawater) act as a permeable barrier impeding interchange of individuals between Europe and Africa. At the Strait of Gibraltar, where the seawater is at its narrowest, the barrier is permeable and allowed the rapid colonization of the Iberian Peninsula with a single sub‐lineage (Veríssimo et al., 2016).
We used the mitochondrial‐based phylogenetic tree with 163 samples produced by Veríssimo et al. (2016). We collected geographical coordinates (WGS84) for each sample with 1,000 m resolution (Veríssimo et al., 2016), and generated the genetic distance matrix from the phylogenetic tree using cophenetic distances.
We downloaded elevation data from the CGIAR Consortium for Spatial Information (http://srtm.csi.cgiar.org/) and Mean Temperature of the Warmest Quarter (MTWQ) from WorldClim (http://worldclim.org/; Hijmans, Cameron, Parra, Jones, & Jarvis, 2005) with 30 arcseconds spatial resolution. To distinguish between land masses and water, we considered all cells in the study area that did not have elevation data as water. We upscaled all spatial variables to 0.09° (~10 km) by averaging the values, using the Geospatial Data Abstraction Library (http://www.gdal.org) and the rgdal package version 1.2‐18 (Bivand, Keitt, & Rowlingson, 2018) in R language.
2.4 Spatial analyses
We predict that the typical inclusion of the phylin method in a landscape genetics framework will follow three steps: (a) optimization of a resistance surface, (b) quantification of the contribution of different isolation mechanisms, (c) interpolation of the genetic entities by means of phylin. We illustrate this framework with the simulated and empirical data sets when predicting a lineage's probability of occurrence with strong landscape barriers. The simulated data serve a triple purpose: demonstrating basic usage of the phylin package, testing the spatial congruence of the predictions, and testing the effect of sample size on the accuracy of the predictions.
We sampled the simulated data set by randomly selecting 200 occurrences and derived the following pairwise distance matrices: (a) the geographical distance calculated as the Euclidean distance between sampled locations; (b) the environmental distance using the Euclidean distance between the values of the second ES; (c) a resistance distance using the conductance surface calculated from the first environmental surface as described above; and (d) a fitted distance from the predictions of a GLS regression of the genetic distances over the geographical, resistance and environmental distances, with an MLPE correlation structure (Clarke et al., 2002) using the nlme version 3.1 package (Pinheiro, Bates, DebRoy, Sarkar, & R Core Team, 2017) and the corMLPE package (https://github.com/nspope/corMLPE). The expected genetic dissimilarity is given by the regression line when taking into account the different distances calculated (Clarke et al., 2002). The common linear regression assumes independent observations, an assumption that is not met with pairwise observations. However, MLPE incorporates random intercepts that accommodate the dependence of pairwise observations, thus allowing a likelihood‐based regression framework (Clarke et al., 2002; Jha, 2015). We used raw distance matrices in the GLS instead of mean‐centred or standardized distances for two reasons: first, our aim was to obtain a linear equation that allowed fitting the original distance matrices to the observed genetic dissimilarity, rather than comparing predictor variable coefficients; and second, standardizing variables would result in fitted distances with negative values, which would pose problems to the usage of the resulting matrix as distance in the following analyses.
We built variograms for each of the three types of distance matrices (geographical, environmental, resistance) and a fourth one with the fitted values from the GLS regression. We used the typical parameters to fit a model to the variogram: the range and nugget values were manually assigned to each variogram and the sill automatically fitted to the best value. We interpolated the four lineages to the full area of the simulated occurrences using the three distance matrices individually and a fourth one that is the estimated genetic dissimilarity by the regression of the previous distances.
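Fitting only the sill while the range and nugget are fixed reduces to a one-parameter least-squares problem with a closed-form solution. The sketch below, in Python, shows this for the spherical model. It is not phylin's fitting routine, and it treats the sill as the partial sill (the variance above the nugget), which is an assumption about the parameterization.

```python
def spherical(h, sill, range_, nugget=0.0):
    """Spherical variogram model; `sill` is the partial sill (assumed convention)."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + sill
    r = h / range_
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)

def fit_sill(lags, semivariances, range_, nugget=0.0):
    """Least-squares partial sill for a fixed range and nugget."""
    shape = [spherical(h, 1.0, range_, 0.0) for h in lags]  # model with unit sill
    numerator = sum(f * (g - nugget) for f, g in zip(shape, semivariances))
    denominator = sum(f * f for f in shape)
    return numerator / denominator

# Synthetic check: generate semivariances from a known model and recover its sill.
lags = [0.5, 1.0, 2.0, 3.0]
observed = [spherical(h, 1.0, 2.0, 0.3) for h in lags]
fitted_sill = fit_sill(lags, observed, range_=2.0, nugget=0.3)
print(fitted_sill)
```

With the range and nugget held constant the model is linear in the sill, so no iterative optimizer is required for this step.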
We also tested the sensitivity of the phylin method to sample size. To that end, we repeated the procedure described above to predict the probability of occurrence of each lineage 16 times, each time using a random set of samples whose size varied between 50 and 800 in increments of 50. To evaluate the accuracy of each prediction, we used the true skill statistic (TSS). The TSS is a performance metric that weighs omission and commission errors independently from prevalence (Allouche, Tsoar, & Kadmon, 2006). It ranges between −1 and 1, where increasing positive values indicate increasing agreement of the predicted values with the observed, and zero or negative values indicate a performance no better than random. The TSS requires binary data, and thus, the probabilities of each prediction were classified into presence/absence using as threshold the lowest predicted probability among the sampled locations. This allows all sampled occurrences to be correctly classified. However, because we calculate the TSS using the full locations data set, known presences might be underestimated, lowering the metric value. The procedure described above was repeated 100 times for each sample size tested. We calculated the mean TSS and the 95% confidence interval using the quantile method.
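The thresholding and scoring steps can be sketched in a few lines. The example below is in Python with made-up probabilities and hypothetical function names. It takes the lowest predicted probability among the sampled presences as threshold and computes the TSS as sensitivity + specificity − 1 over the full set of locations.

```python
def classify(probabilities, sampled_presences):
    """Binary prediction using the lowest probability among sampled presences."""
    threshold = min(probabilities[i] for i in sampled_presences)
    return [p >= threshold for p in probabilities]

def true_skill_statistic(predicted, observed):
    """TSS = sensitivity + specificity - 1."""
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    fn = sum(1 for p, o in zip(predicted, observed) if not p and o)
    tn = sum(1 for p, o in zip(predicted, observed) if not p and not o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Six locations: four true presences, of which only locations 0 and 2 were sampled.
probabilities = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
observed = [True, True, True, True, False, False]

predicted = classify(probabilities, sampled_presences=[0, 2])
tss = true_skill_statistic(predicted, observed)
print(predicted, tss)
```

The unsampled presence at location 3 falls below the threshold and is scored as an omission, which is how evaluating against the full data set can lower the TSS.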
Similarly to the simulated data, we tested the influence of the different isolation processes (isolation by distance, by resistance and by environment) with real world data based on the spatial patterns of genetic diversity in M. leprosa using three distance matrices: (a) a geographical distance based on the Euclidean distance between sample locations, (b) a resistance distance based on elevation data and water presence; and (c) an environmental distance as the Euclidean distance between the values of MTWQ at the sampled locations. We used the square root of cophenetic distances from the phylogenetic tree as the dependent variable.
We constructed a conductance surface, which is the reciprocal of resistance (van Etten, 2017; McRae, 2006), and used it as input to the package gdistance (van Etten, 2017) to derive the resistance distance matrix. We converted elevation data to conductance values by optimizing two parameters of the logistic function (Equation 4), the curve steepness and the inflection point, while the upper asymptote was maintained at 1 (maximum conductance). An optimization procedure using genetic algorithms allowed us to efficiently optimize a broad range of parameters within moderate execution times (Peterman, 2018; Scrucca, 2017). We used the function 'ga' in the GA R package version 3.0.2 (Scrucca, 2017) with constraints limiting the search within reliable intervals (steepness and inflection point within [−0.01, 0] and [0, 2000], respectively) to estimate the best parameters for the logistic function. We set the curve steepness parameter as negative to force lower conductance at higher altitudes. In this optimization procedure, we searched for the global optimum of the parameters of the logistic function, that is, the parameters resulting in resistance distances that maximize the Pearson correlation with the genetic distances. For each iteration of the optimization algorithm, we generated a conductance matrix from which we derived the resistance distance matrix with the ‘commuteDistance’ function from the gdistance package using the mean as transition function for eight neighbouring cells. The transition matrix was corrected for the North‐South direction with the function ‘geoCorrection’ due to the usage of geographical coordinates. The lowest value of conductance achieved from application of the logistic function was attributed to the seawater presence in the study area. Thus, the final conductance layer had information for both land and water. We repeated the optimization procedure for 20 replicates. Optimal parameters were obtained from the replicate with the highest correlation score.
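The elevation-to-conductance transformation can be sketched as below. The example is in Python and assumes the standard two-parameter logistic form 1 / (1 + exp(−b(x − m))) with upper asymptote 1, steepness b and inflection point m. The parameter values are those reported in the Results; the elevation values are invented, and cells without elevation data (`None`) stand for seawater.

```python
import math

def logistic_conductance(elevation, steepness, inflection):
    """Assumed logistic form with upper asymptote 1 (maximum conductance)."""
    return 1.0 / (1.0 + math.exp(-steepness * (elevation - inflection)))

def conductance_surface(elevations, steepness, inflection):
    """Conductance per cell; water cells (None) receive the lowest land value."""
    land = [logistic_conductance(e, steepness, inflection)
            for e in elevations if e is not None]
    water_value = min(land)
    return [water_value if e is None
            else logistic_conductance(e, steepness, inflection)
            for e in elevations]

b = -2.21e-3   # curve steepness (negative: conductance drops with altitude)
m = 416.58     # inflection point, in metres

surface = conductance_surface([0.0, 416.58, 3000.0, None], b, m)
print(surface)
```

At the inflection point the conductance is exactly 0.5, and the water cell inherits the value of the least conductive land cell, so crossing water is costly but never impossible.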
To evaluate the influence of geography, environment and resistance on the genetic differentiation, we fitted multiple GLS models with MLPE covariance structure. We included univariate models with each distance matrix, every combination pair of the distance matrices and the three distance matrices, resulting in seven models (Table 1). We ranked the models by the Akaike information criterion (AIC) value and tested the significance of the log‐likelihood ratio.
| Predictors | β coefficients | df | AIC | ΔAIC | w | logLik. ratio | p value |
|---|---|---|---|---|---|---|---|
| geo + env + res | −0.064, −0.033, 0.945 | 6 | 21,927.55 | | 0.999 | | |
| geo + env | 0.562, 0.273 | 5 | 30,315.46 | 8,387.91 | 0.000 | 8,389.91 | <0.05 |
| geo + res | −0.063, 0.933 | 5 | 21,941.97 | 14.41 | 0.001 | 16.41 | <0.05 |
| env + res | −0.03, 0.893 | 5 | 21,991.69 | 64.13 | 0.000 | 66.13 | <0.05 |
| geo | 0.628 | 4 | 31,001.88 | 9,074.33 | 0.000 | 9,078.33 | <0.05 |
| env | 0.532 | 4 | 35,129.07 | 13,201.52 | 0.000 | 13,205.52 | <0.05 |
| res | 0.882 | 4 | 22,003.49 | 75.94 | 0.000 | 79.94 | <0.05 |
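The ΔAIC and weight (w) columns of Table 1 follow from the AIC values alone. The short Python sketch below recomputes them with the standard Akaike weight formula, w_i = exp(−Δ_i / 2) / Σ exp(−Δ_j / 2). It is given only as a check on how the ranking is obtained, and differences in the last decimal relative to the table reflect rounding of the reported AIC values.

```python
import math

def akaike_weights(aic_by_model):
    """Return (delta AIC, Akaike weight) dictionaries for a set of models."""
    best = min(aic_by_model.values())
    delta = {name: aic - best for name, aic in aic_by_model.items()}
    relative = {name: math.exp(-0.5 * d) for name, d in delta.items()}
    total = sum(relative.values())
    weights = {name: r / total for name, r in relative.items()}
    return delta, weights

# AIC values as reported in Table 1.
aic = {
    "geo + env + res": 21927.55,
    "geo + env": 30315.46,
    "geo + res": 21941.97,
    "env + res": 21991.69,
    "geo": 31001.88,
    "env": 35129.07,
    "res": 22003.49,
}

delta, weights = akaike_weights(aic)
for name in sorted(aic, key=aic.get):
    print(f"{name:16s} dAIC = {delta[name]:10.2f}  w = {weights[name]:.3f}")
```

Nearly all of the weight falls on the full model, with the remainder on the geo + res model, in agreement with the w column.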
To account for uncertainty in the phylogenetic tree topology, we randomly sampled 500 out of the 10,001 available trees found in the posterior distribution from Veríssimo et al. (2016). For each of these trees, we calculated the square root of cophenetic distances using the ape package in R (Paradis et al., 2004), and used them to create the variogram. The variogram model parameters (sill, range and nugget) were estimated heuristically to achieve the best representation of the spatial structure. The interpolation area was defined with a 10‐km spatial resolution grid, covering the known distribution of M. leprosa in the Iberian Peninsula, Morocco, Algeria and Tunisia. We used the optimized parameters (see Results) to generate the resistance distances based on the altitude raster with the gdistance package (van Etten, 2017) as described above.
3 RESULTS
3.1 Simulated environment
The simulated occurrence data set contained 840 locations after converting the continuous response surface to binary response using a threshold value of 3.63 (classifies 50% of the available cells). Four lineages were defined by the phylogenetic tree (nlin1 = 222, nlin2 = 189, nlin3 = 203, nlin4 = 226), with lineages 1 and 2 occurring below the barrier and lineages 3 and 4 above. As expected due to the configuration of the environmental surfaces, connectivity (as depicted from conductance values) was low across the barrier, forcing higher similarity between lineages occurring on the same side of the barrier (Figure 1).

Variograms of each distance type were fitted with spherical models with the exception of environmental distance for which a Gaussian model was more suitable. The range (a) and nugget (c0) values were chosen heuristically to produce the best fit (a_geo = 2, c0_geo = 0.3, a_env = 5, c0_env = 0.65, a_res = 19, c0_res = 0, a_fit = 0.95, c0_fit = 0). The variograms showed a strong spatial structure of the genetic distance and the theoretical models had a good fit to the empirical data (Figure S3). Variograms based on both resistance and fitted distances showed the effect of a strong barrier at middle distances, where genetic variance rapidly increases between pairs at opposite sides of the barrier.
The interpolated areas differed with the type of distance used (Figure 2). Environmental distances, with the smallest contribution to the genetic distance, showed a pattern of predictions with a low level of spatial aggregation of the probability, yet separating the upper and lower environments. The three other types of distance showed similar predictions, although geography alone provided a smoother interpolation around occurrences. The remaining two, resistance and fitted distances, showed that average to high probabilities were constrained by environmental features that constitute a barrier. With these distances a potential corridor through the barrier was also identified (Figure 2).

The accuracy of the predictions in relation to sample size was generally high for all types of distance used (Figure 3). The lowest values of TSS resulted from the interpolations calculated with environmental distances, which had the least influence in the simulated genetic distances. As seen in the spatial interpolations (Figure 2), environmental distance failed to predict the other lineage in the same environment, contributing to the lower TSS. Geographically based interpolations achieved high TSS scores but with high uncertainty for the smaller sample sizes, whereas resistance and fitted distances showed very high TSS scores with low uncertainty.

3.2 Empirical environment
The optimized replicates showed low variation in the correlation and the estimated parameters (see Figure S4). The optimized parameters used to derive the resistance from elevation data using a logistic function were b = −2.21 × 10^−3 and m = 416.58, for curve steepness and inflection point, respectively, corresponding to the sixth replicate with correlation γ = 0.73218. The lowest conductance obtained from altitude after optimization was 1.97 × 10^−3 and was attributed to the water area. This sets a very high resistance while allowing passage through narrow areas such as the Strait of Gibraltar (a few cells of seawater). The collinearity among predictors measured with Pearson's correlation score was γ_res/geo = 0.69, γ_res/env = 0.41 and γ_env/geo = 0.22. The model with the three predictors was always significantly better than models with two or single variables (Table 1). Analysis of the coefficients from the GLS regression shows that resistance distances explain more of the genetic distance variance than the remaining variables. Additionally, models including resistance distances have lower AIC scores. Nevertheless, univariate models showed that all components individually have a positive correlation with the genetic distances and resistance distance is the one with the highest correlation (Table 1).
The variograms based on geographical and resistance distances were fitted with spherical models (Figure 4; a_geo = 13, c_geo = 1.4, c0_geo = 0.05; a_res = 9, c_res = 1.55, c0_res = 0). The difference between the variograms using geographical and resistance distances occurred mostly at the longest distances. When based on geographical distances, the semivariance at larger distance lags tends to have smaller values than with the resistance‐based variogram. Thus the use of resistance paths increased the distance of samples that are geographically close, forcing the reshaping of the variogram. This is evident on distances between samples occurring in Tunisia and eastern Iberia, which were relatively close geographically but most distant when considering resistance. Similarly, samples between Tunisia and southern Morocco that belong to the same lineage, and thus were genetically similar, had the highest geographical distance between them.

The prediction of each lineage occurrence using geographical distances was smoother across the study area than using resistance‐based distances (Figures 5 and 6). Although the spatial pattern of the probability of occurrence was generally similar between the two interpolation methods, the area with predicted higher probability for lineage B3 increased when considering resistance distance, whereas for lineage A3 it decreased. The lack of sampling for lineage B3 in Algeria influenced the interpolation using geographical distances and allowed a higher probability of occurrence of the Iberian lineage A3 in northern Africa due to the proximity of samples in southeastern Iberia to this region. This was counterbalanced in the resistance‐based interpolation by promoting the connectivity between Tunisian and Moroccan samples, rather than direct connections between Europe and Africa when the water barrier was ignored. The probable presence of lineage A2 was largely reduced in the resistance‐based interpolation due to lack of sampling.
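The effect of swapping the distance metric can be illustrated with a deliberately simplified sketch. It interpolates a 0/1 lineage indicator at one unsampled location using inverse-distance weights, which are substituted here for the variogram-based weights of the actual method to keep the example short. The four samples, their distances and the function name `idw_indicator` are hypothetical.

```python
def idw_indicator(dists_to_samples, lineage_of_sample, lineage, power=2.0):
    """Inverse-distance-weighted mean of the 0/1 indicator for `lineage`."""
    num = den = 0.0
    for d, lin in zip(dists_to_samples, lineage_of_sample):
        if d == 0:
            # Target coincides with a sample: return that sample's indicator.
            return 1.0 if lin == lineage else 0.0
        w = d ** -power
        num += w * (lin == lineage)
        den += w
    return num / den

lineages = ["A", "A", "B", "B"]
geo = [2.0, 3.0, 4.0, 5.0]    # straight-line distances from the target to each sample
res = [9.0, 10.0, 4.0, 5.0]   # hypothetical resistance distances: A samples lie across a barrier

p_geo = idw_indicator(geo, lineages, "B")
p_res = idw_indicator(res, lineages, "B")
print(round(p_geo, 3), round(p_res, 3))
```

With straight-line distances the nearby samples of lineage A dominate the prediction. Once the barrier lengthens their distances, the weight shifts to the samples of lineage B, which is the same qualitative behaviour described for lineages A3 and B3.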


4 DISCUSSION
4.1 Method improvements
In this study we illustrate the new features in the R package phylin. We have demonstrated how accounting for environmental and resistance distances greatly improved the predictive accuracy of the method. This was shown with the use of simulated data that allowed for testing the accuracy of the predictions under multiple scenarios of isolation mechanisms and sample sizes with full knowledge of each lineage's occurrence. We also showed how the addition of resistance distances aided the spatial interpolation of the genetic divergence within the distribution of M. leprosa. In this real‐world data analysis we used multiple genetic distance matrices from the posterior distribution of phylogenetic trees to display the uncertainty range associated with the semi‐variance at each distance class. The available parameters (sill, range, nugget) allow only rigid models that might be difficult to adjust, particularly with smaller sample sizes. The plotted uncertainty range allows a better estimation of the parameters and will be particularly useful for manual adjustment of the parameters.
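The idea of an uncertainty range around the semivariance can be sketched as follows. Each genetic distance matrix (for example, one per posterior tree) yields its own empirical variogram against the same user-defined distance matrix, and the per-lag minimum, median and maximum are then collected. The semivariance definition used here (half the mean squared genetic distance per lag), the summary statistics, and all names and toy values are assumptions for illustration, not the package's implementation.

```python
def semivariogram(gen_dist, pair_dist, lag_width, n_lags):
    """Empirical semivariance per lag: half the mean squared genetic distance
    of the sample pairs whose separation falls in that lag (None if empty)."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(gen_dist)
    for i in range(n):
        for j in range(i + 1, n):
            k = int(pair_dist[i][j] // lag_width)
            if k < n_lags:
                sums[k] += gen_dist[i][j] ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

def semivariance_envelope(gen_dist_list, pair_dist, lag_width, n_lags):
    """Per-lag (min, median, max) semivariance across several genetic matrices."""
    curves = [semivariogram(g, pair_dist, lag_width, n_lags) for g in gen_dist_list]
    envelope = []
    for k in range(n_lags):
        vals = sorted(c[k] for c in curves if c[k] is not None)
        if not vals:
            envelope.append(None)
            continue
        mid = len(vals) // 2
        median = vals[mid] if len(vals) % 2 else 0.5 * (vals[mid - 1] + vals[mid])
        envelope.append((vals[0], median, vals[-1]))
    return envelope

# Toy data: four samples, one spatial distance matrix, two genetic matrices.
D = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 1], [5, 4, 1, 0]]
G1 = [[0, .2, 1.0, 1.2], [.2, 0, .8, 1.0], [1.0, .8, 0, .2], [1.2, 1.0, .2, 0]]
G2 = [[0, .4, 1.2, 1.0], [.4, 0, 1.0, 1.2], [1.2, 1.0, 0, .4], [1.0, 1.2, .4, 0]]

envelope = semivariance_envelope([G1, G2], D, lag_width=2, n_lags=3)
for lag, band in enumerate(envelope):
    print(lag, band)
```

A wide band at a given lag signals that the fitted model is weakly constrained there, which is where manual adjustment of sill, range and nugget benefits most from seeing the range.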
In our simulated data, the genetic variation was linearly related to spatial and environmental variation. The direct consequence was that genetic variation was present even at small geographical distances that otherwise would be homogenized through generations, creating spatial autocorrelation typical of biological data (Balkenhol et al., 2015). On the variogram, this effect was shown as a nonzero semivariance at the smallest distance lags (nugget effect). To further increase the realism of the simulations, genetic distances could be derived from the explicit simulation of movement and reproduction through generations of individuals under the resistance scenario (Balkenhol et al., 2015; Landguth & Cushman, 2010; Rebaudo et al., 2013), although it would heavily increase computing time. To counterbalance our straightforward simulations, we used a real case study with a negligible nugget effect. The lack of a nugget effect is particularly noticeable when using phylogenetic trees inferred from a few genes that provide limited genetic variation. However, there are situations where high semivariance is expected at short geographical distances. For instance, when sampling is biased towards a contact zone or a geographical barrier between allopatric/parapatric lineages, high genetic distance is expected at short geographical distances. This would render a variogram to which it would be very difficult to fit a model. However, if a barrier or an ecological gradient determines the contact between lineages, including resistance distances would increase the distance between divergent samples and thus facilitate the fit of the model to the variogram.
The calculation of resistance distances is dependent on the spatial properties of the study such as the spatial extent of the sampled locations, study area features and spatial resolution used (van Etten, 2017; McRae, 2006), whereas geographical distances between sampling locations have simpler calculation procedures. The additional load in processing with resistance distances often forces the spatial resolution of the analysis to be decreased, which can lead to suboptimal solutions (Shirk, Wallin, Cushman, Rice, & Warheit, 2010). This is because increasing the cell size often involves allocating samples of close locations to the same cell. As such, resistance distance between them would be considered equal to zero, resulting in those location pairs being collapsed at the lower left corner of the variogram. This problem can be minimized by using high‐resolution grids, but at the cost of increased processing time. Therefore, we recommend that a sensitivity analysis is performed to find the best trade‐off between cell size and processing time.
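One inexpensive ingredient of such a sensitivity analysis is to count, for each candidate cell size, how many sample pairs fall in the same grid cell and would therefore receive a resistance distance of zero. The sketch below does this for a square grid anchored at the origin; the coordinates, cell sizes and function names are hypothetical.

```python
from collections import Counter

def pairs_sharing_cell(coords, cell_size):
    """Count sample pairs that fall in the same square grid cell."""
    cells = Counter((int(x // cell_size), int(y // cell_size)) for x, y in coords)
    return sum(n * (n - 1) // 2 for n in cells.values())

def sensitivity(coords, cell_sizes):
    """Fraction of all sample pairs collapsed to zero distance, per cell size."""
    total = len(coords) * (len(coords) - 1) // 2
    return {s: pairs_sharing_cell(coords, s) / total for s in cell_sizes}

# Hypothetical sample coordinates in map units.
samples = [(0.2, 0.3), (0.8, 0.4), (1.6, 0.2), (3.1, 2.9), (3.4, 3.3), (7.5, 6.1)]

for size, frac in sensitivity(samples, [0.5, 1, 2, 4]).items():
    print(size, round(frac, 3))
```

Plotting this fraction against the processing time measured at each resolution gives a simple basis for choosing the coarsest grid that still keeps most sample pairs in distinct cells.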
Sample size also had an impact on the predictive accuracy of each lineage occurrence. In general, a higher number of samples led to increased mean accuracy and reduced variance. This was expected because the randomly created data sets with low sample size often discard relevant information to correctly fit the variogram. For low sample sizes, the prediction method using resistance distances achieved higher than mean predictive accuracy. This was also expected given that with a lower number of occurrences there is a higher probability of discarding samples located at the edges of the lineage distributions, which are the most informative for correct interpolation of a lineage's limits using geographical distances alone. Regardless of the sample size, the method using environmental distance achieved the lowest predictive accuracy. The simulated ecological gradient favoured two different environments on each side of the barrier, and without any other information, the separation of lineages on each side was not possible. Although the interpolation with geographical distances accurately predicted the occurrence of lineages on each side of the barrier, it also attributed a high probability of lineage occurrence in the barrier due to lack of samples and no information about habitat preference. On the other hand, when including resistance, the barrier was well depicted, with high probability of presence being only predicted in an area within the distribution of each lineage.
The phylogeographical interpolation using geographical distances allows potential contact zones between lineages to be identified (Tarroso et al., 2015). The resistance distances are related to the landscape friction to movement and thus provide further information on the potential location of geographical barriers. Because the interpolation of lineage occurrence is calculated by weighting the indicator function of lineage presence, the lowest probabilities for one lineage occur in the neighbourhood of the other lineages (Tarroso et al., 2015). However, as no samples are available at the barrier per definition, all lineages are predicted to occur there with the same low probability due to the interpolation process. The product of each lineage's probability of occurrence will highlight the shared area with low probabilities that allows us to identify an area of high resistance that is impeding gene flow and strengthening the hypothesis of a potential barrier (Figure S5). Locating a potential barrier by mapping its area, in conjunction with the spatial probability of lineage occurrence, offers a path to locate potential corridors connecting populations. Adding resistance distances to the phylin method helps to identify which lineages use the corridor. This information is important to maintain diversity and to increase the resilience to climate change (Abrahms et al., 2017; Heller & Zavaleta, 2009). For an accurate location of the barriers, more informative genetic markers should be used together with exhaustive sampling. The presence of corridors connecting both sides of the barrier was also illustrated in our simulations. The configuration of the environmental surfaces created a potential corridor through the barrier connecting lineages 2 and 4. This is shown in the occurrence predictions of both lineages (Figure 2).
4.2 Biological implications
The distribution of M. leprosa is known to be restricted to terrestrial and freshwater habitats from low to medium altitudes. Both altitude and sea water increase the resistance to dispersion for this species (Veríssimo et al., 2016). The optimization procedure determined the parameters for a logistic model of resistance that provided the best fit for the presence of the species and the genetic data. However, two sampling effects might interact to bias the optimization. First, the sampling distribution might not reflect the full extent of the distribution and, thus, resistance might be overestimated at locations where the species might occur but observations are not available. Second, because the optimization is performed in relation to the genetic structure, only samples with molecular information are used, which might inflate the previous effect. Expert knowledge might provide better estimation of the parameters in those cases.
The genetic structure of M. leprosa was better explained by the isolation by resistance model. When tested individually against genetic distances, all three isolation models partially explained the genetic divergence but the resistance isolation model showed the highest coefficient. This was expected as topography and the seawater barrier greatly impacted the biogeographical history and contemporary patterns of genetic diversity and structure of the species (Veríssimo et al., 2016).
While the use of geographical distances tended to create smoother transitions, the inclusion of a resistance surface allowed for more accurate identification of areas limiting the distribution of lineages, such as barriers to dispersion. Altitude forced sharper limits in the distribution of lineage probabilities, and the inclusion of seawater with high resistance values funnelled the pairwise distances between Iberian and African samples through the Strait of Gibraltar. This reinforced the connectivity between samples, even in the presence of substantial sampling gaps in Algeria. The populations of M. leprosa are highly structured with most of their diversity occurring in Morocco (Veríssimo et al., 2016). The connectivity within the B3 lineage was difficult to assess due to lack of sampling in Algeria (Veríssimo et al., 2016). The effect of existing genetic samples in southeastern Spain, which are geographically closer to northern Africa despite being separated by the Mediterranean Sea, forced a lower probability of presence for lineage B3 in Algeria. When including seawater as a barrier to gene flow, the expected connectivity between samples improved, even with the obvious lack of sampling in this region. Although the Mediterranean Sea is a permeable barrier, exchange between the two continents probably occurred at its narrowest passage in the Strait of Gibraltar (Velo‐Antón et al., 2012; Velo‐Antón, Pereira, Fahd, Teixeira, & Fritz, 2015; Veríssimo et al., 2016). Using the resistance information when calculating pairwise distances between samples forces the Strait of Gibraltar to act as the single point of connectivity between the two continents, resulting in larger distances between points separated by seawater and, thus, providing a better description of the genetic structure. Nevertheless, the use of resistance does not fully compensate for poor sampling. For instance, predictions for lineage A2 showed isolated populations with low probability of connection. Lack of sampling within the extent of this lineage, together with locations that are sympatric with other lineages, decreased the predicted probability between populations.
The heterogeneous topography created by the Atlas Mountains in Morocco was a major driver in shaping the genetic structure found in the African M. leprosa populations (Veríssimo et al., 2016). Including elevation as a resistance surface to predict the probability of occurrence of each lineage facilitated the identification of current lineage ranges and potential landscape corridors that allow connectivity between populations. However, high spatial resolution data are often needed to detect connectivity (Anderson et al., 2010). In this case, detecting dispersion through narrow corridors, such as steep valleys in the mountain system, would benefit from high spatial resolution data that will also increase computing demand. Nevertheless, at the spatial resolution of this study, some samples that are geographically adjacent became isolated when using resistance information (Figure 6). These samples are at the extremes of the lineage distribution and, thus, the range limits of their distributions are overestimated when using only geographical distances. This is less evident for lineage B1, although the easternmost population of this lineage is shown to be highly isolated from the population core.
4.3 Potential applications and final remarks
Prediction of intraspecific lineages is acknowledged as pivotal in recent conservation frameworks (Carvalho et al., 2017; Moritz, 1995; Rosauer et al., 2015). The inclusion of environmental and resistance distances in such predictions is highly beneficial because it allows a better predictive accuracy, and can further detect connection or isolation between populations (Rosauer et al., 2015). In addition, it allows us to test which landscape features shaped patterns of genetic structure and diversity within species (Wang, 2013).
The new additions to phylin allow for overcoming previous limitations of straight‐line geographical distances with the inclusion of resistance‐based distances in the interpolation process. However, the new features implemented in the phylin package are not limited to resistance distances. Any type of distance reflecting the spatial relationships between samples such as least‐cost paths or those including directionality (e.g., river networks) can be used. The method is independent of the niche of the species, but further inferences about the niche can be included depending on the user decision. For instance, niche‐defining variables (e.g., climatic, land cover or productivity indexes) can be included in the analysis with an optimization framework as we used here or through correlational distribution models that are used to derive distances between samples (Rosauer et al., 2015; Spear et al., 2010).
The potential application of this method ranges from biogeographical and evolutionary studies to conservation. Mapping geographical and environmental barriers to gene flow will help to better infer current patterns of species distributions and to delimit contact zones between divergent lineages. In the current version, the phylin method can be used to analyse temporal variation on the species/lineages distribution. If resistance or environmental distances are based on spatial layers with data for multiple time frames (e.g., past or future climate, recent land cover change, satellite imagery), the distribution of lineages can show that influence. When interpolating with phylin, the user can provide the resistance layers covering different time periods and, thus, the resulting layers will depict the impact of the change in the distribution. Conservation efforts, particularly the design of conservation strategies, are better informed when genetic variation and landscape features are integrated in a spatially explicit framework (Carvalho et al., 2017; Segelbacher et al., 2010). The current sequencing tools are providing a framework for the generation of enormous amounts of data that increase resolution at temporal, phylogenetic and spatial levels. Overall, the addition of neutral and adaptive variation and the use of variables with different temporal scales and distribution mapping in phylin will open new avenues for inference about the impact of climate change on biodiversity.
ACKNOWLEDGEMENTS
This work was funded by the European Regional Development Fund (ERDF), through the ‘Programa Operacional Factores de Competitividade – COMPETE’ and by Portuguese National Funds through the Portuguese Foundation for Science and Technology (FCT) under the project PTDC/BIA‐BIC/3545/2014. PT, SBC and GVA were funded by FCT (ref. SFRH/BPD/93473/2013, SFRH/BPD/74423/2010 and IF/01425/2014, respectively). We are immensely grateful to Adam D. Marques for extensive review of the final draft. We also thank the three anonymous reviewers who provided insightful comments that helped to improve the manuscript.
AUTHOR CONTRIBUTIONS
PT conceived the study, conducted the research and wrote the application code and the first draft of the manuscript; GVA provided the empirical data for analyses. PT, SBC and GVA contributed with suggestions to improve the analysis and participated equally in the revision of the text.
DATA ACCESSIBILITY
This paper refers to phylin version 2.0. The simulated data set is included in the package as an example with a vignette detailing the instructions to perform interpolations with resistance distances. The package is made freely available through the Comprehensive R Archive Network (http://cran.r-project.org/) and through the source code repository (https://github.com/ptarroso/phylin). The empirical example uses freely available data from Veríssimo et al. (2016).
REFERENCES




