Electronic Supplementary Material distantia: an open-source toolset to quantify dissimilarity between multivariate ecological time-series

There is a large array of methods to extract knowledge and perform ecological forecasting from ecological time‐series. However, in spite of its importance for data‐mining, pattern‐matching and ecological synthesis, methods to assess their similarity are scarce. We introduce distantia (v1.0.1), an R package providing general toolset to quantify dissimilarity between ecological time‐series, independently of their regularity and number of samples. The functions in distantia provide the means to compute dissimilarity scores by time and by shape and assess their significance, evaluate the partial contribution of each variable to dissimilarity, and align or combine sequences by similarity. We evaluate the sensitivity of the dissimilarity metrics implemented in distantia, describe its structure and functionality, and showcase its applications with two examples. Particularly, we evaluate how geographic factors drive the dissimilarity between nine pollen sequences dated to the Last Interglacial, and compare the temporal dynamics of climate and enhanced vegetation index of three stands across the range of the European beech. We expect this package may enhance the capabilities of researchers from different fields to explore dissimilarity patterns between multivariate ecological time‐series, and aid in generating and testing new hypotheses on why the temporal dynamics of complex‐systems changes over space and time.

Longitude Latitude Figure A1: Map of pollen sites used in this case study.

Data preparation
First, we use prepareSequences() to apply a Hellinger transformation to the pollen counts, which helps to balance the relative importance of abundant and non-abundant pollen types.

Computation of dissimilarity among sequences
Second, we use workflowPsi() to compute the dissimilarity between each one of the 36 combinations of sites, and workflowNullPsi() to compute the probability of obtaining a given ψ on restricted permutations of the input sequences (null model of the given ψ). The restricted permutation is applied locally, and independently by column. On each data-point, the algorithm decides randomly to either leave it as is, or switch it with the data-point above or below it. The outcome allows to assess the probability of obtaining the observed ψ value by chance on the non-permutated sequences. The result of both functions is shown in Table A2.   Figure A2: Clustering of the Eemian pollen sites based on the dissimilarity matrix.  Figure A3: Similarity between pollen sites. Colors represent latitude, and edge thickness represent similarity, computed as 1/dissimilarity to highlight stronger similarities. Figure A6 shows that there is a large cluster of sites that share similarity with Klentia Stara (K_S), a closely related cluster composed by Krumbach I (K_I) and Jammertal (Jmm), and an isolated group, Achenhang (Ach), which is highly dissimilar to the other sites, probably due to the presence of two hiatuses in the sequence (Grüger 1983).

Relationship between dissimilarity and geographic distance between sites.
We hypothesize that pollen assemblages during the Eemian follow the main axiom of geography, and therefore, sites that are close to each other should have higher similarity than sites that are far apart from each other. To evaluate this hypothesis, we first have to compute the geographic distances between sites from their coordinates and differences in elevation. Second, we model dissimilarity as a function of distance, difference in elevation, and difference in latitude with a GLM with Gaussian errors. Model selection is based on AIC values of the model considering all variables and the models with each variable.   Dist. (km) Figure A4: Relationship between dissimilarity and difference in elevation between pollen sites.

Comparing two sites
The sites Jammertal and Achenhang, are among the sites with a more disproportionate relationship between dissimilarity (ψ = 2.507, p = 0.986), distance (188 kilometres) and elevation difference (233 m). Considering their geographical positions, why are these sites that different? The function work-flowImportance allows the assessment of the drop in dissimilarity (in percentage of the dissimilarity when all variables are considered) when a given variable is removed from the data. Variables with a positive psi-drop contribute positively to the dissimilarity between two sequences (the compared sequences become more similar when the variable is removed). The code below shows how the function is applied to both sequences.   between Achenhang and Jammertal. The samples of these sequences are not aligned in depth, and therefore potential shape distortions may apply when comparing them.

Functional dynamics of three populations of Fagus sylvatica 1.2.1 Data
For this example we selected three sites that, according the EU Forest database (Mauri et al. 2017), are mono-specific stands of Fagus sylvatica. Two are located very close to the coast, one in Northern Spain and another in Southern Sweden, while the remaining one has a continental setting in Western Germany.  Figure A7: Monthly EVI, rainfall, and temperature of three Fagus sylvatica forests in different countries.

Comparison of complete sequences
The general objective is to understand how the dynamics of the system varies among sites. A first approach consists of comparing the three complete sequences. Below we apply a comparison by shape using the workflowPsi() function, and compute ψ values on restricted permutations of the datasets to assess to what extent the dissimilarity between sequences can be due to chance.  According to the ψ values, the sites from Germany and Sweden have the highest similarity, followed by Spain and Germany. The comparison between Spain and Sweden yielded a ψ value above the null expectation (average of ψ values computed on restricted permutations of the data), and therefore the results of this comparison can be considered unreliable. It is important to take in mind that the restricted permutation applied by workflowNullPsi() only moves randomly selected datapoints one cell up or down, independently by column, and therefore it may yield relatively high p values and low null expectations even when the signal is relatively strong. Therefore, p values and null expectations should be considered as general indicatives of the strength of the relationship, and not as a representation of statistical significance.

Contribution of each variable to dissimilarity
Below we use the function workflowImportance() to assess how the variables temperature, rainfall, and EVI contribute to the dissimilarity between the dynamics of the three populations.  The table above shows the effect on dissimilarity of removing one variable at a time from the compared datasets. EVI plays an important role in increasing the dissimilarity between the German and the sites, but these have been confirmed to be dissimilar beyond the null expectation, and therefore the output of the importance analysis might not be accurate.

Comparison of dynamics year to year
How does the dissimilarity between the dynamics of each pair of sites change over the years? Were they equally dissimilar each year within the study period? If so, what variables contributed the most to the dissimilarity?
The code below groups the data by site and year, computes ψ for each combination of these, and organizes the output to represent it graphically.

Sensitivity of the dissimilarity metrics implemented in distantia
The distantia package implements an elastic dissimilarity measure with two distance-minimization options: elastic (comparison by shape) with diagonal and non-diagonal search of the least cost path, and a lock-step (comparison by time). It also implements several distance metrics, of which Euclidean and Manhattan are the most general ones. Furthermore, depending on the nature of the data, input sequences can be transformed with one of the following options: "scaled" (scaling and centering), "proportion", "hellinger" (square root of the proportions).
In this section we explore the effect of these different options on the outcome of the dissimilarity analyses that can be carried out with the distantia package. The analyses will be based on two example datasets provided with the package: climate, and pollenGP. The former is a climatic dataset with 800 observations of 4 variables with different units (average temperature and rainfall, and temperatures of the warmest and coldest month) generated by a palaeoclimatic simulation (Willeit et al. 2019).
The data transformations applied to this dataset will be "none" (the data with their natural values), and "scaled" (scaling and centering through the scale() function of the R software). PollenGP is a pollen dataset with 200 observations of 40 pollen types (Woillard, 1979). In this case we use the data transformations "proportion" (relative abundance of each taxa) and "hellinger" (square root of the proportions).
Two different scenarios of potential differences between the datasets and modified copies of themselves will be explored.
In Scenario 1 we increase the differences between a dataset and its copy by modifying an increasing number of datapoints (intersection between a row and a column). This analysis involves creating a copy of a dataset, selecting a variable, selecting a case, changing its value, and comparing the original dataset with its copy through different combinations of method (elastic diagonal, elastic nondiagonal, lock-step), transformation ("none" and "scaled" for climate, "proportion" and "hellinger" for pollenGP), and distance metrics (Manhattan and Euclidean). The process is repeated a 30 times for each number of datapoints modified, and the mean and standard deviation of the resulting ψ values are reported.
In Scenario 2 we remove an increasing number of randomly selected rows of the copy dataset to assess how differences in sample size impact the dissimilarity between two otherwise identical datasets through combinations of method (lock-step and elastic with and without diagonals), transformation ("none" and "scaled" for climate, "proportion" and "hellinger" for pollenGP), and distance metrics (Euclidean and Manhattan).

Scenario 1: effect of number of modified data-points
We plot changes in ψ mean and standard deviation across number of data-points modified for each method and combination of transformation and distance metric. We found that lock-step and elastic-diagonal methods to compute dissimilarity are practically equivalent, specially when the data are scaled and the differences between datasets are small. Both methods were, as expected, more sensitive to dissimilarity than the elastic-orthogonal method, especially when differences between the datasets were small, and the distance metric used was Euclidean. The adaptive nature of the elastic-orthogonal method also increased the standard deviation of ψ values with larger differences between the compared datasets.

Scenario 2: effect of difference in sample size between the compared sequences
The analysis of sensitivity of elastic measures to changes in the number of rows of one of the compared datasets shows that ψ increases exponentially with increasing number of samples removed from the copy dataset. This finding indicates that, even when two sequences have the same overall dynamics, their dissimilarity scores can be highly inflated, particularly when one of the sequences has less than half the number of samples of the other sequence. The orthogonal method showed to be less sensitive to this effect, while transformation and distance metric seemed to be irrelevant.