Making virtual species less virtual by reverse engineering of spatiotemporal ecological models

The virtual species (VS) and virtual ecologist (VE) approaches are useful tools for testing different methodological aspects of species distribution modelling. However, the methods used so far to generate VS lack solutions that ensure a high degree of biological realism, taking into account the spatial and temporal variability of population densities. We have developed a method for generating dynamic VS that reconstruct their living prototypes in a realistic way. The framework consists of fitting a spatiotemporal model to real abundance data, generating a VS population from that model over the entire study area and the whole study period, calibrating the VS, and obtaining VE data by sampling from the VS. We illustrate the effectiveness of the approach with data from large-scale, long-term bird abundance monitoring, using the whinchat Saxicola rubetra as a study system. We evaluated how well the spatiotemporal model reconstructs the 'true' system by comparing response curves and population trends between those used to generate the VS (i.e. what constitutes the 'truth') and those estimated from replicated instances of VE data. In addition, we performed a sensitivity analysis to test how varying sampling effort affects the accuracy of trend estimation. The synthetic VE data thoroughly reconstructed the real monitoring data. Response curves from generalized additive mixed models (GAMMs) fitted to these two types of data showed high concordance, with coverage probabilities of the 95% confidence intervals of 87.7%–99.8% (mean 96.9%). The population trend estimated from the VE data accurately reconstructed the 'true' trend calculated from the VS (coverage probability: 82.3%). The proposed method of generating VS and VE data by reverse engineering of a spatiotemporal ecological model reproduces the properties of the original system well, substantially increasing the ecological realism of simulation results.
The method may have further applications in evaluating various modelling techniques used to study species range dynamics, where real‐world properties are of particular importance, like conservation and invasion biology or climate change impact assessment.


| INTRODUCTION
Any methodology of learning about reality is subject to an unknown error. This uncertainty makes it virtually impossible to test the effectiveness of any method because, without knowing the truth, we do not know what the reference is. Fortunately, this seemingly insoluble problem can be solved by simulations: we can generate reality, which thus becomes known, providing a much-needed point of reference.
For this reason, simulations are fundamental in statistics, allowing one to explore model properties or test the correctness of calculated estimates (Bolker, 2008). Some early ecological research concerned the use of simulated data to compare estimation methods (Moilanen, 1999) or statistical tools (Paruelo & Tomasel, 1997) in a controlled, error-free way.
The virtual species (VS) approach is an analogous concept in ecological modelling. The idea involves generating a synthetic species distribution by defining species–environment relationships, projecting them into a real or abstract geographic space, and translating them into occurrence or abundance data. The major advantage of using a VS is the possibility of taking the simulated data as truth, which creates an opportunity to test the performance and efficiency of statistical models (Hirzel et al., 2001; Meynard et al., 2019; Meynard & Kaplan, 2013; Zurell et al., 2009, 2010). VS have been widely used to study various aspects of species distribution models (Miller, 2014), such as: testing various sampling designs (Albert et al., 2010), methods for sampling bias correction (Fourcade et al., 2014; Inman et al., 2021; Ranc et al., 2017; Stolar & Nielsen, 2015; Varela et al., 2014), different modelling techniques (Elith & Graham, 2009; Hirzel et al., 2001; Qiao et al., 2015), combinations of sampling design and modelling algorithms (Fernandes et al., 2018; Gaul et al., 2020), approaches to account for spatial autocorrelation (Dormann et al., 2007) or to deal with multicollinearity (Dormann et al., 2013), estimating the effect of collinearity (De Marco & Nóbrega, 2018), or error types in abundance trends (Nuno et al., 2015).
Typically, such studies do not need to apply restrictively realistic assumptions, as the results are independent of the methodology used to design the VS. Many works use artificial species in an entirely synthetic environment (Austin et al., 2006; Crase et al., 2012; de Knegt et al., 2010; Lechner et al., 2012; Meynard & Kaplan, 2012; Santika, 2011). However, in some applications a higher degree of ecological realism is necessary. Examples of such studies include spatially explicit models concerning specific geographic locations, in which biogeographic patterns can be represented by, for example, real bioclimatic predictors (Albert et al., 2010; Barbet-Massin et al., 2012; Miller, 2014).
Another example is research concerning a specific species, in which it is important to characterize realistically its ecological niche, habitat suitability or population dynamics (Fernandes et al., 2018).
Applied and conservation ecology also requires realistic site-specific and species-specific models, so that their outputs can potentially be transferred to the real world.
Several approaches are employed to generate virtual species, including the most common niche synthesis method (Hirzel et al., 2001), the pick-mean or median method (Jiménez-Valverde & Lobo, 2007; Lobo & Tognelli, 2011) and process-based modelling (see an overview of VS approaches in Supplementary File 2). However, it is worth noting that most VS generation methods are based on presence–absence data (Table S2, Supplementary File 2); abundance data, even synthetic, have been used in few studies, for example, Dambly et al. (2021), Pagel and Schurr (2012) and Pagel et al. (2014). Insufficient access to high-quality population abundance data might explain this tendency. Additionally, most VS studies consider only the spatial dimension. Dynamic spatiotemporal methods are less common; they were used by, for example, Pagel et al. (2014), and some R packages, like RangeShiftR (Malchow et al., 2021) or poems (Fordham et al., 2021), were designed to simulate population dynamics. The implementation of both temporal and spatial components is a crucial aspect that may provide essential information on range dynamics and is particularly important if we want to understand the mechanisms of organisms' responses to the ongoing global change.
Our proposed approach is similar to the ideas of Thibaud et al. (2014) and Fernandes et al. (2018), in which the VS retains the original species–environment relationship. However, it is enhanced by several important components that increase its ecological realism, such as the possibility of modelling spatiotemporal density data, accounting for both spatial and temporal autocorrelation, and incorporating meta-parameters reflecting the nested structure of the data (Table S3, Supplementary File 2). Moreover, VS studies simulating real populations most often rely on monitoring data collected by volunteer observers, whose skills in detecting and identifying species vary (Dickinson et al., 2010). The inclusion of a realistic sampling error brings the proposed approach even closer to the original system. Finally, the proposed method introduces randomness at several levels, mirroring the random variation in the real system. This property is of particular importance, as it allows the entire procedure to be replicated, producing distributions of parameters that allow formal statistical inference.
We present a framework to simulate a VS that reconstructs its living original in a more realistic way. We propose reverse engineering of spatiotemporal models fitted to real abundance data, calibrated, and combined with a simulated observational process, to generate synthetic data that vary in both space and time in a way that resembles patterns generated by ecological surveys. To demonstrate the versatility of this concept, we used large-scale and long-term bird monitoring data as a prototype for generating VS and then showed how this approach can be applied to assess the effectiveness of a routinely used statistical technique in reconstructing original habitat response curves and population trends.

| General framework
Our proposed modelling framework is designed to accurately reconstruct the original system and consists of the following six main steps (Figure 1):
1. fitting a model to real spatiotemporal population abundance data;
2. generating a dynamic VS by simulation from the above model parameters, preserving its main properties, such as:
   a. environment–abundance relationships and the covariance structure of environmental variables (defined by the deterministic part of the model, i.e. by the vector of fixed parameters and their variance–covariance matrix),
   b. spatiotemporal autocorrelation (modelled as a smooth Gaussian process),
   c. random spatial variation (represented by random intercepts of sampling plot identifiers) and random temporal variation (random intercepts for years),
   d. distributional characteristics of abundance (defined by the random part of the model, including residual variance and dispersion);
3. applying the VE approach, mimicking the observational process by sampling the VS (replicating the original survey scheme) with an error rate corresponding to random intercepts for observer identifiers estimated in Step 1;
4. calibrating the VS by mapping its frequency distribution, using relationships between simulated and observed data;
5. fitting a model to the VE data by using the same approach as in Step 1;
6. evaluating the VE model obtained in Step 5 by comparing its predictions with the VS data obtained in Step 4.

FIGURE 1 Diagram presenting the proposed workflow.
Step 1-modelling: fitting a statistical model to real spatiotemporal abundance data.
Step 2-simulating: generating virtual species from the model's predictions with the inclusion of the main properties of the original system.
Step 3-virtual sampling: obtaining virtual ecologist data by replicating the original sampling scheme, including observational error.
Step 4-calibrating: mapping frequency distributions, using relationships between simulated and observed data.
Step 5-modelling: fitting the same statistical model as in Step 1 to the virtual ecologist data.
Step 6-evaluation: measuring the discrepancy between the expectations from the virtual ecologist model and the 'true' (i.e. virtual species) data.

| Step 1: Fitting a spatiotemporal model to the real data

| The data
To apply the proposed approach, the abundance data need to be replicated in space and time. Ever-growing database resources fulfil this requirement, the best examples being data collected under biodiversity monitoring schemes or citizen science projects (Brlík et al., 2021; La Sorte et al., 2018; Sullivan et al., 2014; Tulloch et al., 2013).
The same applies to environmental data used as predictors. The usual species distribution modelling approach of using static variables will not provide the information needed to model spatiotemporal variation in abundance. Thus, at least some, and preferably most, of the environmental information should be time-varying. Spatiotemporal environmental information is widely available in many open-access databases containing data on, for example, climate or land cover (Abatzoglou et al., 2018; Witjes et al., 2022).
To illustrate the method, we used abundance data for the whinchat Saxicola rubetra from a long-term monitoring scheme (20 years: 2000–2019) covering a large area (>300,000 km²). The whinchat is an insectivorous migratory bird that inhabits open landscapes in Europe.

The dataset originates from the Common Breeding Bird Survey in Poland (MPPL, 2022). Counts were performed on 1-km² survey plots, using the distance sampling protocol (Buckland et al., 1994, 2001, 2004). Collectively, whinchats were detected each year in 38.3%–53.3% of the plots (mean prevalence 45.5%), with a mean density of 3.8 pairs/km². The number of survey plots varied between years from 122 to 736 (on average, 517 plots per year).
Seventeen environmental predictors, obtained from three publicly available databases, were used to model species abundance.
They provide information on topography from the SRTM 90 m DEM Digital Elevation Database (Jarvis et al., 2008), land cover from ESA CCI Land Cover (Defourny et al., 2017, 2021) and climate from TerraClimate (Abatzoglou et al., 2018). The three predictors that describe topography were static, whereas the others varied in time.
All the details concerning the field survey methodology, data treatment, and the set of variables used as predictors are presented in Supplementary File 1 (Table S1).

| The model
Any modelling technique can be used to fit the spatiotemporal model. Here, we used generalized additive mixed models (GAMMs; Lin & Zhang, 1999; Wood, 2017), as they produce smooth response curves, can handle many distributions, and flexibly allow modelling of processes that improve ecological realism, such as autocorrelation and random factors. Population density (either observed in reality or observed by the VE) was used as a response, assumed to originate from a compound Poisson–Gamma (Tweedie) distribution with the log link function. This approach allows modelling of non-negative real numbers with a spike at zero (which is the case in most population density data) and can effectively account for overdispersion.
To narrow the list of predictors and facilitate interpretation, model selection was performed using a backward elimination algorithm, as recommended by Marra and Wood (2011). In our model, both species and environmental data are allowed to be dynamic, that is, to vary in both space and time:

N_i,t ~ Tw(μ_i,t, ξ, φ)
ln(μ_i,t) = Σ_{j=1..m} f_j(x_j,i,t) + γ_t + δ_i + a_i + b_t + o_i,t

Here, population density N_i,t, which is assumed to come from the Tweedie distribution with index parameter ξ and dispersion parameter φ, is indexed by site identifier i and year identifier t. The natural logarithm of the expected density μ_i,t is the sum of smooth functions f_j fitted to m predictors x_j,i,t that vary in both space and time. Additionally, the model is enhanced by including several components that represent the following processes: 1. Temporal autocorrelation (representing the population trend by allowing a smooth, serially autocorrelated transition of population density over time), modelled as a Gaussian process γ with mean f_t, which itself is modelled as a smooth function of time, and covariance matrix Σ_u, describing the temporal autocorrelation structure. The elements of matrix Σ_u are modelled as a function of between-year distance by using the Matérn correlation function c₀ with parameter ρ_t (which was set to 0.5 to reflect a short-term serial autocorrelation between observations of the time series).
2. Spatial autocorrelation (representing dispersal or large-scale habitat gradients), modelled as a Gaussian process δ with mean f_s, a smooth, 2-dimensional function of geographic coordinates (denoted as x), and covariance matrix Σ_v, describing the spatial autocorrelation structure. The elements of matrix Σ_v are modelled as a function of the geographic distance between sampling sites by using the Matérn correlation function c₀ with parameter ρ_x.
3. Random spatial variation (representing spatial variability in population densities not captured by other processes), modelled as a normally distributed random intercept a_i with zero mean and variance σ²_plot.
4. Random temporal variation (representing between-year variability), modelled as a normally distributed random intercept b_t with zero mean and variance σ²_year.
5. Observational process (representing differences between observers), modelled as a normally distributed random intercept o_i,t with zero mean and variance σ²_obs.
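Since the autocorrelated components are described only verbally, a minimal Python sketch may help. It draws one realization of a yearly random effect from a Gaussian process with a Matérn correlation of parameter 0.5, which reduces to the exponential kernel exp(−d/ρ). The variance and range values are arbitrary placeholders, not fitted estimates, and the smooth mean function f_t is omitted (zero mean) for brevity:

```python
import numpy as np

def matern_05_corr(d, rho):
    """Matern correlation with smoothness parameter 0.5: exp(-d / rho)."""
    return np.exp(-d / rho)

def draw_gp(coords, rho, sigma2=1.0, seed=0):
    """One realization of a zero-mean Gaussian process over 1-D
    coordinates (e.g. years) with covariance sigma2 * c0(|t_i - t_j|)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    d = np.abs(coords[:, None] - coords[None, :])   # pairwise distances
    cov = sigma2 * matern_05_corr(d, rho)
    return rng.multivariate_normal(np.zeros(len(coords)), cov)

years = np.arange(2000, 2020)
gamma = draw_gp(years, rho=1.0)   # serially autocorrelated yearly effects
```

The same construction extends to the spatial component by replacing the 1-D coordinate distances with pairwise geographic distances between sites.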
The model presented in the form of the ODMAP protocol (Zurell et al., 2020) can be found in Supplementary File 3.

| Step 2: Generating a virtual species (VS)
The VS was generated for the whole target area (i.e. the country) as follows: 1. First, predictions from the above model were calculated on the link scale, using the selected habitat variables, the autocorrelated spatial and temporal components, and the random effect for year. However, the random intercepts, for both sampling sites and observers, were set to zero. This was necessary because, when fitting the model, random intercepts could be estimated only for the real sampling sites, not for all sites within the target area (their distributional characteristics could, however, be estimated; see the next step). Random intercepts for observers were set to zero to remove the observational error at this stage, although it was reconstructed later (see Section 2.4 below).
2. Then, random intercepts for all squares within the study area were simulated by drawing a random sample from a normal distribution with zero mean and the variance σ²_plot estimated while fitting the GAMM. These intercepts were added to the predictions (on the link scale) calculated in the previous step.
3. Next, the inverse link function was applied, giving expected densities for each square.
4. Finally, using the estimated parameters of the Tweedie distribution, that is, the index parameter ξ and dispersion parameter φ, a single random draw was generated from this distribution for each square, using the expected densities obtained in the previous step.
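The four generation steps above can be sketched in Python. The Tweedie draw uses the compound Poisson–Gamma representation (valid for index parameter 1 < ξ < 2); all numerical values (baseline density, plot-intercept spread, ξ, φ) are illustrative placeholders rather than the fitted estimates:

```python
import numpy as np

def rtweedie(mu, xi, phi, rng):
    """Tweedie draw via its compound Poisson-Gamma representation
    (1 < xi < 2): N ~ Poisson(lam), value = sum of N Gamma terms."""
    mu = np.asarray(mu, dtype=float)
    lam = mu ** (2 - xi) / (phi * (2 - xi))    # Poisson rate
    shape = (2 - xi) / (xi - 1)                # Gamma shape per event
    scale = phi * (xi - 1) * mu ** (xi - 1)    # Gamma scale
    n = rng.poisson(lam)
    out = np.zeros(mu.shape)                   # exact zeros when no events
    pos = n > 0
    out[pos] = rng.gamma(n[pos] * shape, scale[pos])
    return out

rng = np.random.default_rng(1)
n_squares = 10_000
eta = np.full(n_squares, np.log(3.8))          # step 1: link-scale prediction
a_i = rng.normal(0.0, 0.6, n_squares)          # step 2: simulated plot intercepts
mu = np.exp(eta + a_i)                         # step 3: inverse link
dens = rtweedie(mu, xi=1.5, phi=1.2, rng=rng)  # step 4: one draw per square
```

The compound construction is what gives the generated densities a spike at zero together with continuous positive values, matching the distributional assumption of the fitted model.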

| Step 3: Virtual ecologist (VE)
The VS generated according to the above procedure was taken as truth. Next, we reconstructed the sampling process, replicating the real sampling scheme by using exactly the same set of sites, sampled at exactly the same times as in reality. The sampling error p, associated with individual observers, was generated as a random draw from a lognormal distribution with a mean of 1 and variance σ²_obs, estimated when fitting the GAMM in Step 1. This error allows virtual observers to underestimate abundance (if p < 1, due to imperfect detectability, for example) and also to overestimate it (if p > 1, for instance due to double counting).
After that, the densities of the VS 'observed' at the original sites by virtual observers were multiplied by the relevant p.
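A minimal sketch of the virtual observer error, under one common lognormal parameterization that makes the multiplier's mean exactly 1 (log p ~ N(−σ²_obs/2, σ²_obs)); the variance value is an illustrative placeholder, not the fitted estimate:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2_obs = 0.09   # illustrative observer variance (not the fitted value)
s = np.sqrt(sigma2_obs)

# Multiplier p with E[p] = 1: if log p ~ N(-sigma2/2, sigma2), the
# lognormal mean is exp(-sigma2/2 + sigma2/2) = 1 (unbiased on average).
p = rng.lognormal(mean=-sigma2_obs / 2, sigma=s, size=100_000)

# 'Observed' density = true VS density times the observer's multiplier:
true_density = np.array([0.0, 1.2, 3.8, 0.4, 2.0])   # toy VS densities
observed = true_density * p[:5]   # p < 1 under-counts, p > 1 over-counts
```

Sites where the VS is truly absent stay at zero, since the error acts multiplicatively.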
The VE data were evaluated against the real data to assess whether the above procedure can effectively reconstruct the statistical properties of the observed population densities. To do so, we compared their distributions by replicating the VE generation procedure (Steps 2–3) 1000 times and then inspected whether the patterns generated from the statistical model (VE data) mirror the patterns in the real survey data. Specifically, we calculated 95% empirical confidence intervals (CI) from the simulated distributions and checked whether they contain the truth, that is, the relevant statistic obtained from the observed data. The test statistics used were: the arithmetic mean of population density (to assess location), its standard deviation (to assess dispersion) and species prevalence (to assess the excess of zeros).
Additionally, since the above statistics represent just some aspects of the compared distributions, we also calculated empirical quantiles and regressed the simulated against the observed, producing Q-Q plots. If the compared distributions are compatible, they should not deviate substantially and systematically from the equality line (i.e. the function f(x) = x).
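The empirical-CI check described above can be sketched as follows, with toy replicate values standing in for the 1000 simulated means:

```python
import numpy as np

def empirical_ci_contains(replicates, observed_value, level=0.95):
    """True if the observed statistic lies inside the empirical
    (percentile) CI computed from replicated simulations."""
    alpha = (1 - level) / 2
    lo, hi = np.quantile(replicates, [alpha, 1 - alpha])
    return bool(lo <= observed_value <= hi)

rng = np.random.default_rng(11)
rep_means = rng.normal(3.8, 0.15, 1000)   # toy: 1000 replicated VE means
inside = empirical_ci_contains(rep_means, observed_value=3.8)
```

The same helper applies unchanged to the standard deviation and prevalence statistics.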

| Step 4: Calibrating the VS and VE data
Simulation experiments with repeated generation of VS and VE data (Steps 2–3) revealed biases in the proposed method (Figure S2, Supplementary File 2). Generally, the expected population densities seem to be correctly estimated, but the variance tends to be underestimated and the prevalence is strongly overestimated.
To correct these biases, we applied a transformation called quantile mapping (QM). This approach is routinely used in, for instance, climatological research for post-processing predictions obtained from regional climate models (Li et al., 2010;Pierce et al., 2015).
Other applications, among many others, include calibration of investment portfolios (Sgouropoulos et al., 2015) and image processing (Grundland & Dodgson, 2005). The goal of QM is to find a function that maps the quantiles of the simulated distribution onto the corresponding quantiles of the observed distribution (Equation 1). QM can effectively reduce the bias in simulation results, but it cannot fully account for an excess of zeros. Thus, applying QM alone equalizes the Q-Q plots and corrects the simulated data in terms of both location and dispersion, although the prevalence remains heavily biased (Figure S3, Supplementary File 2). This is because the process of habitat selection, which can be viewed as a function projecting habitat suitability into species abundance, returns non-negative integers (the number of individuals cannot be fractional). Hence, patches with habitat suitability well below unity actually cannot sustain a single individual, even though the expected abundance is greater than zero. For this reason, assigning zeros to very low predicted population densities seems a reasonable way of correcting the bias in prevalence.
To sum up, we propose the following transformation that maps simulated abundance S into observed abundance O:

O = f(S) = (g ∘ h)(S)    (Equation 2)

where f is the composition of two functions, h and g. The operator "∘" denotes function composition, that is, the process of applying one function to the output of another (Ross, 2013).
Here, the function g is applied to the result of applying the function h to S, where h is a smooth function fitted to the Q-Q relationship (Equation 1), and g is a step function that assigns zero to all values not greater than a threshold Q₀, the quantile of x corresponding to the proportion of zeros in the observed data.
To put it simply, the function h applies the Q-Q mapping to the simulated data, and the result x is then passed to the function g, which assigns zeros to all values in x that are not greater than Q₀, the quantile of x corresponding to a probability level equal to the probability of obtaining a zero in the observed data. This way, the fraction of the lowest values from the mapped simulated data that are set to zero complements the observed prevalence. The above transformation requires the target distribution to be known; in the proposed framework, this is the distribution of the observed survey data. To fit ŝ and ŝ⁻¹, we propose to employ QM by approximating the Q-Q function using a GAM, and to fit f we propose to apply the method described above and summarized by Equation 2. In practice, calibration of the VS (which requires a second simulation run) is not always needed. When the goal is to replicate a certain sampling scheme, generation of the VS for the whole study area and the whole study period can be skipped (with huge savings in computational time). Then, generation of the VE data is sufficient, and these can be calibrated directly by mapping onto the observed data.
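A minimal sketch of the calibration f = g ∘ h, with a linear quantile interpolator standing in for the smooth (GAM) fit of the Q-Q relationship, and toy data in place of the survey data:

```python
import numpy as np

def fit_qq_map(simulated, observed):
    """Fit the quantile-mapping function h by matching empirical
    quantiles; linear interpolation stands in for the smooth GAM fit."""
    q = np.linspace(0, 1, 101)
    qs = np.quantile(simulated, q)   # simulated quantiles (x of Q-Q plot)
    qo = np.quantile(observed, q)    # observed quantiles (y of Q-Q plot)
    return lambda x: np.interp(x, qs, qo)

def calibrate(simulated, observed):
    """f = g o h: quantile-map the simulated values, then set the lowest
    fraction to zero so that prevalence matches the observed data."""
    h = fit_qq_map(simulated, observed)
    x = h(np.asarray(simulated, dtype=float))
    p_zero = np.mean(np.asarray(observed) == 0)   # observed zero fraction
    q0 = np.quantile(x, p_zero)                   # threshold Q0
    return np.where(x <= q0, 0.0, x)              # step function g

rng = np.random.default_rng(3)
obs = np.concatenate([np.zeros(550), rng.gamma(2.0, 2.0, 450)])  # 55% zeros
sim = rng.gamma(2.5, 1.5, 1000)                                  # biased simulation
cal = calibrate(sim, obs)
```

After calibration, the prevalence of the simulated data equals the observed prevalence by construction, which is why it is excluded from subsequent CI checks.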

| Step 5: Fitting a model to the VE data
After applying Step 4, we obtained the calibrated VE data, a replication of the original data whose underlying truth is now known. Third, to demonstrate the potential of testing hypotheses concerning the observational process, we estimated population trends using subsets (varying in size) of sampling sites. We considered 10 subsets, with sizes evenly spaced (on a log scale) from 10 to 1110 (the number of sites sampled altogether). The same model was fitted to these data subsets, and the entire procedure (i.e. generating the VS, sampling it and estimating trends) was replicated 1000 times, in each iteration storing the 'true' VS trend, the estimated VE trends and their standard errors. Then, to assess trend bias and precision, we calculated: (1) the absolute deviation of the estimated trend from its 'true' counterpart; (2) the coverage, that is, the percentage of cases in which the 'true' trend was inside the 95% confidence interval of the estimated trend; and (3) the power, that is, the probability of correctly classifying the estimated trend as belonging to at least the 'moderate decline' category, which was the actual category of all generated VS (their trends ranged from −4.87 to −4.82, n = 1000). Trend classification was adopted from the Pan-European Common Bird Monitoring Scheme (Brlík et al., 2021), which defines 'moderate decline' as a change of no more than −5% per year and, simultaneously, an upper 95% confidence limit between 0.95 and 1.00.
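The three trend metrics can be sketched as follows; note that the power criterion here is simplified to "a decline is detected" (upper confidence limit below zero) rather than the full PECBMS classification rule, and the replicate values are toy numbers:

```python
import numpy as np

def trend_metrics(true_trend, est, se, z=1.96):
    """Summarize replicated trend estimates against the known 'true'
    trend: mean absolute error, CI coverage, and (simplified) power."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    lo, hi = est - z * se, est + z * se            # 95% CIs per replicate
    abs_err = np.mean(np.abs(est - true_trend))    # (1) absolute deviation
    coverage = np.mean((lo <= true_trend) & (true_trend <= hi))  # (2)
    power = np.mean(hi < 0.0)                      # (3) decline detected
    return abs_err, coverage, power

# Toy replication experiment: unbiased estimates around a -4.8%/yr trend.
rng = np.random.default_rng(5)
true_t = -4.8
est = rng.normal(true_t, 0.4, 1000)
se = np.full(1000, 0.4)
abs_err, cov, pwr = trend_metrics(true_t, est, se)
```

With unbiased, correctly specified standard errors, the coverage of a 95% interval should sit near 0.95, which this toy run reproduces.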

| Step 6: Model evaluation
Reconstructed response curves were compared with those used to generate the VS by calculating the coverage probability (the percentage of the time when the 95% CI of the fitted response curve contains the true value). The same procedure was applied to evaluate estimated population trends. The 'true' population trend was calculated as the slope of the yearly log-transformed population density means taken from the VS. Trend estimates were multiplied by 100 and therefore represent a percentage of yearly population change.
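A sketch of the 'true'-trend calculation described above (slope of log-transformed yearly means, multiplied by 100), using a toy series with an exact 5% yearly decline:

```python
import numpy as np

def percent_trend(yearly_mean_density):
    """Slope of log-transformed yearly means, x100, i.e. approximately
    the percentage of yearly population change."""
    y = np.log(np.asarray(yearly_mean_density, dtype=float))
    t = np.arange(y.size)
    slope = np.polyfit(t, y, 1)[0]   # least-squares slope on the log scale
    return 100.0 * slope

# Exact 5% yearly decline over 20 years: slope equals log(0.95).
dens = 3.8 * 0.95 ** np.arange(20)
trend = percent_trend(dens)
```

On the log scale the slope equals log(1 + r) rather than r itself, so for small rates the ×100 slope closely approximates the percentage change per year.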
The proposed algorithm for generating the synthetic data is capable of introducing random variation at three levels: (1) while generating random intercepts for all squares within the target area, (2) when drawing a random sample from the population density distribution and (3) when allocating a random sampling error to individual observers. As a result, multiple VS and VE datasets can be generated while preserving the distributional characteristics of the prototype model. Therefore, statistical inference becomes possible, as all the statistics used for model evaluation come with associated probability distributions. To implement this, we replicated the above procedure 1000 times and consequently report the resulting summary statistics together with their 95% Monte Carlo confidence intervals (Buckland, 1984).

| Reconstructing response curves
In our whinchat case study, the VS was generated using 10 variables (out of 17) selected by the backward elimination algorithm while fitting a GAMM to the real data. When applying the same procedure to the VE data (replicated 1000 times), 5–14 variables were selected (mean 9.3, 95% CI: 7–12), of which on average 3.0% (95% CI: 0.0–17.6) were false positives (variables selected as significant at the 0.05 alpha level although they were not actually used to generate the VS). In addition, 7.0% (95% CI: 0.0–17.6) were false negatives (variables used to generate the VS that were incorrectly selected as insignificant). Hence, the mean correctness of variable selection by the backward elimination procedure was 89.9% (95% CI: 70.6–100.0). Considering only correctly classified variables, the concordance of fitted and 'true' response curves was high: the mean coverage rates ranged, depending on the variable, between 87.7% and 99.8% (Figures 4b and 5).

| Reconstructing population trends
Both the 'true' VS population abundance data and the 'observed' VE data indicate the same temporal pattern (Figure 6a). The mean coverage rate of the estimated abundance index was 82.3% (95% CI: 19.9–100.0). The VS population is declining dramatically: the 'true' long-term rate of change was −4.80% per year. The mean 'observed' trend estimated from replicated VE data was −4.96%, and its 95% CI (from −5.58% to −4.01%) contained the 'true' rate.

| Effect of sampling effort on trend estimation
With increasing sample size, the mean absolute error rate decreases from about 2% (for a sample size of 10 sampling plots) to virtually zero (when there are more than 1000 sampling sites). At the same time, the variance of the estimated trend decreases with increasing sampling effort, as indicated by the shrinkage of the confidence intervals (Figure 7a). To be highly certain that the estimated absolute error rate of the trend is not greater than 1%, one needs to survey several hundred sites. There is a clear, albeit not very large (approximately 2%), undercoverage of trend estimates when the number of sites is low (tens or several dozens). When more than 100 sites are sampled, undercoverage is no longer an issue (Figure 7b). Power analysis indicates that about 100 sampling sites seem to be sufficient to classify the trend correctly (Figure 7c). To sum up, this analysis shows that when the study objective is just to classify the trend and to ensure its fair coverage (as in most conservation-oriented work), at least 100 sites seem to be sufficient. However, when the study objective is to estimate the population trend precisely (information that is crucial for building demographic models, making population projections, etc.), then several hundred sites need to be surveyed.

FIGURE 3 Comparison of parameters of population density simulated from the model (grey) against the observed parameters (black). (a) Probability density distribution of log-transformed simulated virtual ecologist data plotted against the density distribution of observed data (black outline). (b) Q-Q plot: grey lines are the simulated quantiles plotted against the observed quantiles, dashed lines represent the 95% empirical confidence interval (CI), and the black line is the reference (i.e. if the distribution of simulated data is compatible with that of the observed data, quantiles do not deviate substantially from this line). (c) Frequency distribution of the simulated mean (grey bars), 95% CI around this mean (vertical black dashed lines), and the observed mean (solid black line). (d) Frequency distribution of the simulated standard deviation, along with the 95% CI and the observed reference. The frequency distribution of simulated prevalence is not shown, as after applying the calibration it equalled the observed value by definition.

| DISCUSSION
We provide a description of a framework designed to generate a VS whose population varies in space and time in a way that thoroughly resembles its prototype. Moreover, on a large-scale and long-term dataset we demonstrate the usefulness of this approach for testing the efficiency of statistical tools that are routinely used for inferring about various ecological phenomena.

FIGURE 6 (a) The 'true' population trend (black solid line) and fitted trends (multiple grey lines). The shaded region results from overlaying the trends fitted to replicated instances of virtual ecologist data. Dashed grey lines indicate the 95% confidence interval (CI) for replicated trends. The abundance is relative, i.e. expressed as a fraction of the abundance in the first year of the study. (b) Frequency distribution of trend coverages (i.e. percentages of years when the 95% CI of the fitted trend contained the 'true' value calculated from the virtual species) obtained from a replication experiment.

FIGURE 7 The confidence intervals are one-sided, so they represent the hypotheses that (a) the absolute error rate is greater than zero, (b) the coverage is at least 95%, and (c) the trend is correctly classified as 'moderate decline' or more negative (e.g. 'steep decline').
Our results show that, with the proposed generation algorithm, a VS can be created on the basis of real spatiotemporal abundance data, with the possibility of replicating the fundamental features of the original system. Using a procedure that can be described as 'reverse engineering', we showed how to disassemble a spatiotemporal statistical model into its functional elements (representing ecological processes, such as observational error, spatiotemporal autocorrelation, and random temporal or spatial variation) and then how to reassemble these components into a fully synthetic, yet operational representation of a real population.
The major advantage of our approach is its flexibility (see Supplementary File 4 for an additional case study). The framework has been closely adapted to specific bird monitoring data, but the general idea can be applied to any data, possibly with a different structure.
The user can set different sampling schemes, either by replicating the original one or by defining a custom scheme from scratch. Different structures of observational error can be incorporated, for instance those accounting for imperfect detection only (using a binomial distribution) or those allowing both omission and commission errors (using a lognormal distribution, as we did in our example). The framework can be further extended to suit specific research needs, making it broadly useful for spatiotemporal ecological studies. After all, introducing more realism into VS simulations is a way to increase the applicability of the results in the real world.
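The two observation-error structures mentioned above can be sketched as follows. Binomial thinning allows omission errors only (each individual is detected or missed), whereas a multiplicative lognormal error lets counts fall on either side of the truth. The detection probability and error SD here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def binomial_error(true_counts, p_detect, rng):
    """Imperfect detection only: each individual is seen with
    probability p_detect, so observed <= true (omission errors only)."""
    return rng.binomial(true_counts, p_detect)

def lognormal_error(true_counts, sigma, rng):
    """Multiplicative lognormal error: counts can be under- or
    over-estimated (both omission and commission errors).
    The -sigma^2/2 shift makes the multiplier's mean equal to 1."""
    mult = rng.lognormal(-sigma**2 / 2, sigma, size=np.shape(true_counts))
    return rng.poisson(true_counts * mult)

true_counts = rng.poisson(10, size=10_000)
obs_binom = binomial_error(true_counts, p_detect=0.7, rng=rng)
obs_logn = lognormal_error(true_counts, sigma=0.3, rng=rng)
```

The binomial scheme biases observed abundance downward by a factor of `p_detect`, while the mean-corrected lognormal scheme is unbiased on average but adds commission errors, matching the distinction drawn in the text.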
The main limitation of our study is that the proposed approach is phenomenological: it can reconstruct the system that generated the observed data but gives limited insight into the mechanisms behind it. Similarly, the approach can be used to reconstruct the original biological system, but the user still has no control over the quality of the input information or whether it has been properly sampled. However, some biases, such as observation error or random spatial and temporal variation, can be accounted for while generating the VS, and the procedure can be evaluated at different stages of the process. A 'reverse-engineering' approach has been used to incorporate elements that should provide higher realism of the simulated species, but we cannot be sure that the reconstructed system is consistent with reality, as the reconstruction is based on only some properties of the observed world. This problem, however, opens up an epistemological debate on how to demonstrate the predictive reliability of ecological models and whether the quality of a model can be evaluated by other means (Oreskes, 1998).
Another limitation, when using species distribution models, could be a lack of high-quality spatiotemporal data, both environmental data and data on population abundance. However, as mentioned earlier, this is changing, as both kinds of data are increasingly available not only from detailed small-scale surveys but also from many wildlife monitoring programmes, citizen science projects (Dickinson et al., 2010) and environmental information based on large-scale, high-quality remote sensing.
Our results cast new light on the question of adding realism while generating a VS, an issue that has already been addressed by some authors (Elith & Graham, 2009; Thibaud et al., 2014). Another promising approach would be to incorporate the processes underlying species distributions, which are necessary for a realistic representation of virtual species. This can be done by using mechanistic or individual-based models. Some toolkits and packages allow users to include these aspects, e.g. dispersal limitations (Leroy et al., 2016; Qiao et al., 2016) or genetic and evolutionary components, when generating virtual species distributions. These comprehensive approaches, incorporating the key ecological processes (that determine species ranges) into the simulation, as well as the one proposed here, are necessary for a better integration of statistical modelling and ecological theory.

ACKNOWLEDGEMENTS

The study was supported by the National Science Centre (NCN) in Poland (grant no. 2018/29/B/NZ8/00066). We are grateful to Michał Wawrzynowicz for his help in testing the framework.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer-review/10.1111/2041-210X.14176.

DATA AVAILABILITY STATEMENT
The code and vignettes are available at the GitHub repository https://github.com/popecol/Virtual-Species and at the Zenodo repository https://doi.org/10.5281/zenodo.8004779. Data are deposited in the Zenodo archive at https://doi.org/10.5281/zenodo.7732648 (Malinowska et al., 2022).