1. Artificial neural networks (ANNs) are known for their predictive power in the analysis of both linear and nonlinear relationships. They have been successfully applied in several fields, including ecological modelling and the prediction of species’ distributions.
2. Here we present Simapse – Simulation Maps for Ecological Niche Modelling, a free and open-source application written in Python and available for the most common platforms. It uses ANNs with back-propagation to build spatially explicit distribution models from species data (presence/absence, presence-only and abundance).
3. The main features include the automatic production of replicates with different sub-sampling methods and total control of ANN structure and learning parameters.
4. Simapse uses common text formats for its main inputs and outputs, and provides assessments of variable importance and behaviour as well as measures of model fit.
A common type of ANN is the feed-forward neural network with back-propagation learning (BPN; Lek & Guégan 1999). This network has a layered structure of neurons connecting the inputs to an output through one or several hidden layers (see Supplementary material Fig. S1 for more details). The BPN has been used to model ecological systems owing to its efficient learning ability and its simple nature, which makes it easy to understand (Lek & Guégan 1999; Özesmi, Tan & Özesmi 2006a). The basic unit of the BPN is an artificial neuron with an activation function, usually linear or sigmoid, which squashes the weighted sum of the outputs of the neurons in the previous layer into a single value that is then passed on, through the connecting weights, to the next layer of neurons.
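This feed-forward pass can be sketched in a few lines of Python (a minimal illustration of the general technique, not Simapse’s actual implementation; the `sigmoid` and `forward` names are ours):

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Propagate inputs through a list of weight layers.

    Each layer is a list of neurons; each neuron is a (weights, bias)
    pair whose weights connect it to every neuron of the previous layer.
    """
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations
```

With zero weights and bias, a sigmoid neuron outputs 0.5 regardless of its inputs, which is a convenient sanity check of the squashing behaviour.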
Simulation Maps for Ecological Niche Modelling (Simapse – http://purl.oclc.org/simapse) is an open-source, multi-platform application released under the GNU General Public License (GPL) and written in Python (http://www.python.org) that applies the pattern recognition power of BPNs to ecological data within a spatially explicit framework (Fig. 1). Although Simapse depends on a few external Python modules for graphing purposes, a complete model can be built with only the core Python installation. The process of creating potential distribution maps with Simapse is straightforward and benefits from the graphical user interface, the strong spatial component, input and output in common text formats and full control over the sub-sampling and learning options (Table 1). The application automates the process of building several models with different sub-sampling methods and creating a final averaged prediction, assuring robust results by taking into account the independent information of the individual models (Araújo & New 2007) and providing a description of the uncertainty between individual models. Simapse is also able to project models onto a different set of rasters of the same variables, covering distinct spatial or temporal extents.
Table 1. Overview of Simapse’s general options
A percentage of the sample records is randomly set aside for evaluating the error of the network
The dataset is divided into k folds, and each model is trained with k-1 folds and tested with the remaining fold
Each sub-sample of user-defined size is obtained by random sampling with replacement from the dataset
The iterations are divided into internal iterations (the number of times the data are passed through the network to minimise the error before each report) and reported iterations (the number of reports made, from which the best network is chosen)
Defines the learning amount (an indication of the value is given by the hint button)
Defines the inertia of the learning, i.e. the influence of the previous weight change in the current change
Network’s hidden layers architecture defined by the user: neurons per layer separated by comma (e.g. ‘3, 2, 4’ creates three hidden layers with three, two and four neurons)
The percentage of the sub-sampled dataset (by random repetition or bootstrap) that will be used to test the network
With presence-only datasets, defines the proportion of pseudo-absences to be created in relation to the number of presences
Number of initial iterations required to achieve minimum learning
If active, defines the AUC threshold to accept a reported network
Input data and general options
The graphical interface of Simapse has five main areas (Fig. 1): (i) the input/output definitions; (ii) the sub-sampling methods; (iii) the BPN properties and general options; (iv) the buttons area; and (v) the text box. Simapse uses target data (i.e. species presence) and independent variables as inputs to build a model. The independent variables are ASCII raster files and should be placed inside a directory that is given to Simapse. These variables are automatically standardised to z-scores (Lek et al. 1996; Özesmi & Özesmi 1999; Özesmi, Tan & Özesmi 2006a). All standardised variables are saved as ASCII rasters in a directory inside the provided raster directory.
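The z-score standardisation applied to each variable amounts to subtracting the mean and dividing by the standard deviation of the valid cells. A sketch of this step, assuming a NumPy array with a NoData sentinel value (our own illustration, not Simapse’s code):

```python
import numpy as np

def zscore(raster, nodata=-9999):
    """Standardise a raster to zero mean and unit variance,
    leaving NoData cells untouched."""
    data = raster.astype(float)
    mask = data != nodata
    mean = data[mask].mean()
    std = data[mask].std()
    out = data.copy()
    out[mask] = (data[mask] - mean) / std
    return out
```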
The targets may be presence/absence, presence-only or abundance data in a text file formatted with a header (e.g. target;longitude;latitude), one sample per row, with fields separated by semicolons. Presence and absence are coded in the text file as 1 and 0, respectively. When using presence-only data, the user may define a ratio of presences to pseudo-absences. When using presence/absence or presence-only data, the final model is evaluated by receiver operating characteristic (ROC) and precision-recall (PR) curves with the respective area under the curve (AUC) values. These methods are based on the confusion matrix of real and predicted values but, whereas ROC uses the full table by comparing sensitivity and 1-specificity, PR avoids the use of true negative values by using precision and recall (Davis & Goadrich 2006). These measures have high discrimination power but should be analysed with care, especially with pseudo-absences or when comparing different algorithms (Peterson, Papeş & Soberón 2007; Lobo, Jiménez-Valverde & Real 2008). As all results are output as text files, model performance may also be evaluated by means other than those available natively in the application.
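As an illustration of the ROC measure, its AUC equals the probability that a randomly chosen presence is scored higher than a randomly chosen absence (ties counting as half). A pure-Python sketch under our own names, not part of Simapse:

```python
from itertools import product

def roc_auc(scores, labels):
    """ROC AUC as the probability that a presence (label 1) outscores
    an absence (label 0); tied scores contribute 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0, an inverted ranking 0.0 and random scores about 0.5.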
Modelling abundance data with Simapse is possible, but it requires continuous data scaled between zero and one, so the input species data must be scaled to this range beforehand. To evaluate the performance of the model, Simapse builds a cross-validation plot where each abundance value is plotted against the output value predicted by the model.
To construct several model replicates, there are three options for sub-sampling the target data: (i) random repetitions, where the user defines the number of model repetitions and a percentage of the data randomly chosen to test each repetition; (ii) k-fold cross-validation, where the data are divided into k folds and each model is trained with k-1 folds and tested with the remaining fold, resulting in a total of k models; (iii) bootstrapping, where a new dataset, with a size defined by the user as a percentage of the original dataset, is created by randomly sampling the data with replacement for both training and test datasets.
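The three sub-sampling schemes can be sketched as simple generators of (train, test) splits (illustrative code under our own names, not Simapse’s internals):

```python
import random

def random_repetition(data, test_pct, n_reps, rng=random):
    """Yield (train, test) splits, holding out test_pct% of the
    records at random for each repetition."""
    n_test = round(len(data) * test_pct / 100)
    for _ in range(n_reps):
        shuffled = rng.sample(data, len(data))
        yield shuffled[n_test:], shuffled[:n_test]

def k_fold(data, k, rng=random):
    """Yield k (train, test) splits; each fold is used exactly once
    as the test set."""
    shuffled = rng.sample(data, len(data))
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def bootstrap(data, size_pct, n_reps, rng=random):
    """Yield datasets drawn with replacement from the original records."""
    n = round(len(data) * size_pct / 100)
    for _ in range(n_reps):
        yield [rng.choice(data) for _ in range(n)]
```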
The options for the BPN assembly are divided into two main groups: the network structure and the learning options. The user has to define the structure of the hidden layers, as the input layer is based on the number of variables detected in the raster directory and a single output is always used. The hidden structure is defined by the number of neurons per layer, separated by commas. Although Simapse allows several hidden layers, one is usually enough to solve complex interactions between the dependent and the independent variables. A simple structure is usually more prone to generalisation, thus avoiding overfitting, and requires less computing power and time (Özesmi & Özesmi 1999). Usually, the choice of the hidden architecture is made by trial and error (Dimopoulos 1999; Özesmi, Tan & Özesmi 2006a; Özesmi et al. 2006b). An indication of the learning rate value is obtained using the hint button, which tests several learning rate values with the internal iterations, momentum and hidden structure defined by the user and presents the values ranked by the amount of learning they can produce.
The learning options available refer to the number of iterations, the learning rate and the momentum. The final number of iterations is the product of the internal iterations (i.e. the number of times each sequence of targets is passed through the network during the training phase) and the reported iterations, at which Simapse reports the error and, optionally, the AUC value of the training process. During the training stage, Simapse saves each reported network to the output folder. After the training process, only the best network is preserved: the selection algorithm chooses the network with the lowest sum of training and test errors. When an AUC threshold is defined, all networks that do not meet the threshold are removed prior to this selection. This process results in a model that is representative of the relationships in the training data while avoiding possible overfitting of the BPN, by testing each network (i.e. training) against a second dataset (i.e. test). This procedure allows a good generalisation to be achieved (Dimopoulos 1999).
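The selection step can be sketched as follows (a hypothetical `select_best` helper; the report fields are our assumptions, not Simapse’s actual data structures):

```python
def select_best(reports, auc_threshold=None):
    """Pick the reported network with the lowest train+test error sum.

    reports: list of dicts with 'net', 'train_error', 'test_error', 'auc'.
    If an AUC threshold is given, networks below it are dropped first.
    """
    if auc_threshold is not None:
        reports = [r for r in reports if r['auc'] >= auc_threshold]
    if not reports:
        return None  # no network met the threshold
    return min(reports, key=lambda r: r['train_error'] + r['test_error'])
```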
After running the model, the user-defined output folder contains all the results produced by Simapse, saved in text and image formats. The successfully built models are saved in the output folder as rasters and are averaged into a single consensus model. Simapse also produces rasters of prediction uncertainty by calculating the spatial standard deviation of all models.
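The consensus and uncertainty maps amount to a per-cell mean and standard deviation over the replicate rasters, e.g. (an illustrative NumPy sketch):

```python
import numpy as np

def consensus(model_rasters):
    """Average replicate prediction rasters into a consensus map and
    a per-cell standard deviation as the uncertainty map."""
    stack = np.stack(model_rasters)  # shape: (n_models, rows, cols)
    return stack.mean(axis=0), stack.std(axis=0)
```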
Although ANNs are still seen as ‘black boxes’, there are several processes to disentangle the effect of predictors in the model (Fu & Chen 1993; Dimopoulos, Bourret & Lek 1995; Lek et al. 1996; Olden & Jackson 2002; Gevrey, Dimopoulos & Lek 2003, 2006a; Gevrey, Lek & Oberdorff 2006b; Özesmi, Tan & Özesmi 2006a; Özesmi et al. 2006b). Simapse incorporates sensitivity techniques that provide reliable results in identifying the variable’s general contribution and response (Gevrey, Dimopoulos & Lek 2003). The partial derivatives algorithm (PaD) measures the sensitivity of the network with respect to the input data. Two outputs are given by Simapse using PaD: (i) the variable contribution to the model; and (ii) the individual partial derivatives that measure the sensitivity throughout each variable range. The profile algorithm acts by setting all variables to zero except one, for which it depicts the predictive behaviour throughout its range of values. We added a third method, the variable surface, that is similar to the profile method and is based on Lek’s algorithm (Lek et al. 1996). It plots the prediction surface of a variable throughout the range of all other variables.
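The profile method described above can be sketched as follows (illustrative code; `predict` stands for any trained network’s prediction function, and zero is the mean of each z-scored variable):

```python
import numpy as np

def profile(predict, n_vars, var_index, var_range, steps=50):
    """Profile method: hold all variables at zero (their standardised
    mean) and sweep one variable across its range, recording the
    model's prediction at each step."""
    xs = np.linspace(var_range[0], var_range[1], steps)
    curve = []
    for x in xs:
        inputs = np.zeros(n_vars)
        inputs[var_index] = x
        curve.append(predict(inputs))
    return xs, np.array(curve)
```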
In addition to the plots, Simapse also outputs text files with the results data for all sensitivity analyses and model building stages. To easily retrieve any of those files in the output folder, Simapse also creates a report of the model with a full summary and links to all images and respective text files. All spatial results are output as ASCII raster files and are ready to import to most GIS packages.
To better illustrate the work flow with Simapse and its outputs, we created a simple virtual species (VS) widespread throughout Europe based on five real environmental variables (Fig. 2). The original variable data were downloaded from Worldclim (http://www.worldclim.org/) at 10’ resolution and further processed to create the maximum and minimum precipitation and temperature variables, plus the altitude data. The presence area of the VS was obtained by averaging Gaussian or logistic functions applied to the variables (Fig. 2; see Supplementary material Fig. S2). A dataset of 100 presence locations chosen randomly from the presence area of the VS was used as input to the model (this dataset is included with the Simapse download).
We used a single hidden layer network with five neurons and set the learning rate to 0·1 after the hint given by the application. The sub-sampling method was set to 50 random repetitions. Each replicate was trained with 1000 iterations and filtered with the AUC value (0·9 for train and 0·8 for test). All other parameters were set to the application’s default. The same set of variables used to construct the VS was used as predictors to build the model.
After running Simapse, five replicates were discarded from the consensus model for not meeting the AUC thresholds. The consensus model, built from the presence data, matched the presence area of the VS (Fig. 2), although locations at the border of the presence area showed a higher standard deviation (Fig. 2; Supplementary material Fig. S3).
Simapse provides images of the model and exhaustive analyses (see Supplementary material Fig. S3, S4). For each variable, it produced a series of partial derivative, profile and variable surface plots (see Supplementary material Fig. S5–S7).
Simapse exhibited a good learning ability, depicting the distribution of the species from presence-only data. The learning process detected the general trend of variable use even with presence-only data, as shown by comparing the real use of the variables by the VS with the sensitivity analysis results (see Supplementary material Fig. S8), found in the text files in the output folder. Because the results are available in this format, the user can easily produce plots to fit particular purposes.
Simapse provides a spatially explicit framework to model species’ distributions with ANNs, together with sensitivity analyses for studying the influence of each explanatory variable. The example shown here illustrates the work flow with a VS that has a very simple relationship with the descriptors. Despite the extensive use of ANNs with ecological data, testing Simapse with more complex models, real case examples and different sampling strategies is still needed. Nevertheless, we expect that the easy learning path of Simapse, with native analysis of the results, may provide advantages over other, more code-demanding approaches, especially in pragmatic areas like applied conservation. As open-source software, it may also benefit from the experience of more advanced users, who can suggest and/or implement improvements. This also allows the application to be adjusted to fit the specific scope of different users’ projects.
The transparent framework, with all data output as text, is expected to result in an ample understanding of the models built, allowing the descriptive analysis to be completed by means other than those native to the application. Moreover, the models produced may be integrated into broader approaches where analyses with different algorithms are used. We expect that the easy pathway this application provides to predict species distributions, along with the efficient pattern-finding ability of BPNs, will assure the usefulness of Simapse for biodiversity studies.
PT, SC and JCB are supported by Fundação para a Ciência e Tecnologia (SFRH/BD/42480/2007, SFRH/BPD/74423/2010 and Programme Ciência 2007, respectively). We thank A. Townsend Peterson and an anonymous reviewer for the helpful comments to a preliminary version of the manuscript.