Keywords:

  • artificial neural networks;
  • ecological niche modelling;
  • Python;
  • species distribution

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Simapse
  5. Input data and general options
  6. Output results
  7. Example
  8. Discussion
  9. Acknowledgements
  10. References
  11. Supporting Information

1. Artificial neural networks (ANNs) are known for their predictive power in the analysis of both linear and nonlinear relationships. They have been successfully applied in several fields, including ecological modelling and the prediction of species’ distributions.

2. Here we present Simapse – Simulation Maps for Ecological Niche Modelling, a free and open-source application written in Python and available for the most common platforms. It uses ANNs with back-propagation to build spatially explicit distribution models from species data (presence/absence, presence-only and abundance).

3. The main features include the automatic production of replicates with different sub-sampling methods and total control of ANN structure and learning parameters.

4. Simapse uses common text formats for its main input and output, and provides assessments of variable importance and behaviour as well as measures of model fitness.


Introduction

Artificial neural networks (ANNs) have been used in scientific fields where pattern recognition is a primary need, and their application to biological systems has grown considerably over the past decades (Lek et al. 1996; Lek & Guégan 1999; Özesmi, Tan & Özesmi 2006a). As machine learning algorithms, it is no surprise that ANNs are increasingly being used in ecological studies. These algorithms are usually seen as better suited than other methods to deal with complex ecological datasets (Brosse & Lek 2000; Olden, Lawler & Poff 2008; Özesmi et al. 2006b; Pearson et al. 2002).

A common type of ANN is the feed-forward neural network with back-propagation learning (BPN; Lek & Guégan 1999). This network has a layered structure of neurons connecting the inputs to an output through one or several hidden layers (see supplementary material Fig. S1 for more details). The BPN has been used to model ecological systems owing to its efficient learning ability and its simple nature, which makes it easy to understand (Lek & Guégan 1999; Özesmi, Tan & Özesmi 2006a). The unit of the BPN is an artificial neuron with an activation function, usually linear or sigmoid, which squashes the weighted sum of the outputs of the previous layer’s neurons into a value that is passed, through the connecting weights, to the next layer of neurons.
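
The forward pass just described can be illustrated with a minimal Python sketch. All weights, biases and layer sizes below are invented for the example; this is not Simapse's actual implementation:

```python
import math

def sigmoid(x):
    # Logistic activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the previous layer's outputs, then activation
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(s)

def forward(inputs, layers):
    # `layers` is a list of layers; each layer is a list of (weights, bias)
    signal = inputs
    for layer in layers:
        signal = [neuron_output(signal, w, b) for w, b in layer]
    return signal

# One hidden layer with two neurons feeding a single output (arbitrary weights)
hidden = [([0.5, -0.3], 0.1), ([0.8, 0.2], -0.4)]
output = [([1.0, -1.0], 0.0)]
print(forward([0.2, 0.7], [hidden, output]))
```

Back-propagation then adjusts these weights iteratively to reduce the error between the network's output and the target values.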

Simapse

Simulation Maps for Ecological Niche Modelling (Simapse – http://purl.oclc.org/simapse) is an open-source, multi-platform application released under the GNU Public License (GPL) and written in Python (http://www.python.org) that applies the pattern recognition power of BPNs to ecological data within a spatially explicit framework (Fig. 1). Although Simapse depends on a few external Python modules for graphing purposes, a complete model can be built with only the core Python installation. The process of creating potential distribution maps with Simapse is straightforward and benefits from the graphical user interface, the strong spatial component, input and output in common text formats and options to fully control the sub-sampling and learning process (Table 1). The application automates the building of several models with different sub-sampling methods and the creation of a final averaged prediction, assuring robust results by taking into account the independent information of the individual models (Araújo & New 2007) and describing the uncertainty between individual models. Simapse is also able to project models onto a different set of the same variables, including distinct spatial or temporal extents.

Figure 1.  Graphical user interface of Simapse in different operating systems. Simapse has a simple layout of options divided into five main areas: 1) the input/output; 2) the sub-sampling methods; 3) the network properties and general options; 4) the buttons; and 5) the text box.

Table 1.   Overview of Simapse’s general options

Sub-sampling
 Random repetition: a percentage of the sample records is randomly set aside for evaluating the error of the network
 K-fold cross-validation: the dataset is divided into k folds, and each model is trained with k-1 folds and tested with the remaining fold
 Bootstrapping: each sub-sample of user-defined size is obtained by random sampling with replacement from the dataset
Network structure
 Iterations: divided into internal iterations (the number of times the data are passed through the network to minimise the error before each report) and reported iterations (the number of reports made, from which the best network is chosen)
 Learning rate: defines the amount of learning (an indication of a suitable value is given by the hint button)
 Momentum: defines the inertia of the learning, i.e. the influence of the previous weight change on the current change
 Hidden layers: the network’s hidden-layer architecture, defined by the user as neurons per layer separated by commas (e.g. ‘3, 2, 4’ creates three hidden layers with three, two and four neurons)
Options
 Test percentage: the percentage of the sub-sampled dataset (by random repetition or bootstrap) that will be used to test the network
 Pseudo-absences ratio: with presence-only datasets, defines the proportion of pseudo-absences to be created in relation to the number of presences
 Burn-in iterations: the number of initial iterations needed to achieve minimum learning
 AUC filter: if active, defines the AUC threshold for accepting a reported network

Input data and general options

The graphical interface of Simapse has five main areas (Fig. 1): (i) the input/output definitions; (ii) the sub-sampling methods; (iii) the BPN properties and general options; (iv) the buttons area; and (v) the text box. Simapse uses target data (i.e. species presence) and independent variables as inputs to build a model. The independent variables are ASCII raster files and should be placed inside a directory that is given to Simapse. These variables are automatically standardised to z-scores (Lek et al. 1996; Özesmi & Özesmi 1999; Özesmi, Tan & Özesmi 2006a). All standardised variables are saved as ASCII rasters in a directory inside the provided raster directory.
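
Z-score standardisation itself is simple to sketch. The snippet below is a generic illustration (not Simapse's code); whether Simapse uses the population or sample standard deviation is not stated here, so the population form is an assumption:

```python
import statistics

def z_scores(values):
    # Standardise to zero mean and unit standard deviation
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; Simapse's exact choice is assumed
    return [(v - mean) / sd for v in values]

# Toy raster cell values
cells = [10.0, 12.0, 14.0, 16.0, 18.0]
print(z_scores(cells))
```

Standardisation of this kind puts variables with different units (e.g. temperature and precipitation) on a common scale before they enter the network.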

The targets may be presence/absence, presence-only or abundance data in a text file with a header (e.g. target;longitude;latitude), one sample per row and fields separated by semicolons. Presence and absence are coded in the text file as 1 and 0, respectively. When using presence-only data, the user may define a ratio of presences to pseudo-absences. When using presence/absence or presence-only data, the final model is evaluated with receiver operating characteristic (ROC) and precision-recall (PR) curves and their respective area under the curve (AUC) values. These measures are based on the confusion matrix of real and predicted values, but whereas ROC uses the full table by comparing sensitivity and 1-specificity, PR avoids the use of the true negative values by using precision and recall (Davis & Goadrich 2006). These measures have a high discrimination power but should be analysed with care, especially with pseudo-absences or when comparing different algorithms (Peterson, Papeş & Soberón 2007; Lobo, Jiménez-Valverde & Real 2008). As all results are output as text files, the model’s performance may also be evaluated by means other than those available natively in the application.
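
The contrast between the two curves can be made concrete by deriving both sets of coordinates from a single confusion matrix. The counts below are invented for the example; this is illustrative code, not Simapse's:

```python
def roc_and_pr_points(tp, fp, tn, fn):
    # ROC uses the full confusion matrix: sensitivity vs. 1 - specificity
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)
    # PR avoids the true negatives entirely: precision vs. recall
    precision = tp / (tp + fp)
    return {"roc": (1 - specificity, sensitivity),
            "pr": (sensitivity, precision)}

# One point on each curve, for one classification threshold
print(roc_and_pr_points(tp=40, fp=10, tn=45, fn=5))
```

Sweeping the classification threshold and repeating this calculation traces out the full ROC and PR curves, whose areas give the AUC values.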

Modelling abundance data with Simapse is possible, but it requires continuous data between zero and one; the input species data must be scaled to this range beforehand. To evaluate the performance of the model, Simapse builds a cross-validation plot where each abundance value is plotted against the model’s predicted value.
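
The text does not prescribe how this pre-scaling should be done; simple min-max normalisation is one reasonable choice, sketched below:

```python
def scale_to_unit(values):
    # Min-max scaling of abundance data into the [0, 1] range Simapse expects
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(scale_to_unit([3.0, 7.0, 11.0]))
# → [0.0, 0.5, 1.0]
```

Any monotone rescaling into [0, 1] would serve; the key point is that the targets match the output range of the sigmoid activation.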

To construct several model replicates, there are three options for sub-sampling the target data: (i) random repetitions, where the user defines the number of model repetitions and a percentage of the data randomly chosen to test each repetition; (ii) k-fold cross-validation, where data are divided into k folds and each model is trained with k-1 folds and tested with the remaining fold, resulting in a total of k models; and (iii) bootstrapping, where a new dataset, sized as a user-defined percentage of the original dataset, is created by randomly sampling the data with replacement for both training and test datasets.
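
The three schemes can be sketched in a few lines of generic Python (illustrative only; Simapse's internals may differ):

```python
import random

def random_repetition(data, test_fraction, rng):
    # Hold out a random fraction for testing; the rest trains the network
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]   # (train, test)

def k_folds(data, k):
    # Each fold is tested once against a model trained on the other k-1 folds
    folds = [data[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def bootstrap(data, size, rng):
    # Sample with replacement up to the user-defined size
    return [rng.choice(data) for _ in range(size)]

rng = random.Random(0)
train, test = random_repetition(list(range(10)), 0.2, rng)
print(len(train), len(test))
```

Each scheme yields a different trade-off between the number of replicates and the independence of their test sets.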

The options for the BPN assembly are divided into two main groups: the network structure and the learning options. The user has to define the structure of the hidden layers, as the input layer is based on the number of variables detected in the rasters directory and a single output is always used. The hidden structure is defined as the number of neurons per layer, separated by commas. Although Simapse allows several hidden layers, one is usually enough to solve complex interactions between the dependent and the independent variables. A simpler structure usually generalises better, thus avoiding overfitting, and requires less computing power and time (Özesmi & Özesmi 1999). Usually, the choice of the hidden architecture is made by trial and error (Dimopoulos 1999; Özesmi, Tan & Özesmi 2006a; Özesmi et al. 2006b). An indication of the learning rate value can be obtained with the hint button, which tests several learning rate values with the internal iterations, momentum and hidden structure defined by the user and presents the values ranked by the amount of learning they produce.

The learning options refer to the number of iterations, the learning rate and the momentum. The final number of iterations is the product of the internal iterations (i.e. the number of times each sequence of targets is passed through the network during the training phase) and the reported iterations, at which Simapse reports the error and, optionally, the AUC value of the training process. During the training stage, Simapse saves each reported network to the output folder. After training, only the best network is preserved: the selection algorithm chooses the network with the lowest sum of training and test errors. When an AUC threshold is defined, all networks that do not meet the threshold are removed prior to this selection. This process results in a model that is representative of the training data relationships, avoiding possible overfitting of the BPN by testing each trained network against a second (test) dataset. This procedure allows good generalisation to be achieved (Dimopoulos 1999).
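
The selection rule described above (optionally filter by AUC, then keep the network with the lowest combined error) can be sketched as follows. The report tuples are invented for the example:

```python
def pick_best(reports, auc_threshold=None):
    # `reports` holds (train_error, test_error, auc) for each saved network
    candidates = reports
    if auc_threshold is not None:
        # Drop networks that fail the AUC filter before comparing errors
        candidates = [r for r in reports if r[2] >= auc_threshold]
    # Keep the network with the lowest sum of training and test errors
    return min(candidates, key=lambda r: r[0] + r[1])

reports = [(0.10, 0.10, 0.85), (0.15, 0.12, 0.91), (0.05, 0.40, 0.95)]
print(pick_best(reports, auc_threshold=0.9))
# → (0.15, 0.12, 0.91)
```

Note how the AUC filter changes the outcome: without it, the first network (lowest combined error, 0.20) would win, but it falls below the 0.9 threshold.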

Output results

After running the model, the user-defined output folder contains all the results produced by Simapse, saved in text and image formats. The successfully built models are saved in the output folder as rasters and are averaged into a single consensus model. Simapse also produces rasters of prediction uncertainty by calculating the spatial standard deviation across all models.
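
The consensus and uncertainty rasters amount to a cell-wise mean and standard deviation across replicates, which can be sketched generically (flat lists of predicted probabilities stand in for rasters here; not Simapse's code):

```python
import statistics

def consensus(models):
    # `models` is a list of rasters, each a flat list of predicted probabilities
    mean_map = [statistics.mean(cell) for cell in zip(*models)]
    # Per-cell standard deviation describes between-model uncertainty
    sd_map = [statistics.pstdev(cell) for cell in zip(*models)]
    return mean_map, sd_map

replicates = [[0.25, 0.75], [0.75, 0.25], [0.5, 0.5]]
mean_map, sd_map = consensus(replicates)
print(mean_map)  # → [0.5, 0.5]
```

Cells where the replicates disagree get a high standard deviation, flagging regions where the consensus prediction is least reliable.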

Although ANNs are still seen as ‘black boxes’, there are several methods to disentangle the effect of predictors in the model (Fu & Chen 1993; Dimopoulos, Bourret & Lek 1995; Lek et al. 1996; Olden & Jackson 2002; Gevrey, Dimopoulos & Lek 2003, 2006a; Gevrey, Lek & Oberdorff 2006b; Özesmi, Tan & Özesmi 2006a; Özesmi et al. 2006b). Simapse incorporates sensitivity techniques that provide reliable results in identifying each variable’s general contribution and response (Gevrey, Dimopoulos & Lek 2003). The partial derivatives algorithm (PaD) measures the sensitivity of the network with respect to the input data. Simapse gives two outputs using PaD: (i) the variable’s contribution to the model; and (ii) the individual partial derivatives that measure the sensitivity across each variable’s range. The profile algorithm sets all variables to zero except one, for which it depicts the predictive behaviour throughout its range of values. We added a third method, the variable surface, which is similar to the profile method and is based on Lek’s algorithm (Lek et al. 1996): it plots the prediction surface of a variable throughout the range of all other variables.
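
The profile algorithm is the easiest of the three to sketch: hold every variable at zero and sweep one across its range, recording the model's prediction. The stand-in model below is an arbitrary callable invented for the example, not a trained network:

```python
def profile(model, n_vars, index, lo, hi, steps=5):
    # Profile method: all variables fixed at zero, one swept across its range
    points = []
    for i in range(steps):
        value = lo + (hi - lo) * i / (steps - 1)
        inputs = [0.0] * n_vars
        inputs[index] = value
        points.append((value, model(inputs)))
    return points

# A stand-in "model": any callable mapping an input vector to a prediction
toy_model = lambda x: x[0] ** 2 + 0.1 * x[1]
print(profile(toy_model, n_vars=2, index=0, lo=-1.0, hi=1.0))
# → [(-1.0, 1.0), (-0.5, 0.25), (0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
```

PaD replaces the sweep with partial derivatives of the output with respect to each input, and the variable-surface method repeats the sweep across the ranges of the remaining variables instead of holding them at zero.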

In addition to the plots, Simapse outputs text files with the results of all sensitivity analyses and model-building stages. To easily retrieve any of these files from the output folder, Simapse also creates a model report with a full summary and links to all images and respective text files. All spatial results are output as ASCII raster files, ready to import into most GIS packages.

Example

To better illustrate the workflow with Simapse and its outputs, we created a simple virtual species (VS) widespread throughout Europe based on five real environmental variables (Fig. 2). The original variable data were downloaded from Worldclim (http://www.worldclim.org/) at 10’ resolution and further processed to create the maximum and minimum precipitations and temperatures, plus the altitude data. The presence area of the VS was obtained by averaging Gaussian or logistic functions applied to the variables (Fig. 2; see Supplementary material Fig. S2). A dataset of 100 presence locations chosen randomly from the presence area of the VS was used as input to the model (this dataset is included with the download of Simapse).

Figure 2.  Consensus model of the virtual species’ presence. The gradient describes the probability of presence. The black dots are the locations of the 100 presences used for modelling, randomly selected from the distribution area of the virtual species delimited by the dashed line.

We used a network with a single hidden layer of five neurons and set the learning rate to 0·1 after the hint given by the application. The sub-sampling method was set to 50 random repetitions. Each replicate was trained with 1000 iterations and filtered by AUC (0·9 for training and 0·8 for testing). All other parameters were left at the application’s defaults. The same set of variables used to construct the VS was used as predictors to build the model.

After running Simapse, five replicates were discarded from the consensus model because they did not meet the AUC threshold. The consensus model built from the presence data matched the presence area of the VS (Fig. 2), although locations at the border of the presence area showed a higher standard deviation (Fig. 2; Supplementary material Fig. S3).

Simapse provides images of the model and exhaustive analyses (see Supplementary material Figs S3, S4). For each variable, it produced a series of partial derivative, profile and variable surface plots (see Supplementary material Figs S5–S7).

Simapse exhibited a good ability to learn the distribution of the species from presence-only data. The learning process detected the general trend of variable use even with presence-only data, as shown by comparing the VS’s real use of the variables with the sensitivity analysis results (see Supplementary material Fig. S8), found in the text files in the output folder. The availability of results in this format lets users easily produce plots to fit particular purposes.

Discussion

Simapse provides a spatially explicit framework to model species’ distributions with ANNs, with sensitivity analyses for studying the influence of each explanatory variable. The example shown here illustrates the workflow with a VS that has a very simple relation with the descriptors. Despite the extensive use of ANNs with ecological data, testing Simapse with more complex models, real-case examples and different sampling strategies is still needed. Nevertheless, we expect that the easy learning path of Simapse, with native analysis of the results, may provide advantages over other, code-demanding approaches, especially in more pragmatic areas such as applied conservation. As open-source software, it may also benefit from the experience of more advanced users to suggest and/or implement improvements. This also allows the application to be adjusted to fit the specific scope of different users’ projects.

The transparent framework, with all data output as text, is expected to result in an ample understanding of the built models, allowing the descriptive analysis to be completed by means other than those native to the application. Moreover, the models produced may be integrated into broader approaches where analyses with different algorithms are combined. We expect that the easy pathway this application provides for predicting species distributions, along with the efficient pattern-finding ability of BPNs, will assure the usefulness of Simapse for biodiversity studies.

Acknowledgements

PT, SC and JCB are supported by Fundação para a Ciência e Tecnologia (SFRH/BD/42480/2007, SFRH/BPD/74423/2010 and Programme Ciência 2007, respectively). We thank A. Townsend Peterson and an anonymous reviewer for their helpful comments on a preliminary version of the manuscript.

References

  • Araújo, M.B. & New, M. (2007) Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22, 42–47.
  • Brosse, S. & Lek, S. (2000) Modelling roach (Rutilus rutilus) microhabitat using linear and nonlinear techniques. Freshwater Biology, 44, 441–452.
  • Davis, J. & Goadrich, M. (2006) The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), pp. 233–240.
  • Dimopoulos, I. (1999) Neural network models to study relationships between lead concentration in grasses and permanent urban descriptors in Athens city (Greece). Ecological Modelling, 120, 157–165.
  • Dimopoulos, Y., Bourret, P. & Lek, S. (1995) Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters, 2, 1–4.
  • Fu, L. & Chen, T. (1993) Sensitivity analysis for input vector in multilayer feed-forward neural networks. IEEE International Conference on Neural Networks, 1993, pp. 215–218.
  • Gevrey, M., Dimopoulos, I. & Lek, S. (2003) Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264.
  • Gevrey, M., Dimopoulos, I. & Lek, S. (2006a) Two-way interaction of input variables in the sensitivity analysis of neural network models. Ecological Modelling, 195, 43–50.
  • Gevrey, M., Lek, S. & Oberdorff, T. (2006b) Utility of sensitivity analysis by artificial neural network models to study patterns of endemic fish species. Ecological Informatics (ed. F. Recknagel), pp. 293–306. Springer, Berlin.
  • Lek, S. & Guégan, J.F. (1999) Artificial neural networks as a tool in ecological modelling, an introduction. Ecological Modelling, 120, 65–73.
  • Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J. & Aulagnier, S. (1996) Application of neural networks to modelling nonlinear relationships in ecology. Ecological Modelling, 90, 39–52.
  • Lobo, J.M., Jiménez-Valverde, A. & Real, R. (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17, 145–151.
  • Olden, J.D. & Jackson, D.A. (2002) Illuminating the ‘black box’: a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154, 135–150.
  • Olden, J.D., Lawler, J.J. & Poff, N.L. (2008) Machine learning methods without tears: a primer for ecologists. The Quarterly Review of Biology, 83, 171–193.
  • Özesmi, S. & Özesmi, U. (1999) An artificial neural network approach to spatial habitat modelling with interspecific interaction. Ecological Modelling, 116, 15–31.
  • Özesmi, S., Tan, C. & Özesmi, U. (2006a) Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecological Modelling, 195, 83–93.
  • Özesmi, U., Tan, C., Özesmi, S. & Robertson, R. (2006b) Generalizability of artificial neural network models in ecological applications: predicting nest occurrence and breeding success of the red-winged blackbird Agelaius phoeniceus. Ecological Modelling, 195, 94–104.
  • Pearson, R.G., Dawson, T.P., Berry, P.M. & Harrison, P.A. (2002) SPECIES: a spatial evaluation of climate impact on the envelope of species. Ecological Modelling, 154, 289–300.
  • Peterson, A.T., Papeş, M. & Soberón, J. (2007) Rethinking receiver operating characteristic analysis applications in ecological niche modeling. Ecology, 3, 63–72.

Supporting Information

Fig. S1. Structure of an Artificial Neural Network with a back-propagation learning algorithm.

Fig. S2. Definition of the virtual species' distribution with environmental variables.

Fig. S3. Consensus averaged prediction and standard deviation, as shown by Simapse as a preview of the built model.

Fig. S4. Variable importance, ROC and precision-recall plots as provided by Simapse.

Fig. S5. Partial derivatives plots as provided by Simapse.

Fig. S6. Profile plots as provided by Simapse.

Fig. S7. Variable surfaces as provided by Simapse.

Fig. S8. Results of Simapse's sensitivity analysis and correspondent response data from the virtual species.

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

MEE3_210_sm_FigS1-S8.doc (DOC, 2959 KB): Supporting info item
MEE3_210_sm_Video.mp4 (MP4, 3993 KB): Supporting info item

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.