SPECIES: A platform for the exploration of ecological data

Abstract
The modeling of ecological data that include both abiotic and biotic factors is fundamental to our understanding of ecosystems. Repositories of biodiversity data, such as GBIF, iDigBio, Atlas of Living Australia, and SNIB (Mexico's National System of Biodiversity Information), contain a great deal of information that can lead to knowledge discovery about ecosystems. However, there is a lack of tools with which to efficiently extract such knowledge. In this paper, we present SPECIES, an open, web‐based platform designed to extract implicit information contained in large-scale sets of ecological data. SPECIES is based on a tested methodology, wherein the correlations of variables of arbitrary type and spatial resolution, both biotic and abiotic, discrete and continuous, may be explored from both niche and network perspectives. In contrast to other modeling systems, SPECIES is a full-stack exploratory tool that integrates three basic components: data (which are incrementally growing), a statistical modeling and analysis engine, and an interactive visualization front end. Combined, these components provide a powerful tool that may guide ecologists toward new insights. SPECIES is optimized to support fast hypothesis prototyping and testing, analyzing thousands of biotic and abiotic variables, and presenting descriptive results to the user at different levels of detail. SPECIES is an open‐access platform, available online (http://species.conabio.gob.mx), that is powerful, flexible, and easy to use. It allows for the exploration and incorporation of ecological data and their subsequent integration into predictive models for both potential ecological niche and geographic distribution. It also provides an ecosystemic, network‐based analysis that may guide the researcher in identifying relations between different biota, such as the relation between disease vectors and potential disease hosts.


SPECIES 1.0: A brief tutorial
In recent years there has been a substantial increase in digital biodiversity and environmental data, such as museum specimen collection records and climatic and topographic raster layers. Furthermore, with the creation of open databases such as the Global Biodiversity Information Facility (GBIF) (www.gbif.org) and WorldClim - Global Climate Data (www.worldclim.org), many data are now publicly available. The challenge is to find tools with which to transform those data into knowledge, in ways that are useful for a wide range of users.
Here, we give a basic introduction to SPECIES, a computer tool for the exploration and analysis of geographic data, used to construct species ecological niches, species potential distributions, and Complex Inference Networks for community analysis. SPECIES uses a spatial data mining framework and takes as input any spatial variable (e.g. collection points of any mammal species, or temperature values) from a predefined geographical region, identifying statistical associations based on the degree of co-occurrence between the target variables, for instance species-climate, species-habitat, or species-species associations.
To determine the co-occurrence between geographic variables, SPECIES uses a uniform rectangular grid that divides the region of interest into regular spatial cells, xα, and then counts co-occurrences within each xα for a class, C, and a subset of potential predictive variables, X. To quantify which spatial variable co-distributions show a statistically significant correlation, relative to the null hypothesis that their distributions are independent and randomly distributed over the study region, SPECIES calculates the statistical diagnostic epsilon, ε(C|X), where values of |ε| > 1.96 correspond to a greater than 95% confidence that the co-occurrences occur at a rate inconsistent with the null hypothesis. Epsilon is the building block from which species distributions, species niches, and complex inference networks can be constructed. The basic hypothesis is that ecological interactions can be inferred from the relative distributions of spatial variables.
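The gridding and counting steps described above can be sketched as follows. This is a minimal illustration and not the platform's own code: the function names, the grid origin, and the cell size in degrees are hypothetical, and the form of ε used here (a binomial test statistic comparing P(C|X) with P(C)) is an assumption, since this tutorial does not state the formula explicitly; it is consistent with the |ε| > 1.96 threshold quoted above.

```python
import math

def cell_id(lon, lat, lon0, lat0, cell_deg):
    """Map a point to the (column, row) index of a uniform rectangular grid.

    lon0, lat0 is the lower-left corner of the grid and cell_deg the cell
    size in degrees (both hypothetical parameters for this sketch).
    """
    return (math.floor((lon - lon0) / cell_deg),
            math.floor((lat - lat0) / cell_deg))

def occupied_cells(points, lon0, lat0, cell_deg):
    """Return the set of grid cells that contain at least one point."""
    return {cell_id(lon, lat, lon0, lat0, cell_deg) for lon, lat in points}

def epsilon(n_cx, n_x, n_c, n):
    """Assumed form: eps(C|X) = N_X [P(C|X) - P(C)] / sqrt(N_X P(C) (1 - P(C))).

    n_cx: cells where C and X co-occur; n_x: cells with X;
    n_c: cells with C; n: total number of cells in the grid.
    """
    if n_x <= 0 or not 0 < n_c < n:
        raise ValueError("epsilon is undefined for these counts")
    p_c = n_c / n
    p_c_given_x = n_cx / n_x
    return n_x * (p_c_given_x - p_c) / math.sqrt(n_x * p_c * (1 - p_c))

# Invented counts: 100 cells, C in 20 of them, X in 25, both in 15.
print(epsilon(n_cx=15, n_x=25, n_c=20, n=100))
```

With the co-occurrence count obtained as the size of the intersection of two `occupied_cells` sets, a value of |ε| above 1.96 would flag the pair as co-occurring more (or less) often than expected under independence.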

SYSTEM DESCRIPTION
Unlike many ecological modelling applications, which are distributed as downloadable software, SPECIES is a web-based application that can be accessed with a web browser, such as Google Chrome, Microsoft Edge, Mozilla Firefox, or Safari, at the URL http://species.conabio.gob.mx/.
SPECIES provides tools to share the results of an analysis: tables can be exported in CSV or Excel format; maps can be downloaded in vector format (shapefiles); and ecological networks can be exported as CSV files. Another way to share an analysis is to share its setup via a URL that reproduces the exact setup of the experiment. Additionally, SPECIES is currently linked to the National Biodiversity Information System (SNIB, by its Spanish acronym) of CONABIO; users therefore have access to a main database of Mexican biodiversity that includes around 8 million georeferenced localities of 81,603 species of flora and fauna. SPECIES also includes 19 bioclimatic variables from the WorldClim database.
SPECIES has two analysis modules: 1) ecological niche and 2) ecological community. To illustrate the workflow in SPECIES, we present two case studies, one for each module of the platform (http://species.conabio.gob.mx/).

ECOLOGICAL NICHE MODULE: Case study: Integrating biotic and abiotic variables to evaluate the potential distribution of Lynx rufus in Mexico
An important goal in niche modelling is to determine and compare the contributions of biotic and abiotic niche features, in order to better understand their relative importance for species distributions. A niche modelling methodology that allows us to include different types of variables, such as climate and biotic interactions, therefore offers a fruitful framework within which to explain the ecological processes that occur from local to regional scales, and to understand which factors, barriers, or biotic interactions are important for a particular species in a particular geographical location; in other words, to better understand the relation between the geographical distribution of a species and its niche.
We show the SPECIES workflow for the analysis of the ecological niche and potential distribution of the bobcat (Lynx rufus), integrating biotic and abiotic predictors. Because L. rufus is a carnivore, we used other mammal species (potential prey) as biotic predictors.
1. Accessing the "Ecological Niche" module
This first screen shows two sections: 1) "Species", where we select our target species, and 2) "Variables group", where we select our predictor variables.

Selecting input data
We type the scientific name of the species, i.e. Lynx rufus, to select its geographic data. The system then displays information about the species: in the "Summary" tab we find taxonomic information, the number of collection points (the bobcat has 630 records), and the number of cells with species occurrences (for a grid of 20 km², the bobcat has 247 unique cells). In the "Validation & filters" tab, we can choose whether SPECIES performs a validation analysis (by default, 70% of cells are used for training and 30% for testing), the minimum number of cells with occurrences (i.e. the minimum number of unique cells that any predictor variable must have to be included in the analysis), a probability map to display the species' potential distribution, and a date filter, which allows us to select a particular period of time.
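As an illustration of the default validation option, a random 70/30 partition of the occupied cells can be sketched as below. How SPECIES actually draws the partition is not described in this tutorial, so the shuffling strategy, the fixed seed, and the function name `split_cells` are assumptions made only for this example.

```python
import random

def split_cells(cells, train_fraction=0.7, seed=0):
    """Randomly split cell identifiers into training and testing sets.

    The seed is fixed only so that this sketch is reproducible.
    """
    rng = random.Random(seed)
    shuffled = sorted(cells)   # deterministic starting order
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return set(shuffled[:n_train]), set(shuffled[n_train:])

# The bobcat example above has 247 unique cells; integers stand in
# for the real cell identifiers.
train, test = split_cells(range(247))
print(len(train), len(test))
```

The model would then be built from the training cells only and evaluated on how well it recovers the presences in the testing cells.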
The system displays the species localities on a map, where we can explore this information. By selecting any collection point we can access the metadata of that record. If we identify an incorrect record, we can remove it by clicking on the eraser icon (upper left corner) and then clicking on the point to be removed.
The next step is to select our predictor variables. In "Variables group", SPECIES currently has two types of variables: 1) biotic variables, which are collection points of around 80,000 flora and fauna species, and 2) abiotic variables, comprising 19 bioclimatic layers.

The system allows us to select biotic variables at any taxonomic level. In our particular case, because L. rufus is a carnivore, we selected Class Mammalia (1), to which the bobcat's potential prey belong. Then, we selected the groups within Class Mammalia that we want to include (2).
Finally, we add these groups to the analysis (3). Thus, we have biotic predictors with which to characterize the bobcat's ecological niche.
In the next tab we can select climatic variables. Users can select all 19 bioclimatic layers or choose only those that are relevant to their particular study. In this case we added all 19 variables.
Finally, we run the experiment with the "Explore information" button.

Exploring Ecological Niche module Outputs
The first result is a potential species distribution map, which is built with a Naïve Bayes approach by calculating the score contribution of each predictor variable. A complete explanation of how the geographic map is constructed can be found in the SPECIES manuscript (Stephens et al). In summary, this map shows a gradient from optimal niche conditions (dark red) to suboptimal niche conditions (dark blue), determined by a combination of biotic and abiotic factors. Dark red areas represent regions with a high probability of L. rufus being present.
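To make the idea of score contributions concrete, the sketch below uses a common Naïve Bayes formulation in which each predictor variable X contributes S(X) = ln[P(X|C) / P(X|not C)] and the score of a cell is the sum of the contributions of the variables present in it. The exact formula, the smoothing constant `alpha`, the omission of a prior term, and all names and counts are assumptions made for illustration; the SPECIES manuscript is the authoritative description.

```python
import math

def score(n_cx, n_x, n_c, n, alpha=0.01):
    """Assumed score S(X) = ln[P(X|C) / P(X|not C)] with additive smoothing.

    n_cx: cells with both the target C and the variable X; n_x: cells with X;
    n_c: cells with C; n: total cells. alpha keeps the ratio finite when a
    count is zero (a hypothetical choice for this sketch).
    """
    p_x_given_c = (n_cx + alpha / 2) / (n_c + alpha)
    p_x_given_not_c = (n_x - n_cx + alpha / 2) / (n - n_c + alpha)
    return math.log(p_x_given_c / p_x_given_not_c)

def cell_score(variables_in_cell, scores):
    """Sum the contributions of the predictor variables present in one cell."""
    return sum(scores[v] for v in variables_in_cell if v in scores)

# Invented counts for two placeholder predictors on a 100-cell grid.
scores = {
    "prey_species_A": score(n_cx=15, n_x=25, n_c=20, n=100),    # positive
    "climate_range_B": score(n_cx=2, n_x=30, n_c=20, n=100),    # negative
}
print(cell_score(["prey_species_A", "climate_range_B"], scores))
```

Under this formulation, cells whose variables have mostly positive scores would fall toward the dark red end of the map and cells dominated by negative scores toward the dark blue end.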
If we select any cell, the system displays the information contained in that cell, showing each variable's contribution (score values) to the potential presence or absence of our target species. This map can be downloaded by clicking on the corresponding icon; the information is then sent to your email. The map is in vector format (*.shp) and can be displayed in any geographic information system software, e.g. QGIS, ArcGIS, or DIVA-GIS.
[...] the initial setup, the system displays curves over the bar graph, which show the cumulative proportion of presences predicted correctly in each decile. The system calculates curves for each type of variable selected and for their combination, allowing us to see which variable type, or combination, is more predictive.
At the end of the web page, the system displays a [...]
This first screen shows two sections: 1) "Source group", where we select our species of interest, and 2) "Target group", where we select the species with which we want to identify potential interactions. Additionally, the platform gives us the option to select the minimum number of cells with occurrences for any species.

Exploring Community module Outputs
To explore the community outputs we have three windows. The first (left of the image) is the network graph itself. The network shows source and target nodes with different colours and sizes. In the example, Lutzomyia species are in blue and mammals are in orange. Node size is related to the number of cells in which a species occurs, so large nodes are species present in a large number of cells and small nodes are species present in few cells. If users select any node, the system displays the species name and the number of cells with its occurrences; additionally, all nodes linked to the selected node are highlighted. When we select a group of nodes, the map in the bottom right window shows the richness for each cell; this richness heat map allows us to identify geographic patterns of species richness. The third window (top right) is the histogram of correlations, with the range of epsilon values on the x-axis; here we can set a threshold on ε, and the system displays only those pairs of species whose links fall within this range of values.
At the end of the web page, there is a table with all species pair combinations for our target and source groups. This table contains the names of the species, the number of cells where the source and target species co-occur (nij), the number of cells of the target species (nj), the number of cells of the source species (ni), the total number of cells (n), and the ε values. When we download the network by clicking on the corresponding button, the system sends us this table in CSV format, which can be opened in Excel or other software for network analysis.
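Once downloaded, the table can also be filtered outside the platform, mirroring the ε threshold of the histogram window. The sketch below is only an illustration: the column headers (`source`, `target`, `nij`, `nj`, `ni`, `n`, `epsilon`) and the rows are invented placeholders, and the headers of the real export should be checked before adapting it.

```python
import csv
import io

# Invented rows with hypothetical column names; not real SPECIES output.
SAMPLE_CSV = """source,target,nij,nj,ni,n,epsilon
source_sp_1,target_sp_1,15,20,25,100,5.0
source_sp_1,target_sp_2,5,20,25,100,0.0
source_sp_2,target_sp_1,2,20,30,100,-1.83
"""

def significant_links(csv_text, threshold=1.96):
    """Return (source, target, epsilon) for rows with |epsilon| > threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    links = []
    for row in reader:
        eps = float(row["epsilon"])
        if abs(eps) > threshold:
            links.append((row["source"], row["target"], eps))
    return links

for source, target, eps in significant_links(SAMPLE_CSV):
    print(source, target, eps)
```

To read a downloaded file instead of the inline sample, the text can be obtained with `open(path, newline="").read()` and passed to the same function.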