VisTrails SAHM: visualization and workflow management for species habitat modeling

Authors


J. T. Morisette, U.S. Geological Survey, North Central Climate Science Center, 2150 Centre Dr., Fort Collins, CO 80526, USA. E-mail: morisettej@usgs.gov

Abstract

The Software for Assisted Habitat Modeling (SAHM) has been created to both expedite habitat modeling and help maintain a record of the various input data, pre- and post-processing steps and modeling options incorporated in the construction of a species distribution model through the established workflow management and visualization VisTrails software. This paper provides an overview of the VisTrails:SAHM software including a link to the open source code, a table detailing the current SAHM modules, and a simple example modeling an invasive weed species in Rocky Mountain National Park, USA.

Understanding where species will thrive is an important consideration for resource managers concerned with either promoting species (when they are threatened or endangered) or controlling them (when they are invasive or unwanted). The field of species distribution or habitat niche modeling is contributing to this understanding. With an increasing availability of both ecological data (Graham et al. 2004) and software packages to fit ecological niche models (Phillips et al. 2006, Franklin 2009, Thuiller et al. 2009, Guo and Liu 2010, Peterson et al. 2011) as well as new tools to evaluate model performance (Allouche et al. 2006, Phillips and Elith 2010, Warren et al. 2010), researchers and land managers now have an unprecedented opportunity to explore many parameters and iterations for any given habitat niche modeling exercise. Each niche modeling technique has multiple parameters and options that can be adjusted, as well as choices for input and output data. For habitat models that consider climate change, there are future climate projections from different climate modeling centers and multiple emissions scenarios to consider (IPCC 2007). Land managers might want to evaluate different biological responses, such as different or multiple species or, for a given species, different life cycles (e.g. breeding vs nesting habitat). Furthermore, it may be of interest to modify the spatial extent and spatial resolution of both input/predictor layers and output/model results. With these options and others not listed here, the potential number of model runs and related results can be overwhelming. There is a need for careful documentation of the precise model configuration as well as meaningful interpretation of results. Scientific workflow systems help address this need.

Existing framework: VisTrails

Scientific workflow systems provide languages with well-defined semantics to specify computational processes which integrate existing applications according to a set of rules (Dedeke 2004). They can also model complex analysis processes at various levels of detail and systematically capture the information necessary for reproducibility, result publication, and sharing, collectively referred to as workflow ‘provenance’ (Deelman et al. 2006, Scheidegger et al. 2008). Visualization systems help people explore and explain data by allowing the creation of both static and interactive visual representations (Johnson et al. 2006).

VisTrails is an open-source provenance management and scientific workflow system that combines a provenance-enabled workflow system with powerful visualization techniques, integrating the best of both scientific workflow and scientific visualization systems (Freire et al. 2006). As a visualization system, VisTrails makes advanced scientific visualization techniques available to users, allowing them to explore and compare different visual representations of their data. As a scientific workflow system, VisTrails enables the composition of workflows that combine specialized libraries, distributed computing infrastructure, and Web services.

A beta version of the VisTrails system was first released in January 2007 and VisTrails 2.0 was released in 2012 < www.vistrails.org/index.php/All_releases >. The system has been downloaded over 35 000 times. The VisTrails wiki has had over 1.5 million page views, and Google Analytics reports that visitors to the site come from 65 different countries. VisTrails has been adopted in several scientific projects, both nationally within the United States and internationally, and in different areas, including environmental sciences (Baptista et al. 2008, Howe et al. 2008), psychiatry (Anderson et al. 2007), astronomy (Tohline et al. 2009), cosmology (Anderson et al. 2008), high-energy physics (Dolgert et al. 2008), and quantum physics (Bauer et al. 2011, Freedman et al. 2012). VisTrails is also a key component of the Ultrascale Visualization – Climate Data Analysis Tools (UV-CDAT), a new toolset for large-scale climate data analysis < http://uv-cdat.llnl.gov >. VisTrails has been successfully used as a tool for teaching, having been adopted at universities in the United States and abroad (Silva et al. 2011), as well as a tool for creating reproducible publications (Koop et al. 2011, Freire and Silva 2012).

A distinguishing feature of VisTrails is a comprehensive provenance infrastructure that maintains detailed history information, referred to as provenance, about the steps followed and data derived in the course of an exploratory task (Freire et al. 2006). VisTrails maintains this provenance for resulting data products (e.g. visualizations, plots, raster output), tying them to the workflows from which they were derived. The system also provides extensive annotation capabilities that allow users to enrich the automatically captured provenance.

VisTrails works through a graphical user interface where the user pulls pre-defined modules onto the central portion of the application’s display for inclusion in the workflow. Modules are connected based on rules specific to each module and their pre-defined relationships to each other. The final set of connected modules is referred to as a ‘workflow’. When the user makes any changes to a given workflow, those transformations are captured in a ‘version tree’ as provenance.
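The idea of a version tree can be illustrated with a minimal sketch. The code below is not the VisTrails data structure or API; the class name, method names, and action strings are all hypothetical, chosen only to show how recording each change as a child of the version it modified lets any earlier state, or the full history of a result, be recovered.

```python
class VersionTree:
    """Toy version tree: each version stores its parent and the change that produced it."""

    def __init__(self):
        # Version 0 is the empty workflow (the root of the tree).
        self.parent = {0: None}
        self.action = {0: "root"}
        self._next_id = 1

    def commit(self, parent_id, action):
        """Record a change made to version `parent_id` and return the new version id."""
        if parent_id not in self.parent:
            raise KeyError(f"unknown version {parent_id}")
        new_id = self._next_id
        self._next_id += 1
        self.parent[new_id] = parent_id
        self.action[new_id] = action
        return new_id

    def history(self, version_id):
        """Return the actions leading from the root to `version_id`, oldest first."""
        steps = []
        while version_id is not None:
            steps.append(self.action[version_id])
            version_id = self.parent[version_id]
        return list(reversed(steps))


# Hypothetical editing session: two alternative changes branch from version v2.
tree = VersionTree()
v1 = tree.commit(0, "add module TemplateLayer")
v2 = tree.commit(v1, "add module FieldData")
v3 = tree.commit(v2, "point TemplateLayer at an 800 m template")
v4 = tree.commit(v2, "point TemplateLayer at a 100 m template")
```

Because both v3 and v4 keep a pointer to v2, neither alternative overwrites the other, which is the property that makes exploratory changes traceable.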

Every workflow in VisTrails is a set of linked modules. Independent of any particular workflow, a set of modules that shares a common analysis objective is grouped in VisTrails as a ‘package’. From the user’s perspective, a workflow can pull modules from any package and a package is simply a method to organize the various modules from which to choose. From the developer’s perspective, organizing a group of modules into a package can help organize the software development and deliver a more complete and consistent set of modules to the user. VisTrails comes with a fairly extensive set of modules < www.vistrails.org >, yet it lacked a package dedicated to geospatial analysis focused on habitat modeling. The Software for Assisted Habitat Modeling was created to address this gap.
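A workflow as a set of linked modules can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the VisTrails module API: the `Workflow` class, its methods, the connections between modules, and the data they pass along are all made up for the example; only the module names are borrowed from SAHM. Each module runs once all of the modules feeding it have produced their output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """Toy workflow: named modules (plain callables) linked by upstream connections."""

    def __init__(self):
        self.modules = {}   # module name -> callable
        self.upstream = {}  # module name -> list of upstream module names

    def add_module(self, name, func, inputs=()):
        self.modules[name] = func
        self.upstream[name] = list(inputs)

    def run(self):
        """Execute every module after its upstream modules; return all outputs by name."""
        results = {}
        for name in TopologicalSorter(self.upstream).static_order():
            args = [results[u] for u in self.upstream[name]]
            results[name] = self.modules[name](*args)
        return results


wf = Workflow()
wf.add_module("TemplateLayer", lambda: {"resolution_m": 800})
wf.add_module("FieldData", lambda: [(1, 2, 1), (3, 4, 0)])  # made-up (x, y, response) rows
wf.add_module(
    "MDSBuilder",
    lambda template, points: {"n_points": len(points), **template},
    inputs=["TemplateLayer", "FieldData"],
)
outputs = wf.run()
```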

Novel contribution: SAHM

There are several advantages to conducting species distribution modeling within the VisTrails:SAHM context. First, it allows for the formalization and tractable recording of the entire modeling process. In much of the modeling literature, emphasis is placed on the correlative modeling routines, with little attention given to the decisions and steps taken both before and after the actual modeling routine is run, or to proper documentation of those steps. However, there is evidence that some of these pre- and post-modeling steps can significantly affect results (Heikkinen et al. 2006, Pearson et al. 2006, Diniz et al. 2009, Nenzén and Araújo 2011, Rodda et al. 2011, Synes and Osborne 2011). At the very least, it is essential to document all of these steps if the results are to be reproduced. Secondly, it can allow for easier collaboration. Having a common modeling framework that encompasses the entire modeling process allows one group to understand what the other has done and to iterate on adjustments or changes to the model configuration in a way that is fully explicit and, again, tractable. Thirdly, by wrapping many disparate tools and custom processing steps in a user-friendly interface, the software compatibility and file management burden on researchers is greatly reduced. Finally, formalizing the components in a modular setting facilitates the addition of future modeling routines and tools. For example, if a group develops a new set of predictor layers to be considered by the ecological niche modeling community, a modular workflow system can allow such data to be incorporated through a new ‘input data’ module.

The modules in SAHM can be divided into five main components: 1) input data (both field observations and predictor layers), 2) preprocessing, 3) preliminary model analysis and decision, 4) correlative models, and 5) output routines. These five components are highlighted in a representative model workflow (Fig. 1, Table 1). These modules use a suite of open source or freely available libraries (described in detail in the user guide) and are designed to make it easy to add new model algorithms or geoprocessing procedures, requiring only a Python wrapper for their incorporation into VisTrails.

Figure 1.

Example workflow from VisTrails:SAHM with the five main modeling steps highlighted.

Table 1. A listing of the current SAHM modules with a brief description of what they do.
| Component | Module | Functionality |
| --- | --- | --- |
| Input | *Predictors | Access commonly used covariate geospatial data from the local file system. (* is a descriptive prefix pertaining to a particular group of predictors.) |
| | TemplateLayer | Define a template for the modeling grid which provides the extent, spatial resolution, and projection information for subsequent geospatial processing. There is no constraint that this template grid has to match all or any of the predictor layers. (For issues related to modeling grid configuration, see Guisan et al. 2007.) |
| | FieldData | Ingest the csv file containing the occurrences of the response data, e.g. species presence, presence/absence, or count data. |
| Pre-processing | PARC | Projection, aggregation, resampling, and clipping of input covariate geospatial data to match the template layer. |
| | FieldDataQuery | Functions for reformatting and subsetting raw field data files. |
| | FieldDataAggregationAndWeight | Provide options for how multiple occurrence records within a single model grid pixel are handled. This module can address declustering of input field data. |
| | MDSBuilder | Generate a Merged Data Set (MDS) which contains the species point data as well as pixel values at each point for each predictor. Optionally adds background points. |
| | RasterFormatConverter | Convert predictor rasters between common file formats. |
| Preliminary model analysis and decisions | ModelEvaluationSplit | Split the input points into test and training groups stratified by the response given the percent of the data that should be used for testing. Evaluation metrics from this split are only calculated following model selection when the user elects to select and test the final model. |
| | ModelSelectionSplit | Split the input points into test and training groups stratified by the response given the percent of the data that should be used for model selection. Evaluation metrics from this split are reported in the model output plots, textual output, and the appended output and are intended for selecting the best model or set of models. |
| | ModelSelectionCrossValidation | Split the input points into a specified number of cross validation folds and calculate evaluation metrics for each combination of training and hold out data. These metrics are reported in the module output plots, textual output, and the appended output and are intended for selecting the best model or set of models. |
| | CovariateCorrelationAndSelection | Display the scatter plot and correlation between each pair of covariates and the relationship between each covariate and the response. Allows for exclusion of any covariate from subsequent processing. |
| Modeling routines implemented for presence/absence or presence only response | BoostedRegressionTree (BRT) | An algorithm that starts with a single decision tree, then adds trees that best explain the error (based on deviance reduction), shrinking the influence of each tree as new trees are added. Model simplification is carried out in a stepwise procedure using cross validation to assess predictive performance. This module uses the ‘gbm’ function in R, based on code by Elith et al. (2008). |
| | GLM | Fits a generalized linear model using a bidirectional stepwise procedure to select covariates based on Akaike’s information criterion (AIC) using the glm function in R. |
| | RandomForest (RF) | A machine learning ensemble classifier that fits numerous decision trees using random subsets of the covariates and observations at each step using the randomForest (Breiman 2001, Liaw and Wiener 2002) package in R. |
| | MARS | Multivariate Adaptive Regression Splines (MARS) fit piecewise logistic regression models to presence/absence data. Overfitting is controlled by a penalty term that can optionally be set by the user. This module uses the MARS package in R, based on Leathwick et al. (2006). |
| Modeling routines implemented for presence-only response | Maxent | A module to incorporate the Maximum Entropy software (Phillips et al. 2006) into the SAHM/VisTrails environment. |
| Output routines | SAHMModelOutputViewerCell | View several common model evaluation plots and metrics (such as Receiver Operating Characteristic (ROC) curves, response curves, calibration plots, confusion matrices, spatial patterns of deviance residuals, and variable importance plots). |
| | SAHMSpatialOutputViewerCell | View the spatial model output in a GIS window along with the occurrence data. Outputs include probability of occurrence map, binary map, smoothed residual surface, multivariate environmental similarity surfaces maps (MESS maps), and most dissimilar variable maps (MOD maps; Elith et al. 2010). |

Note on modeling routines: modeling routines, along with all other SAHM modules, allow the user to specify several parameters. The optimization methods or values used by default are documented in the user manual and can also be accessed by right clicking on any module.

Note on output routines: all output is written to files in the session folder specified by the user. Graphic displays are stored as jpegs, summary evaluation metrics are stored in a text file, and maps as tiffs.
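The stratified test/training split described for ModelEvaluationSplit and ModelSelectionSplit can be sketched as follows. This is a minimal illustration of the general technique, not the SAHM implementation: the function name, its arguments, the use of a fixed random seed, and the example response vector are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(responses, test_fraction, seed=0):
    """Return (train_idx, test_idx), holding out about `test_fraction` of each response class."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)

    # Group record indices by response value (e.g. 1 = presence, 0 = absence).
    by_class = defaultdict(list)
    for idx, value in enumerate(responses):
        by_class[value].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical response vector: 20 presences and 80 absences.
responses = [1] * 20 + [0] * 80
train, test = stratified_split(responses, test_fraction=0.3)
```

Splitting within each response class keeps the ratio of presences to absences the same in the training and test groups, which a simple random split does not guarantee for rare species.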

The parameter exploration tool and visualization capabilities are two components of VisTrails that offer added utility for habitat mapping. Within the SAHM package, parameter exploration allows the user to specify a list of two or more parameters (that is, modeling options) to be run automatically through the modeling analysis. For example, the template layer shown in Fig. 1 points to a geospatial file (such as a shapefile or GeoTIFF) to set the spatial extent and resolution for the final model output. The Parameter Exploration tool could be used to specify a list of geospatial files to produce models with varying spatial extent or resolutions, or both (explained in more detail in the user guide and associated tutorial). The output optionally also includes MESS (multivariate environmental similarity surface) maps following Elith et al. (2010) to illustrate levels of extrapolation, the importance of which is demonstrated in Stohlgren et al. (2011).

An example modeling invasive plants in Rocky Mountain National Park, USA

The utility of VisTrails:SAHM can be demonstrated through a habitat modeling exercise, illustrated with the workflow shown in Fig. 1. This workflow is fairly typical in scope yet still involves a significant amount of pre-processing, model settings, and results to consider. This particular modeling exercise was conducted to compare models that use a combination of remote sensing and climate data against models that use only one or the other.

The climate data used were a subset of the bioclim variables derived from 800 m resolution PRISM climate data averaged over 1971 to 2009, obtained from Climate Source < www.climatesource.com/ > and converted into 19 bioclim layers following the WorldClim algorithms < www.worldclim.org/bioclim >. These layers were included in the model via the PRISMPredictors module (Fig. 1). The remote sensing data used were a subset of the MODIS land surface phenology products (Tan et al. 2011), obtained from NASA Goddard Space Flight Center < accweb.nascom.nasa.gov >. These layers were included in the model via the MODISPhenologyPredictors module. In addition, we included a vegetation cover layer for Rocky Mountain National Park made available through the National Park Service Data Store < https://irma.nps.gov/App/Reference/Profile/1040622 >, specified through the generic ‘Predictor’ module. The pop-up window that results from the CovariateCorrelationAndSelection module was used to select which variables to include in the model.

The three sets of predictors were run through four modeling routines (GLM, RF, BRT, and MARS; described in Table 1 and shown in the ‘Correlative models’ block of modules in Fig. 1). Each of these four routines was run at two different model resolutions, 100 and 800 m, defined by specifying two different template files (via the TemplateLayer module). Figure 2 shows the results of the modeling from the workflow shown in Fig. 1. The array of output allows the user to consider differences across the modeling routines (looking across a row) as well as the impact of using the three different predictor sets (looking down a column).
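The number of runs implied by this design (three predictor sets, four routines, two resolutions) can be enumerated with a short sketch. This only illustrates the combinatorics that parameter exploration automates; it does not call VisTrails, and the dictionary keys and the `spreadsheet` helper are hypothetical names introduced for the example.

```python
from itertools import product

predictor_sets = ["climate + remote sensing + land cover", "no remote sensing", "no climate"]
models = ["GLM", "RF", "BRT", "MARS"]
resolutions_m = [100, 800]

# One dictionary per model run: every combination of the three lists.
runs = [
    {"predictors": p, "model": m, "resolution_m": r}
    for p, m, r in product(predictor_sets, models, resolutions_m)
]


def spreadsheet(runs, resolution_m):
    """Arrange the runs at one resolution as rows = predictor sets, columns = models."""
    return [
        [run["model"] for run in runs
         if run["predictors"] == p and run["resolution_m"] == resolution_m]
        for p in predictor_sets
    ]


grid = spreadsheet(runs, 800)
```

The full design yields 24 runs; fixing the resolution gives the 3 × 4 arrangement of predictor sets by modeling routines.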

Figure 2.

Example showing a 3 × 4 VisTrails spreadsheet where the rows represent the three sets of predictor layers (row 1 = climate, remote sensing and land cover, row 2 = no remote sensing, row 3 = no climate) and columns represent the four modeling techniques: (a) habitat suitability maps and (b) ROC curves and associated AUC values (with training data AUC on top and testing data AUC values on the bottom). The colors shown in (a) represent the relative odds of habitat suitability, where cool colors represent less suitable and hotter colors more suitable habitat. Maps (a) are also available as separate GeoTIFF files and the model diagnostic output (b) is available as text and jpeg files, as described in Table 1.

Figure 2b shows the results of running the same workflow, but with the SAHMSpatialOutputViewerCell modules (Table 1) replaced with SAHMModelOutputViewerCell modules (Table 1). While an in-depth exploration of the model results is beyond the scope of this paper, we found that the models which include both the remote sensing and climate variables performed slightly better than those with either no climate or no remote sensing. This is indicated by the model diagnostics, of which the ROC curve and AUC values are shown in Fig. 2b. The other model diagnostics (not shown) behaved similarly. This result held true across the four correlative modeling routines and at both resolutions tested.
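For readers less familiar with the AUC values reported with the ROC curves, the following sketch computes AUC from its rank interpretation: the probability that a randomly chosen presence receives a higher predicted value than a randomly chosen absence, with ties counted as one half. It is a plain-Python illustration of the metric, not the code SAHM uses (SAHM's evaluation is carried out through its R-based routines), and the labels and scores are made-up values.

```python
def auc(labels, scores):
    """AUC as the fraction of presence/absence pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up test data: 1 = presence, 0 = absence, with predicted suitability scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
value = auc(labels, scores)  # 8 of the 9 presence/absence pairs are ranked correctly
```

A value of 0.5 corresponds to a model that ranks presences and absences no better than chance, and 1.0 to perfect separation.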

Conducting a typical habitat modeling workflow outside of a consolidated workflow management system might require detailed proficiencies in several programs (e.g. GIS, Python, and R) and result in hundreds of intermediate files and dozens of final outputs. In this example, running these multiple combinations was accomplished entirely in VisTrails within one session and resulted in a set of comparable and organized outputs as well as a full record of the models that were used and the specification of parameters for those models (see the on-line tutorial).

The SAHM software reflects the philosophy that model fitting should be an iterative and interactive process, informed not only by geospatial and statistical expertise but also by ecologists who best understand the species of interest and resource managers who might use the results. The users can add or exclude covariates based on their relationship with other predictors or the response, change model fit parameters, select how the cut-off threshold is optimized for turning continuous predictions into a binary presence/absence map, and produce plots to assess residuals and evaluate other model diagnostics. An exciting potential for VisTrails:SAHM is the possibility of connecting directly with remote sensing databases (e.g. MODIS time series or new downscaled climate). The tutorial data provide a small subset of predictor layers. The USGS and NASA are currently developing modules to access satellite products and climate data from remote sources for use in a locally configured VisTrails:SAHM model. With VisTrails:SAHM, unnecessary complexity is minimized, leaving the users with the ability to fine tune and control the model fitting process.
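Threshold optimization can be sketched with one widely used criterion, maximizing the sum of sensitivity and specificity. This criterion is an assumption chosen for illustration; the options SAHM actually offers are documented in its user manual, and the function name and example data below are hypothetical.

```python
def best_threshold(labels, scores):
    """Return the cut-off, among the observed scores, maximizing sensitivity + specificity."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one presence and one absence")
    best_t, best_value = None, -1.0
    for t in sorted(set(scores)):
        # A location is predicted 'present' when its score is at or above the cut-off.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        value = tp / n_pos + tn / n_neg
        if value > best_value:  # on ties, the lowest cut-off is kept
            best_t, best_value = t, value
    return best_t


# Made-up data: four presences and two absences with continuous predictions.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3]
threshold = best_threshold(labels, scores)
binary_map = [1 if s >= threshold else 0 for s in scores]
```

Other criteria (for example, fixing sensitivity at a required level) would select a different cut-off from the same predictions, which is why the choice is worth recording alongside the model.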

This example shows how VisTrails can increase the efficiency of running multiple models and storing information to reproduce results. This can help create a ‘common language’ for the community to utilize when testing multiple modeling techniques, input layers, and parameter settings. The community is invited to explore VisTrails:SAHM and is encouraged to interact with the developers to collaborate on bringing novel modeling developments into VisTrails through new modules or improvements to existing modules. Full details of the VisTrails:SAHM package are given at < https://my.usgs.gov/catalog/RAM/SAHM >.

To cite VisTrails or acknowledge its use, cite this Software note as follows, substituting the version of the application that you used for ‘version 0’.

Morisette, J. T., Jarnevich, C. S., Holcombe, T. R., Talbert, C. B., Ignizio, D., Talbert, M. K., Silva, C., Koop, D., Swanson, A. and Young, N. E. 2013. VisTrails SAHM: visualization and workflow management for species habitat modeling. – Ecography 36: xxx–xxx (ver. 0).

Acknowledgements

The work was jointly funded by NASA and the USGS. Special thanks to NASA program manager Woody Turner and USGS invasive species program coordinator Sharon Gross. We are grateful to the VisTrails development team and colleagues at Colorado State Univ. who are helping to test VisTrails:SAHM, in particular Paul Evangelista, Sunil Kumar, Nick Young, and Lane Carter. Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government. Although the SAHM program has been used by the U.S. Geological Survey (USGS), no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith.
