Improving ecological data science with workflow management software

Pressing environmental research questions demand the integration of increasingly diverse and large‐scale ecological datasets as well as complex analytical methods, which require specialized tools and resources. Computational training for ecological and evolutionary sciences has become more abundant and accessible over the past decade, but tool development has outpaced the availability of specialized training. Most training for scripted analyses focuses on individual analysis steps in one script rather than creating a scripted pipeline, where modular functions comprise an ecosystem of interdependent steps. Although current computational training creates an excellent starting place, linear styles of scripting risk becoming labor‐ and time‐intensive and less reproducible because they often require manual execution. Pipelines, however, can be easily automated or tracked by software to increase efficiency and reduce potential errors. Ecology and evolution would benefit from techniques that reduce these risks by managing analytical pipelines in a modular, readily parallelizable format with clear documentation of dependencies. Workflow management software (WMS) can aid in the reproducibility, intelligibility and computational efficiency of complex pipelines. To date, WMS adoption in ecology and evolutionary research has been slow. We discuss the benefits and challenges of implementing WMS and illustrate its use through a case study with the targets r package, highlighting workflow automation, dependency tracking and improved clarity for reviewers. Although WMS requires familiarity with function‐oriented programming and careful planning for more advanced applications and pipeline sharing, investment in training will enable access to the benefits of WMS and impart transferable computing skills that can facilitate ecological and evolutionary data science at large scales.


| INTRODUCTION
The fields of ecology and evolution face a computational challenge. Projects increasingly use high-volume, heterogeneously distributed datasets that are difficult to manage or require intensive computer resources (Hampton et al., 2013; Kelling et al., 2009; Michener & Jones, 2012). More data are publicly available at finer spatial or temporal resolutions than ever before, such as monthly downscaled climate data (e.g. Harris et al., 2020), high-resolution (<30 m) land cover classifications (e.g. Grekousis et al., 2015) and billions of species occurrence records (Heberling et al., 2021).
Additionally, the volume of ecological data generated continues to expand rapidly (Farley et al., 2018), with related concerns about security and storage (Kambatla et al., 2014). Many of these datasets are continuously being updated to remain current or to improve accuracy. In ecology and evolution, this expansion of 'big data' creates opportunities to tackle a broad range of questions at unprecedented spatial and temporal scales. However, with this opportunity comes the challenge of harmonizing data from disparate sources for analysis. To make effective use of these resources, scientists require tools to overcome these challenges while enabling reproducibility at scale.
The resources and tools for harmonizing large datasets can be substantial. Often projects will require high-performance computing, cloud computing infrastructure or scripts designed for parallelization to balance growing demands for memory, storage and processing power with finite funds to cover data storage and computing time. Using more efficient practices in our analyses can reduce the costs of high-performance computing, such as time, money or availability of a shared resource. For example, a model designed to ingest downscaled monthly climate data estimated for 48 variables, under 15 global circulation models, four socioeconomic pathways and two future timeframes (e.g. AdaptWest Project, 2015) can require runtimes that exceed a reasonable timeframe for most research and stakeholder needs, even when using high-performance and cloud computing infrastructures (e.g. >1 year; Gettelman et al., 2019). Researchers often rely on institutional computing infrastructure or private cloud-based services. However, these services come with substantial financial costs and a steep learning curve for users. Furthermore, many questions in ecology and evolution require joining data from multiple sources, creating additional complexities. For instance, joining spatial data often requires matching coordinate reference systems, delineating spatial boundaries and aligning spatial resolutions. Additional pre-processing or modelling may also be required before the datasets can be joined for analysis. Although each of these tasks alone may not be complex, resource use and methodological complexity increase dramatically as additional datasets and pre-processing steps accumulate. Additional time is also consumed when the researcher is responsible for manually updating data sources and rerunning models at each step of a complex workflow (hereafter also referred to as a pipeline), leading to potential errors or pipelines that are difficult to reproduce (White et al., 2019). Ecology and evolution stand to benefit from tools that help to streamline complex pipelines.
Workflow management software (WMS) can make data pipelines more reproducible, intelligible and less computationally demanding (Leipzig et al., 2021; Sandve et al., 2013). Although definitions and functionality of WMS can vary across contexts and computing languages, WMS in general is an application used to create an efficient and reproducible chain of data processing steps by documenting all components of an analysis, creating modularity and encoding the dependencies among steps in the process. For example, Kepler (Altintas et al., 2004) and the targets and drake r packages (Landau, 2018, 2021b) are three prominent examples of WMS that have seen use in the fields of ecology and evolution (e.g. Hampton et al., 2022). A major benefit of WMS is the capacity to track the status of all required files and functions, preventing steps in a larger pipeline from being skipped and ensuring that data are kept up to date as models or harmonization routines change. Despite these benefits, adopting a WMS requires moving away from performing serial analytical operations within single or multiple scripts and instead breaking an analysis into smaller, modular functions (Figure 1), which in turn provides more computational flexibility.
WMS is increasingly relevant for projects in ecology and evolution but currently is not often used, and training resources are limited. Given the growth in demand for computational training, mismatches between computational skills and researchers' needs (Barone et al., 2017; Prabhu et al., 2011) and the lack of cross-institutional, standardized computational training (Feng et al., 2020), these tools may be intimidating to many users. Indeed, researchers face a familiar trade-off: invest personnel time in learning and deploying a tool like WMS to save personnel and compute time later, or accept that less efficient but more familiar analytical frameworks will come with costs in personnel time, compute time and potentially additional payments for computing resources. The balance will depend on the complexity of the analysis, the expectation of reusing the code over time and the resources available to the researcher (e.g. funds for computing or personnel). The needs of researchers, including workflow reusability, have previously been documented as important considerations for WMS (McPhillips et al., 2009) and likely contribute to its historically limited use.

KEYWORDS
ecological data science, pipeline, reproducibility, targets, workflow management
The delay in WMS adoption by ecology and evolutionary sciences is even more noteworthy considering that WMS are not new tools. For example, Kepler is a WMS that developed from reproducible synthesis initiatives in the early 2000s (Altintas et al., 2004). With a graphical user interface, Kepler allows for integrating disparate, complex workflows without the advanced coding knowledge that might be required for creating Makefiles, an analogous build-automation tool commonly associated with the C programming language. Even with these roots, WMS were historically not prevalent in the ecological and evolutionary sciences literature, especially relative to the physical sciences and engineering (Figure 2). Recently, though, WMS has become increasingly integrated into the tools that environmental scientists frequently use to conduct data wrangling and analyses. The r Statistical Environment (R Core Team, 2022) has the package targets and its predecessor, drake (Landau, 2018, 2021b); Esri mapping products have Model Builder (ArcGIS, 2010); and Python has the library Snakemake (Mölder et al., 2021). Despite this integration, packages like targets and Snakemake are differentiated from non-WMS software like the tidyverse (Wickham et al., 2019) in r or pandas (McKinney, 2010) in Python by their focus on dependency tracking and on connecting analytical steps in a workflow. As the tools used to conduct data wrangling and analysis have become more integrated, the time may be ripe for ecologists and evolutionary researchers to begin taking advantage of WMS.
To demonstrate the type of project where WMS saves significant resources and maximizes reproducibility in data-intensive ecological research, we highlight the use of the r package targets in an ecological project focused on modelling chlorophyll responses to climatic drivers using diverse sources of downscaled climate data. Below, we describe the general approach, benefits and common limitations of this tool.

FIGURE 1
The same analytical workflow is displayed using two different coding formats. Panel (a) is an r analysis using a typical layout consisting of a single script. Panel (b) is an example of formatting for use with the targets workflow management software package in r (R Core Team, 2022). The workflow consists of a mandatory _targets.R script and optional R/functions.R script. Here, the R/functions.R script helps break code into modularized functions. The targets (Landau, 2021b) function tar_visnetwork() generates the dependency graph from the two provided scripts. As analysis code becomes longer and more complex, the benefits of code compartmentalization through targets accrue.
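
To make the panel (b) layout concrete, the following is a minimal sketch of the two-script structure; the file names and the clean_data() and fit_model() helpers are illustrative placeholders rather than code from our project.

```r
# A minimal sketch of the two-script targets layout in panel (b).
# File names and helper functions are illustrative, not from the case study.

# ---- R/functions.R ----
# Modular functions, one per analysis step
clean_data <- function(raw_data) {
  raw_data[!is.na(raw_data$value), ]  # drop rows with missing values
}

fit_model <- function(clean) {
  lm(value ~ predictor, data = clean)  # a simple linear model
}

# ---- _targets.R ----
library(targets)
source("R/functions.R")  # load the modular functions

# Each tar_target() is one node in the dependency graph
list(
  tar_target(raw_file, "data/raw_data.csv", format = "file"),  # tracked input
  tar_target(raw_data, readr::read_csv(raw_file)),             # read the data
  tar_target(clean, clean_data(raw_data)),                     # cleaning step
  tar_target(model, fit_model(clean))                          # modelling step
)
```

Running tar_make() from the project directory executes these targets in dependency order, and tar_visnetwork() draws the dependency graph shown in the figure.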

| A CASE STUDY WITH TARGETS
The Ecology Under Lake Ice project (Hampton et al., 2016) brought together a community of lake researchers who collected biological and abiotic data sampled under winter ice and from the open water in summer, to understand how winter ecosystem properties and dynamics compared with the better-studied summer season. Initial analyses with these data revealed winter as a time with often surprisingly high standing biomass of primary producers (algae), with active grazers and microbial processes (Hampton et al., 2017). Next, we sought to determine whether climate and local weather could help explain variation in algal biomass, as storms can alter the light environment during winter and may alter both nutrients and light during the summer months (Hampton et al., 2022). Complex data assembly was necessary for the project's larger goal of inferring weather conditions during two times of the year when we did not have in situ data: the time at which ice formed each year and the time before summer sampling began.
Thus, the challenge involved harmonizing heterogeneous climate data not only with in situ data but also with our own modelled products.
We used the targets WMS to enhance the efficiency and reproducibility of our analysis because our larger research question required a complex data harmonization pipeline. The aspects of targets that we found most useful included the following: (1) the automated execution of all code once the pipeline was started; (2) the ability of targets to skip steps in the pipeline that were already complete (see Box 1) and to run steps in order of their upstream dependencies rather than the order in which they were written; (3) the option to generate a directed acyclic graph to visualize the steps in our analysis process; and (4) its design for use with r (R Core Team, 2022).
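
As a brief, hedged sketch of what these four features look like in an interactive session (the target name model is a placeholder, not one of our project's targets):

```r
library(targets)

# (3) Visualize the pipeline as a directed acyclic graph before running it
tar_visnetwork()

# (1, 2) Execute the pipeline; steps run in order of their upstream
# dependencies, not the order in which they were written
tar_make()

# (2) A second call skips every target that is already up to date
tar_make()

# Retrieve a stored result by its target name ('model' is a placeholder)
tar_read(model)
```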

FIGURE 2 Time series of number (upper) and proportion (lower) of publications referencing Workflow Management Software (WMS) or Data Pipeline within Web of Science (https://clarivate.com/webofsciencegroup/solutions/web-of-science). Publications are grouped by field of study. Publications referencing WMS began to appear in the 1990s and have increased rapidly since the 2010s. Whereas the majority of publications tend to be associated with engineering and physical sciences, other fields are beginning to incorporate WMS more frequently. Life sciences and earth and environmental sciences are highlighted in both plots to emphasize publications especially relevant to ecology and evolution.

BOX 1 How the targets WMS handles updates to data
A common challenge in analytical workflows is the need to replace existing data with updated or alternative data sources. Doing so can cause consternation with complex workflows where it may be difficult to track which downstream processes need to be rerun after changing a data source. A strength of targets and other dependency-tracking software is the ability to track the cascading effects of a change like this and to automatically bring the entire workflow back up to date.
For example, we could update the data source for our chlorophyll analysis workflow to a new version of CRU (Climatic Research Unit) air temperature. Using the tar_visnetwork() function to graph our workflow's dependencies (Figure 3), we would then see that updating the path to our CRU file would invalidate everything that depended on the CRU data target.
The tar_outdated() function (Landau, 2021b) would confirm what our graph shows, providing a vector of target names that would be rerun when updating the workflow. Once we reran the pipeline using tar_make(), all downstream targets would be rebuilt using the new dataset, assuming no errors arose.
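
In code, the sequence described in this box might look like the following sketch; the target name cru_airtemp_path follows Figure 3, while the example file name is a placeholder:

```r
library(targets)

# Suppose _targets.R now points the CRU target at a new file, for example:
#   tar_target(cru_airtemp_path, "data/cru_airtemp_new.nc", format = "file")
# Changing the tracked file invalidates every target downstream of it.

tar_visnetwork()  # out-of-date targets appear in a different colour
tar_outdated()    # character vector of target names that would be rerun
tar_make()        # rebuilds only the invalidated targets
```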
Automated code execution was particularly useful for nesting models within the analysis and extracting specific outputs for further analysis; specifically, we used climate data to infer weather conditions during the period of ice formation. Using targets, we first modelled the approximate ice-formation period using high temporal resolution air temperature data, then used the estimated date outputs to subset and aggregate climate data that represented local conditions of precipitation, wind and temperature. These data were then fed into the larger chlorophyll modelling process. Like our project, ecological data science efforts are often confronted with complex, multi-step analysis processes (Michener & Jones, 2012), which could be made more reproducible and potentially less computationally intensive. Although analytical approaches often differ, the growing availability of large, well-documented datasets means that our pipeline could either be reused in part or serve as a guide for future synthesis of climate data across large geographic scales (available at https://github.com/mbrousil/stormy_lakes_pub).

| Workflow design
We began the Ecology Under Lake Ice analysis with separate scripts for major steps: (1) identification of ice-formation/pre-stratification windows, (2) extraction of climate data for each season, (3) data harmonization (i.e. merging data from disparate sources and temporal or spatial resolutions) and (4) statistical modelling. Initially, we organized the workflow steps into a linear series to be run consecutively.
Whereas scripts can be numbered or commented to indicate the order in which they should be run and under what circumstances, this manual approach rapidly becomes intractable as complexity increases, so we opted to transition to the targets WMS.
Our first step was to determine an overarching skeleton for the analysis based on its core dependencies. Conceptually, targets operates on a series of interconnected steps, or 'targets', which define the order in which code is executed. These targets are written in code as expressions or functions listed inside of a central '_targets.R' script (Figure 1). With dependencies in mind, the code for our targets could then be repurposed from the original r scripts because the necessary tasks were very similar. For example, both versions of the workflow loaded the Ecology Under Lake Ice dataset. Both workflows also used a read_csv() (Wickham et al., 2022) call to do this, but the targets method framed this reading step as one of the early targets in the workflow, upon which downstream steps were dependent. In a more sophisticated instance, the targets workflow calculates the periods of ice formation with a custom function, determine_iceform(), which subsets temperature data, applies a rolling average, searches for a start date, then plots and exports the date data. The original workflow would also have needed to complete these tasks, but the construction of a function wrapper for this step made the code more modular and easier to incorporate as a single target in the larger workflow. Such targets should be small enough that they can be skipped (if appropriate) when an analysis is repeated but still large enough that they constitute a significant step in the analysis (Landau, 2021b).
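
A hedged sketch of this pattern follows; the body of determine_iceform() is heavily simplified here, and its arguments, column names and ice-on rule are illustrative assumptions rather than the project's actual code.

```r
# Sketch of wrapping one analysis step in a function (simplified; the
# column names, arguments and ice-on rule are illustrative assumptions)
determine_iceform <- function(air_temp, window = 7) {
  # Apply a rolling average to daily air temperature
  smoothed <- dplyr::mutate(
    air_temp,
    roll_temp = zoo::rollmean(temp_c, k = window, fill = NA, align = "right")
  )
  # Treat the first date the rolling mean drops below freezing as ice-on
  smoothed |>
    dplyr::filter(roll_temp < 0) |>
    dplyr::slice_min(date, n = 1)
}

# In _targets.R, the entire step then becomes a single target:
#   tar_target(iceform_dates, determine_iceform(air_temp_data))
```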
Once all data preparation, seasonal window calculation and climate data aggregation took place, our inputs were fed into the statistical modelling step. Throughout this process, the dependency graph reduced manual effort by providing quick confirmations of which downstream products would be affected by upstream decisions. This reduction in effort unquestionably enabled us to better standardize our outputs and quality control processes.

FIGURE 3 Directed acyclic dependency graph of our workflow generated using tar_visnetwork(), alongside example inputs and outputs. Light blue symbols denote out-of-date targets; green symbols are up to date. The far-left blue symbol is the 'cru_airtemp_path' target. The inputs include data types that are typical in ecology and evolution, such as spreadsheets of in situ lake measurements and climate rasters. These are paired with functions used to clean, manipulate, analyse and visualize the results. The outputs can represent anything; here they are the linear models and a partial dependence plot from a random forest regression analysis. Main panel (a) demonstrates the overall workflow of the targets WMS, and inset panel (b) demonstrates the ability to identify linkages among objects.
In cases where various collaborators may have (1) different versions of derived data products and (2) varying data skill sets, our implementation of a WMS could also allow collaborators to efficiently rebuild datasets and analyses without manually rerunning multiple scripts. Using the targets WMS, collaborators could reproduce the pipeline with a single command, tar_make() (Landau, 2021b), which executes the workflow. Consequently, our use of a WMS expedited data reprocessing, and collaborators did not require access to high-performance computing or knowledge of the entire data pipeline to reprocess all analyses.
To facilitate the use of WMS in ecological data science more broadly, we have prepared an introductory worked example for the targets package using an ecological dataset from the palmerpenguins r package (Horst et al., 2020). This example is less complex than the Ecology Under Lake Ice example and more tractable for beginners. This introductory example is available at https://targets-ecology.netlify.app/ and through its associated Open Science Framework repository (Brousil, 2022). We have also provided an interactive HTML dependency graph of the Ecology Under Lake Ice targets analysis at the end of the introductory example document (https://targets-ecology.netlify.app/closing-thoughts.html).

| Limitations and challenges
There will be situations where WMS may be more or less appropriate because ecology and evolution researchers have wide-ranging levels of experience and comfort with programming (Barone et al., 2017).
For example, low-complexity workflows may not be the ideal candidates for using pipeline software due to the comparatively low return on startup effort. Conversely, large projects may benefit from WMS pipelines' modular, function-oriented structure, enabling users to efficiently scale up analyses. Regardless of the task's complexity, researchers will still need to overcome the learning curve of incorporating WMS into new and existing projects.
When a targets workflow has matured to the point of publication, users may want to consider the best method for sharing the pipeline. These decisions can affect both the accessibility and the durability of the pipeline for later use. For example, version control of the data files created by targets is risky due to their size (Landau, 2021b). Containerization is one available option for targets (Landau, 2021b) and other WMS (Leipzig, 2017). Researchers using publicly available datasets might also consider making the initial data download a workflow target in their pipeline to reduce the number of files shared alongside the pipeline itself. The tarchetypes package (Landau, 2021a) provides a helper function, tar_download(), which can simplify adding and tracking external downloads in a pipeline, as sketched below.
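
A hedged sketch of that approach, with a placeholder URL, path and target name:

```r
# In _targets.R; the tarchetypes package supplies tar_download()
library(targets)
library(tarchetypes)

list(
  # tar_download() returns a pair of targets: one watching the remote URL
  # and one downloading and tracking the local file (placeholder URL/path)
  tar_download(
    cru_airtemp_file,
    urls  = "https://example.org/cru_airtemp.nc",
    paths = "data/cru_airtemp.nc"
  )
  # Downstream targets can then depend on cru_airtemp_file as usual
)
```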
The above examples illustrate the asymmetrical nature of adopting new WMS on a large scale. Some research groups may be well situated for quick adoption and troubleshooting of common problems. Others will benefit from the sharing of knowledge and computational resources to adopt these types of workflows. Because WMS both undergo continuous development and are infrequently used in ecology and evolution, there is a need to develop (1) clear incentives for WMS implementation and (2) training materials that demystify their functionality. Without these two resources, WMS' broader usability is likely to remain limited. Shared materials and workshops could provide training for a larger number of learners and create the potential for newly engaged learners to lead future workshops. Once the investment in learning WMS is made, like any tool, its use will become easier over time, such that the skilled research team will be well poised to take full advantage of the increasing research opportunities afforded by new sources of high-quality data.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
R code for Figures 1 and 2 is available in a GitHub repository (https://github.com/mbrousil/eco_wms) and Zenodo archive (http://doi.org/10.5281/zenodo.7761195; Brousil, 2023). Information on the Web of Science search displayed in Figure 2 is included as Supplemental Information.