The nlrx r package: A next‐generation framework for reproducible NetLogo model analyses

Agent‐based models find wide application in all fields of science where large‐scale patterns emerge from properties of individuals. Due to increasing capacities of computing resources it was possible to improve the level of detail and structural realism of next‐generation models in recent years. However, this is at the expense of increased model complexity, which requires more efficient tools for model exploration, analysis and documentation that enable reproducibility, repeatability and parallelization. NetLogo is a widely used environment for agent‐based model development, but it does not provide sufficient built‐in tools for extensive model exploration, such as sensitivity analyses. One tool for controlling NetLogo externally is the r‐package RNetLogo. However, this package is not suited for efficient, reproducible research as it has stability and resource allocation issues, is not straightforward to be setup and used on high performance computing clusters and does not provide utilities, such as storing and exchanging metadata, in an easy way. We present the r‐package nlrx, which overcomes stability and resource allocation issues by running NetLogo simulations via dynamically created XML experiment files. Class objects make setting up experiments more convenient and helper functions provide many parameter exploration approaches, such as Latin Hypercube designs, Sobol sensitivity analyses or optimization approaches. Output is automatically collected in user‐friendly formats and can be post‐processed with provided utility functions. nlrx enables reproducibility by storing all relevant information and simulation output of experiments in one r object which can conveniently be archived and shared. We provide a detailed description of the nlrx package functions and the overall workflow. We also present a use case scenario using a NetLogo model, for which we performed a sensitivity analysis and a genetic algorithm optimization. The nlrx package is the first framework for documentation and application of reproducible NetLogo simulation model analysis.


| INTRODUC TI ON
Agent-based models are an increasingly popular method for understanding complex systems (DeAngelis & Grimm, 2014). They are developed and applied across many different research disciplines from natural sciences over archaeology to socio-economic sciences (e.g. Dislich et al., 2018;Vincenot, 2018). Agent-based models incorporate the heterogeneity of entities at the individual level in order to observe patterns emerging on broader scales (Grimm & Railsback, 2005). Due to the release from computing power constraints in recent years, nextgeneration agent-based models have evolved that are structurally realistic, powerful and detailed enough to address real-world problems (Cabral, Valente, & Hartig, 2017;Grimm & Berger, 2016). However, next-generation agent-based models require reproducability, repeatability and parallelization of model analyses. Access to scripts, runs and results of statistical analyses is a key criterion for reproducible research (Peng, 2011;Sandve, Nekrutenko, Taylor, & Hovig, 2013). Yet, incentives for sharing code of agent-based models and model analyses alongside publications are still lacking (Janssen, 2017).
A widely used programming language to develop agent-based models in ecological and socio-economic sciences is NetLogo (Abar, Theodoropoulos, Lemarinier, & O'Hare, 2017;Vincenot, 2018;Wilensky, 1999). NetLogo is a Java-based modelling environment that features a very comprehensible syntax, which allows for fast prototyping of agent-based models but also offers capabilities to formulate and implement complex agent-based models efficiently (Railsback et al., 2017). With rising complexity of NetLogo models, increasing efforts need to be spent on model analyses and documentation of such analyses. Model analysis is a crucial part of the modelling cycle, not only for understanding model processes but also during model refinement and development (Grimm & Railsback, 2005). Often sensitivity analyses are the central part of such refinements, because they give access to detailed information on the relative importance of model parameters for model outputs (Saltelli et al., 2008). Sensitivity analyses need to simulate thousands of different parameter value combinations to gain a better understanding of the modelled systems. Unfortunately, the capabilities of the built-in tool of NetLogo to run such analyses, the NetLogo BehaviorSpace, are limited. If more than one parameter is changed within one experiment, BehaviorSpace automatically creates a full-factorial parameter matrix including all possible combinations of parameter values. This might be generally suitable for simple models, but with rising model complexity more efficient ways of scanning the parameter space and a better control of the design of parameter matrices are required. Additionally, post-processing of NetLogo model output is mostly done within statistical software, such as r (the most prominent programming language for ecologists; Lai, Lortie, Muenchen, Yang, & Ma, 2019). Transfer of the output data from NetLogo into r may not be straightforward, especially when dealing with spatially-explicit model output. This is because spatial NetLogo output is mostly reported in nested lists which need to be parsed and split to convert the data into a usable format.
The r package RNetLogo currently is the only r package available to set up NetLogo model simulations directly from r (Thiele, Kurth, & Grimm, 2012, 2014. However, the package establishes an interactive connection between the r session and the Java virtual machine that runs the NetLogo model application, which has two main drawbacks: (a) RNetLogo establishes the interactive connection via the rJava package, which may be cumbersome to set up, especially on remote systems, and requires access to system-wide environmental variables. (b) Resource allocation of rJava and RNetLogo may be problematic when running large jobs with many simulations. Memory may not be cleared efficiently between runs and Java virtual machines may freeze due to memory constraints. Furthermore, RNetLogo does not provide a convenient workflow to set up experiments with minimal coding, utility functions to generate simulations, or features to enable reproducible research (see a full comparison in Table 1).
We have, therefore, developed the r package nlrx to provide a flexible framework for self-contained and reproducible analysis of NetLogo models from r, while also delivering performance gains (see Figure 1 and Supplementary Material S1). The nlrx framework serves four main needs for next-generation NetLogo modellers: (a) it provides an interface between r and NetLogo. Simulations are completely defined and stored within r objects. This allows the application of a huge toolbox of statistical methods to create simulation experiments and corresponding parameter input matrices. The nlrx package utilizes the controlling API of NetLogo's BehaviorSpace to set up and run experiments. Thus, in contrast to RNetLogo, the connection between r and NetLogo is not interactive and does not rely on the rJava package. (b) it enables reproducible research (Sandve et al., 2013) by storing a simulation experiment in one single r object (including parameter input matrix, applied simulation method and simulation results) and allows easy sharing of code and results among colleagues, with students and for publication. (c) nlrx enables utilization of high performance computing clusters (HPCs) in a very convenient way (see Supplementary Material S2). (d) nlrx provides post-processing analysis functions of NetLogo model output for several applications, such as sensitivity analyses and calculation of descriptive statistics. nlrx also enables gathering spatial output from models, such as coordinates and properties of individuals and patches and provides functions to convert the output into spatial r objects (raster and vector data).

| THE r PACK AG E n l r x
The nlrx package utilizes the NetLogo BehaviorSpace interface by calling NetLogo from the command line via a bash script (*.sh, linux) or a batch file (*.bat, windows). By default, NetLogo BehaviorSpace experiments are stored in XML format within the *.nlogo model file.
However, when NetLogo is started in headless mode from command line, it is also possible to commit XML BehaviorSpace experiment files to setup and run experiments. The nlrx package creates such XML BehaviorSpace experiment files dynamically in order to run simulation experiments from r.

| nlrx workflow
The typical workflow of the nlrx package starts with creation of an nl object, which is an S4 class object that stores basic information on NetLogo (see Figure 2, and Table 2). The nl object contains information on the NetLogo version (nlversion), the path to the NetLogo directory (nlpath), the path to the NetLogo model file (modelpath) and the amount of reserved memory for each Java virtual machine (jvmmem).
Next, an experiment needs to be created and attached to the nl object (see Figure 2, and Table 3). The experiment object stores all information that would be entered into a typical BehaviorSpace experiment (see Table 3). The experiment object contains the name of the experiment (expname), the directory where output is written to (outpath), the number of repeated runs within BehaviorSpace (repetition), a logical variable whether output should be measured on each tick or on the final tick only (tickmetrics), names of setup and go procedures (idsetup, idgo), the maximum number of ticks that should be simulated (runtime, 0 = infinite), a vector of ticks for which output is reported (evalticks), a stop condition (stopcond), output metrics (metrics, metrics.turtles, metrics.patches, metrics.links), variable parameters and corresponding value ranges (variables), and constant parameters (constants).
Finally, a simdesign needs to be attached to the nl object (see Figure 2, and Table 4). The nlrx package provides many helper functions to create pre-defined simdesigns for simple parameter screenings (simple, distinct, full-factorial), sensitivity analyses [Morris (Morris, 1991), Sobol (Sobol, 1990), eFast (Saltelli, Tarantola, & Chan, 1999)] or parameter optimizations [simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), genetic algorithm (Holland, 1992)]. These helper functions take information from the experiment (variables, constants) to create a simdesign object that contains the name of the simulation design method (simmethod), a vector of random seeds (simseeds), the generated parameter matrix (siminput) and an empty tibble slot to attach output results after simulations have been finished (simoutput). The siminput and simoutput slots store tibble data, a Possibility to perform model analyses in a non-interactive, reproducible way. Means to control for random number generator seeds and exchange of model files. --✓

Spatial integration of agent metrics
Coerce NetLogo agent metrics to geospatial data objects (vector and raster data). --✓

Tidy data format
Data format in which each observation is a unique cell and each variable a unique column (Wickham, 2014). --✓

Utility functionality
Functionality to perform pre-and post-processing of simulation data.
--✓ F I G U R E 1 Comparison of execution time (top panel) and memory usage (bottom panel) of nlrx (blue) and RNetLogo (red). Benchmarks were calculated on different machines with various processor types, available memory and operating systems (details see Supplementary Material S1, boxplots show variation between systems). Both packages were used to execute NetLogo model simulations with the Wolf Sheep Model from the NetLogo Models Library. With each package, we simulated eight replications of a Latin-Hypercube Sampling design with 100 samples and a run time of 500 ticks of each simulation. Available memory was measured before starting simulations (Pre-simulation) and after simulations were completed (Post-simulation). A third measurement of available memory was taken after manually executing the r garbage collection function gc (Post-gc) modern representation of rectangular data in r (Müller & Wickham, 2018). The sensitivity analyses and optimization simdesign helper functions also create a simulation object (simobject) that contains further information needed for post-processing, such as calculation of sensitivity indices. The optimization algorithms do not generate a parameter matrix in advance, because the parameterization of later runs depends on the results of previous runs.
It is also possible to write user-defined helper functions to create simdesign objects. After initialization of the nl object, simulations can be run by calling one of the run _ nl _ *() functions that come with nlrx. For simdesigns that generate a parameter matrix, run _ nl _ one() can be used to execute one specific row of the simdesign parameter matrix siminput by specifying the row ID and a model seed.
This can be useful for testing specific parameterizations. The function run _ nl _ all() executes all simulations from the parameter matrix siminput across all seeds. run _ nl _ all() uses the future _ map _ dfr() function (furrr package) to loop over random seeds and rows of the siminput parameter matrix (see Figure 2).
This allows users to run the function sequentially or in parallel, on local machines and remote HPC machines. Parallelization options can be easily adjusted by using different future plans before calling the function (for more details see documentation of r package future). For optimization simdesigns that do not come with a pregenerated parameter input matrix, run _ nl _ dyn() can be used to execute model simulations with dynamically generated parameter settings.
The connection between r and NetLogo is realized within these three run _ nl _ *() functions. First, a BehaviorSpace experiment XML file is created from the information that is stored within the provided nl object. This XML file contains the random seed, parameter settings and experiment settings. Next, depending on the operating system, a bash script or batch file is generated that is used to execute NetLogo via the command line in headless mode. The file contains information that is stored within the nl object (such as NetLogo path, model path, Java virtual machine memory and the F I G U R E 2 Workflow of the nlrx package. The nl object stores all the information needed to run NetLogo models. The experiment object, which contains all model-specific information, such as names of setup and go procedures, runtime, variable range and metrics and the simdesign object, which contains all simulation-related information, such as input parameter matrix, simulation method and simulation output, are nested within this nl object.

| Further functionality
nlrx offers a function unnest _ simoutput() to turn agent-specific metrics that were collected during the simulation into a wide-format table that splits every type of agent, patch and link in a unique column and nests all measured variables in this column. To be able to derive this information, the nl object offers slots to specify agent, patch and link metrics. The unnested spatial data can easily be visualized to illustrate model behaviour ( Figure 3). Furthermore, the functions nl _ to _ points(), nl _ to _ raster() and nl _ to _ graph() coerce the nl object directly into a spatial data type (e.g. spatial points and raster data for agents and patches, as well as undirected network graphs for links). To be able to use nlrx as a fully reproducible framework, the functions export _ nl() and import _ nl() Valid NetLogo reporters that are used to collect output data metrics.turtles list("turtles"=c("who","pxcor","pyco r","color")) A list with named vectors of strings defining valid turtles-own variables that are taken as output measurements metrics.patches c("pxcor","pycor","pcolor") A vector of strings defining valid patches-own variables that are taken as output measurements metrics.links list("links"=c("end1","end2")) A list with named vectors of strings defining valid links-own variables that are taken as output measurements variables list("variable. parameter.1"=list(min=50, max=150, step=10, qfun="qunif")) A list with NetLogo globals that should be varied in a simdesign constants list("constant.parameter.1"=4, "constant.parameter.2"=0.1) A list with NetLogo globals that should stay constant in a simdesign use of agent-based NetLogo models in their research while permitting a workflow that follows modern scientific standards.

| US E C A S E: ANTS MODEL
This section describes two example use cases of nlrx using the Ants

| Sensitivity analysis with nlrx
In order to demonstrate the usability and workflow of the nlrx package, we performed a global sensitivity analysis of the Ants model (Sobol, 1990). We varied the three model parameters population, diffusion-rate and evaporation-rate by applying a Sobol parameter variation approach. As the main output, we measured the simulation time (ticks) until all food sources were completely depleted by the ants.
To set up the sensitivity analysis with nlrx, we first defined an nl object (Listing 1).  In the next step, we attached an experiment to the previously generated nl object (Listing 2). We defined that output should be measured only on the final tick (tickmetrics = "false"). We set the runtime to infinite (runtime, 0 = infinite) and defined a stop condition to stop model execution once no food cell is left in the model landscape (stopcond). As output metrics we measured the number of ticks simulated when the model run finishes (metrics).
To construct an input parameter matrix from the defined exper- This nl object stores all information needed to run the sensitivity analysis. We can execute all simulations by calling run _ nl _ all() (Listing 4). Because in our experiment we set tickmetrics to "false", the output will only be reported for the final simulation step.
In order to execute the analyze _ nl() function, the results tibble first needs to be attached to the nl object (Listing 5).
The Sobol sensitivity analysis revealed a very large main effect of the population parameter on the effectiveness of food collection (measured as number of ticks until all food sources are depleted) by the ant colony (see Figure 4). The more ants were present, the faster the colony depleted all food sources. The two chemical-related parameters evaporation rate and diffusion rate showed only small main effects (see Figure 4). However, we found some interaction effects between population and evaporation rate, albeit at a very low level. There were no interactions between population and diffusion rate and between diffusion rate and evaporation rate.
After running all simulations with nlrx, we store the simulation output as a *.csv file using the write _ simoutput() function (see Listing 6). Additionally, we store the nl object, together with underlying model files as a zip folder using the export _ nl() function.

| Genetic algorithm with nlrx
In order to demonstrate the optimization functions of nlrx, we ap- For this analysis, we do not need to make any changes to the nl object and experiment that we used in the Sobol sensitivity analysis example, thus we can reuse the nl object and just add a different simdesign. In order to set up a genetic algorithm optimization simdesign, we used the simdesign helper function simdesign_GenAlg() (Listing 7). We defined the number of initial parameterizations (pop-Size = 200), the number of iterations (iters = 100), the index of a defined output metric (see metrics slot in experiment) that is used as evaluation criterion (evalcrit), the elitism rate (elitism), the mutation probability (mutationChance) and the number of random seeds for replications of the algorithm (nseeds).
In contrast to sensitivity analysis simdesigns, for optimization simdesigns there is no parameter input matrix set up in advance. Instead, random starting values are chosen within the defined parameter ranges. The parameterization of the next simulation run is dynamically created, depending on the output of the previous simulation. Thus, for optimization simdesigns, the nlrx function run _ nl _ dyn() needs to be used to execute the simulations (instead of run _ nl _ all() which iterates over a pre-generated parameter matrix) (Listing 8).
The results from the genetic algorithm simdesign are reported as a list which contains information of parameterizations and the evaluation criterion of each population of the genetic algorithm. We can also identify the best found parameterization, i.e. the parameterization that resulted in the smallest fitness value.
The genetic algorithm was able to decrease the time to deplete all food sources from more than 3,000 ticks to below 500 ticks over 100 iterations (see Figure 5, upper left). The parameterization of the best found solution had a moderately large population (166), a medium diffusion rate (18.28), and a low evaporation rate (8.41) (see Figure 5, upper right).
To evaluate the stability of the optimization result under different random seeds, we simulated an additional 10 replicates (each with a different random seed) of the best parameterization. In order to run a specific parameterization, the nlrx package provides the simdesign helper function simdesign_simple(), which creates a parameter table for one single simulation that uses only defined values of constant parameters (constants slot of experiment). We removed all variable parameters from the experiment and defined those parameters as constants instead (Listing 9). We also added counts of food sources as output metrics to the metrics slot of the experiment.
F I G U R E 5 A genetic algorithm was applied to minimize simulation time until all food sources were depleted. The figure illustrates the minimization process over iterations of the algorithm (upper left) and the best found parameterization in terms of minimum simulation time (upper right). This best found parameterization was used to run the Ants model with 10 replicates and measuring the percentage of food gathered at each tick (lower left, colours indicate individual simulations). We also aggregated these 10 runs to measure at which time steps the three individual food source clusters were completely depleted (lower right). Dashed lines illustrate final time step of each curve Finally, we used the simdesign_simple() function to attach a simdesign containing a siminput matrix with only one parameterization and a simseeds vector with 10 random seeds (nseeds=10).
Again, we used the run _ nl _ all() function to execute the simulations and calculated the percentage of food gathered at each tick and identified the tick at which specific food source clusters were depleted completely. We observed some variation in total simulation time between these replicates, but all runs lay between 500 and 800 ticks (see Figure 5, lower left). While the first two food sources were depleted very fast in all replicates, depletion times were more variable for the third food source, which had the largest distance from the ant nest (see Figure 5, lower right).

| Conclusion and outlook
Next-generation agent-based models must be reproducible and re- widely accepted and applied within the field of agent-based modelling (Grimm et al., , 2006(Grimm et al., , 2010Müller et al., 2013). However, given the rising importance of reproducible research as a standard in academia, standards for documentation and reproducibility of applied model analysis and corresponding code are still lacking. nlrx is a first step to provide a framework for documentation and application of reproducible NetLogo simulation model analysis. While nlrx mainly provides tools that enable reproducible research, future work might also give recommendations on a best practice approach that also incorporates data management, collaboration and version control.