miaSim: an R/Bioconductor package to easily simulate microbial community dynamics

Microbiomes never stop changing. Their compositions and functions are shaped by the complex interplay of intrinsic and extrinsic drivers, such as growth and migration rates, species interactions, available nutrients and environmental conditions. Mathematical models help us make sense of these complex drivers and intuitively explain how, why and when specific microbiome states are reached while others are not. To make simulations of microbiome dynamics intuitive and accessible, we present miaSim. miaSim provides users with a wide range of possibilities to match their specific assumptions and scenarios, starting from a core implementation of four widely used models (namely the stochastic logistic model, MacArthur's consumer‐resource model, Hubbell's neutral model and the generalized Lotka‐Volterra model) and several of their derivations. The diverse model implementations share the same data structures and, whenever possible, share state variables, which significantly facilitates cross‐model combinations and comparisons. We combined and simulated some published examples of microbiome models in miaSim and performed cross‐model comparisons and tested diverse model assumptions. Our examples illustrate the reliability, robustness and user‐friendliness of the package. In addition, miaSim is accompanied by miaSimShiny, which allows users to explore the parameter space of their models in real‐time in an intuitive graphical interface. miaSim is fully compatible with the ‘miaverse’, an R/Bioconductor framework for microbiome data science, allowing users to combine and compare model simulations with microbiome datasets. The stable version of miaSim is available through Bioconductor 3.15, and the version for future development is available at https://github.com/microbiome/miaSim. miaSimShiny is available at https://github.com/gaoyu19920914/miaSimShiny. We anticipate that miaSim will significantly facilitate the task of simulating microbiome dynamics, highlighting the role of ecological simulations as important tools in microbiome data science.


| INTRODUC TI ON
Microorganisms are key players in the biosphere and greatly impact human life, environment and society (Arumugam et al., 2011;Logan & Rabaey, 2012;Natarajan & Bhatt, 2020;Patowary et al., 2016;Toju et al., 2018). The rapid development of sequencing and computing technologies contributed to an increase in our understanding of the structure and function of microbiomes in diverse environments, including laboratory incubations, human body sites, activated sludge in wastewater treatment plants, and river beds (Huttenhower et al., 2012;Khazaei et al., 2020;Li et al., 2020;Peces et al., 2022).
However, microbiomes are inherently complex and dynamic systems whose trajectories depend on many interactions between microbes and their environments (Kehe et al., 2021;Palmer & Foster, 2022).
Mathematical models help us make sense of these complex systems, by providing a simplified explanatory framework for their underlying dynamics (Bengtsson-Palme, 2020; Coenen et al., 2020;Song et al., 2014). Over the years, a variety of microbiome models have been proposed that account for particular characteristics of the system (briefly summarized in Table 1). Of note, here we focus on the 'well-mixed' models. For microbiome models that account for spatial structure, see the works of Bauer et al. (2017), Dukovski et al. (2021, Goldschmidt et al. (2021) and Hellweger et al. (2016).
With the help of these diverse models, a number of ecological insights have been obtained by fitting and simulating experimental data. For example, macroecological laws of variation and diversity (Gamma abundance fluctuation distribution, Taylor's law and the mean abundance distribution of different species in microbial communities) were reproduced with a stochastic logistic model (Grilli, 2020), while the dynamics of bacteria that inhibit Clostridium difficile infection in the human gut microbiome were successfully predicted with a generalized Lotka-Volterra model (Buffie et al., 2015), and properties of large-scale microbiome data sets, such as the relationship between richness and environmental harshness, as well as abundance distributions, were reproduced with a consumer-resource model . Recent simulation studies have also explored the effect of ecological memory on microbial communities (Khalighi et al., 2022).
Modelling formalisms focus on different factors shaping microbial community dynamics, but the patterns that these models intend to explain generally reflect diverse eco-evolutionary processes. In some cases, combining and comparing different models is helpful for understanding community dynamics (Greenblum et al., 2013) or for investigating the possible origin of particular properties, for example memory in time-series data . Furthermore, different model formalisms imply different generative processes. For example, simulations of generalized Lotka-Volterra models with random interaction matrices were used to illustrate that communities with numerous interacting species or communities with a few strongly cooperative species tend to be unstable (Coyte et al., 2015;May, 1972), while a consumer-resource model revealed that large communities may be stable when resources are explicitly considered (Butler & O'Dwyer, 2018). These conflicting observations emerge from the simulation of community dynamics with various formalisms, illustrating the potential benefit of interpreting microbiome data in the light of diverse models. Currently, cross-model comparison is limited since the different published models depend on knowledge of their specific interfaces and programming languages.
To fill in this gap, we developed miaSim, an R/Bioconductor package that facilitates the task of modelling and simulation of microbiome time series. Here, we provide detailed explanations of miaSim's models, functions, parameters, and data structures. Moreover, we present case studies using miaSim and its R-Shiny extension miaSim-Shiny, which illustrate how these tools allow researchers to computationally explore a variety of questions in microbial ecology.

| An overview of the package
The workflow of miaSim is described in Figure 1. It consists of preprocessing, simulating and postprocessing steps. Compatibility between different models is facilitated by sharing common names, parameters and data structures. The common data structures also facilitate the downstream comparison and visualization of simulations.
A minimal number of user inputs are required to start simulating the models (Figure 1). Most of the parameters have default values. For instance, to generate a simulation of the generalized Lotka-Volterra model, we only need to specify the number of species (using the ExampleGLV <-simulateGLV(n_species = 4) command). miaSim will automatically generate a random interaction matrix and simulate the model. The time series may then be visualized with downstream tools in R/Bioconductor, for instance using miaViz (mia-Viz::plotSeries(ExampleGLV)). While most models require only the number of species or resources to be specified, the user has control over all the parameters listed in Table 2 (Şimşek et al., 2023). Pre-processing functions with specific controlling variables (e.g. generator of inter-species interaction matrices based on different types, ratios and strengths of interactions) and post-processing functions with different objectives (e.g. converting function and other utilities) are also written separately Most models return the simulation as a TreeSummarized Experiment data object (Huang et al., 2021) containing several elements, including the simulation results, initial states and parameters ( Figure 2). As detailed in the online tutorial, there are also functions available to extract desired time points from a simulation or to combine results across multiple simulations.
Furthermore, there are cases where a large number of repeated simulations are needed. For example, when performing parameter sweeps, miaSim's execution time may be improved by using one of R's parallel looping constructs to execute independent simulations in parallel (see Supporting Information File S1). Execution times are only bottlenecks in simulations with a very large number of species, as illustrated by the execution times of miaSim's Hubbell and GLV models compared to a MATLAB implementation (Supporting Information File S1). In general, the shared data con-

| Logistic model
The logistic model describes the growth of populations without interaction terms. We based the model on the Verhulst equation (Bacaër, 2011), adding terms for migration rates m i and stochastic noise ( ): where N i , r i , and K i are, respectively, the density, specific growth rate and carrying capacity of the ith population. The user can also simulate a logistic model that contains a nonrenewable resource by specifying death rates. In this case, similar to common setups in experimental microbiology, the community will behave like a batch culture where all species die when resources run out.

| Generalized Lotka-Volterra model
The Generalized Lotka-Volterra (GLV) model describes the growth of interacting populations. Interactions are encoded by adding a matrix (A) with positive and negative interaction coefficients. The randomA function generates random interaction matrices with a variety of possible distributions and assumptions, including the choice for a defined proportion of interaction types (e.g. competition, mutualism, commensalism, parasitism and amensalism), random or power-law distributions for the interaction strengths, and the possibility to structure the matrix into subgroups of strongly interacting species. Following the formulation of Fisher and Mehta (Fisher & Mehta, 2014), we define the GLV as where a ij is the interaction coefficient of the jth species on the ith species. By default, the diagonal values of A are fixed to − 0.5, indicating intraspecific competition for habitat, nutrients, etc. A discrete version is also available based on the Ricker model as described by Fisher and Mehta (Fisher & Mehta, 2014).
TA B L E 1 Microbiome models. Exponential growth and logistic growth models describe population dynamics, based on simple phenomenological parameters such as growth rates and carrying capacities. Models based on the generalized Lotka-Volterra model go further by explicitly considering species interactions (Lotka, 1925;May, 1972;Volterra, 1926). Kinetic models generally use differential equations to explicitly describe microorganism growth in terms of metabolite consumption and production (D'hoe et al., 2018;Monod, 1949) and changes in environmental conditions such as pH (Kettle et al., 2018). Consumer-resource models are particular examples of kinetic models that use simplified assumptions for resource dynamics (Chesson, 1990). Hubbell's unified neutral theory of biodiversity describes the dynamics at the per capita level where individuals have equal rates of birth, death, speciation and migration (Hubbell, 2001). Hubbell's neutral model is equivalent to the hierarchical Dirichlet process (HDP) when population sizes are large (Harris et al., 2017). Other important models such as the Genome-scale metabolic models and agent-based models are beyond the scope of this study.

Model formalisms Features Implementations
Exponential growth model Growth of non-interacting single species growthrates (Petzoldt, 2020) Stochastic logistic model Growth with maximum capacity/rate limit optional stochasticity growthcurver (Sprouffske & Wagner, 2016), stochastic logistic model (Descheemaeker & de Buyl, 2020) Generalized Lotka-Volterra model ( The resource stoichiometries are defined by a matrix E, which can be generated using the function randomE. This function allows users to control several parameters, such as the number of metabolites to consume or produce, the ratio between maintenance and consumption, and trophic preferences for species or subgroups of species (e.g. Wang et al., 2019).
The following differential equation defines the growth rates of where r i is the maximum growth rate of species i, while N i is its abundance at time t. e i,j is the yield of substrate j for species i, f i,j is the feeding form, which constrains the quantity of substrate that can be consumed at each time step, and is a dilution term.
The function uses the Monod equation as the feeding form where k i,j is the Monod constant for species i of substrate j and S j is the concentration of substrate j at time t.
Resources change according to the following differential equation: F I G U R E 1 Flowchart of modelling in miaSim. Arrows are possible flows of data, and blocks are available data/models/modules. (a) To start using miaSim, essential variables should be defined (either the number of species or resources and the total time to simulate), while optional variables (e.g. a list of metabolites and their quantities for consumer-resource model, a matrix describing interspecies interactions for generalized Lotka-Volterra model, etc.) will be generated if not given. Coloured blocks near each input indicate their applicability to different models in the next step. (b) A model should be selected among all available models (including the logistic model, the generalized Lotka-Volterra model, the consumer-resource model, Hubbell's neutral model, the Ricker model and the self-organized instability model). (c) The results containing simulation data from different models will be stored in an identical format for one single simulation or a series of different simulations. (d) Finally, data are returned in the TreeSummarizedExperiment format, which can be visualized using miaViz or miaSimShiny. In addition, with the evaluation of modelling results, some parameters can be adjusted, generating a new single simulation or a series of multiple simulations using alternative models. A series of multiple simulations can be generated by a single call to the "generateSimulations" function.
where is a dilution term, S j0 is the concentration of substrate j in the feed, and p i,j is the yield of the production of substrate j that results from the growth of species i.
In practice, a species either consumes or produces a metabolite (or is indifferent to its presence) allowing us to summarize e i,j and p i,j in a single matrix (E, containing the same number of rows as species in the community and the same number of columns as the number of metabolites that are used and produced). To distinguish them, the entries of E are, respectively, positive and negative for the consumed and produced substrates.

| Hubbell neutral model
Hubbell's neutral model treats individuals as functionally indistinguishable units. Community dynamics are simulated by sampling from per capita probabilities of birth, death and migration (Hubbell, 2001). Our implementation is optimized by a chemical stochastic simulation algorithm (Gillespie, 1976), where multiple events may be simulated at a single time point (Cai & Xu, 2007). In the model, the total number of individuals is fixed, so the number of new births or immigrations sum up to the number of deaths during simulation steps. Additionally, users may define species-specific growth rates, which result in nonneutral dynamics but retain the demographic noise derived from using a per capita-based approach (Constable et al., 2016).

| Self-organized instability model
The self-organized instability (SOI) model combines the species interactions of the generalized Lotka-Volterra model with the immigration and extinction properties of Hubbell's neutral model (Solé et al., 2002). At a particular time step in the dynamics of the community, the following events occur in order: (i) immigration.
The colonization of an empty site by species i according to the probability of immigration i . (ii) death. A site occupied by species TA B L E 2 Important input parameters (parameter types corresponding to the classification of inputs in Figure 1a (Cai & Xu, 2007;Gillespie, 1976).

| C A S E S TUD IE S
Next, we illustrate the utility of miaSim, by qualitatively reproducing

| Origin of different community types with strongly interacting species
Different human gut microbial community types (also known as "enterotypes") were reported in healthy individuals (Arumugam et al., 2011). Their correlations with dietary habits and health status motivated the investigation of their differences in taxonomic, functional, and ecological properties (Costea et al., 2018). However, the origin of these different community types is still under debate.
Multistability, host heterogeneity, and interspecific interactions could all be alternative explanations for the genesis of enterotypes F I G U R E 2 Data structures of the outputs in miaSim. (a) A result of the consumer-resource model simulation is a list acquired by running the simulateConsumerResource function once and contains a matrix of community compositions at different time points in the simulation, initial abundances, probabilities of dispersal from the metacommunity, growth rates and the magnitude of the measurement error in the model. (b) A list of multiple simulations generated using a series of inputs by running generateSimulations function. (c) A summary data frame of multiple simulations acquired from a series of simulating results by applying the getCommunity function to a list of multiple simulations. In this data frame, each row represents one independent simulation. (Falony et al., 2016;Gonze et al., 2017). In a series of simulations using the generalized Lotka-Volterra model, Gibson et al. showed that different community types could emerge by combining subsampling from a metacommunity with a heterogeneous distribution of species interaction strength, and suggested that the presence of species that interact strongly with many other species, referred to as strongly interacting species (SIS), could give rise to different stable community types (Gibson et al., 2016).
Here we reproduced their most important simulations using mia-Sim. Local communities with different species interaction strengths were generated using the "randomA" function with a specific custom "interactions" argument. As performed in the original study, we generated interaction matrices by drawing random values from a power law probability distribution P(α). Different values of α {7, 3, 2, 1.6, 1.01} were explored. As shown in Figure 3a, decreasing α leads to the formation of unevenly distributed species interaction strengths.
As a result, the interactions between species changed from approximately even and weak interactions to stronger interactions centered on a few species, leading to the formation of SIS (Figure 3b). Five types of local communities were assembled separately using the five different values of α. For each type, 80 species were randomly drawn from their local species pool to form 100 random initial communities. Then, they were set to the same initial abundances and simulated with the generalized Lotka-Volterra model with their corresponding interaction subnetwork (where only the 80 local species are included). Dimensionality reduction of the final states of these F I G U R E 3 Impact of heterogeneity of interaction-strength on the distinctness of community types. This figure is a recomputation of the model from (Gibson et al., 2016) using miaSim but with different random seeds and specific numerical values. With different interaction strength heterogeneities, we generated a series of inter-species interaction matrices, and subsequently, simulated the corresponding community changes using the generalized Lotka-Volterra model. (a) Histograms of the interaction heterogeneity matrix H with different α (7, 3, 2, 1.6, 1.01) in a power-law distribution. The highest H display an increasing value with α approaching 1 (b) Randomly generated networks of interspecies interactions. Each network contains 100 nodes, indicating 100 species in the local species pool, from which 80 species were chosen randomly 100 times to form 100 initial communities in every interaction strength pattern. From light yellow to dark blue, the edge colours indicate different strengths of interactions between species. (c) Principal coordinate analyses (PCoA) of the final state of communities using the Jensen-Shannon distance. In every plot, each point represents a final community, and number of clusters is determined by the Silhouette Index (SI) acquired using Partitioning-Around-Medoid (PAM) algorithm. SI represents the goodness of clustering. It increases with the decrease of α, implying that the Strongly Interacting Species (SIS) in the last interaction patterns promotes the formation of different community types. simulated communities displayed a gradient of separation into enterotypes ( Figure 3c).
In addition to reproducing the results from (Gibson et al., 2016) in miaSim, we explored the simulation results from miaSim, which showed that the two clusters seen in the principal coordinate analysis (PCoA) for simulations with the last interaction network were separated by the presence/absence of the SIS (Supporting Information Figure S1), which illustrates the mechanisms by which the SIS could contribute to the formation of community types.

| A nutrient concentration threshold in microbial community diversity
An important challenge of microbial ecology is to unravel the relative contributions of abiotic and biotic factors to community diversity (New & Brito, 2020). In different environments (e.g. marine, soil, fresh water, air, human gut, etc.) microbial community diversity is driven by different factors, including the availability of nutrients Crowther et al., 2019;Oliphant et al., 2019;Palmer & Ruhi, 2019;Shade et al., 2013). We used miaSim to explore the hypothetical relationship between nutrient concentration and community diversity, mimicking a gradient from poor (oligotrophic) to nutrient-rich (eutrophic) environments. We explored the hypothesis suggested by Marsland et al. that microbiomes exhibit a threshold for nutrient concentration in the relationship between community diversity and the availability of nutrients or, more generally, "energy fluxes" (Marsland III et al., 2019). Below this threshold, according to Marsland et al, microbiome diversity is externally modulated by the total influx of nutrients, and once the threshold is exceeded, the diversity is controlled by ecological interactions (Marsland III et al., 2019).
To test this hypothesis, we used miaSim to simulate consumerresource models with combinations of 10 initial communities, 10 nutrient types and 6 quantities of nutrients. This simulation does not specifically replicate Marsland III et al.'s (2019) simulation, but only broadly tests the same nutrient threshold hypothesis. In our simulations, each community contained at most 10 species, and to control the effect of the initial composition, the beta-diversity between the first initial community and other randomly generated initial communities were chosen to be in an increasing gradient (the Bray-Curtis dissimilarity index increased from 0.05 to 0.85); the nutrient compositions, each containing five varieties of nutrients, were also structured into a gradient of increasing dissimilarities; and the quantities of nutrients were set as a geometric progression from 1 to 10 5 with a common ratio of 10.
Plotting the results of simulations into a uniform manifold approximation and Projection (UMAP) (Supporting Information Figure S2) or PCoA (Supporting Information Figure S3) (UMAP is recommended for microbiome data; Armstrong et al., 2021), we can observe that an increase in the concentration of nutrients (from 1 to 10 2 ) leads to an increase in the community beta diversity, that is communities spread more in the UMAP projection, but beyond 10 2 the clusters do not change (Supporting Information Figures S4 and S5). To illustrate this trend, we plotted the average distances between simulated communities in the UMAP and PCoA projections against the nutrient concentrations (Supporting Information Figures S6 and S7). In our simulations, the nutrient saturation threshold was near 10 2 .
Next, we focus on the clustering patterns observed in the UMAP projections. Similar to the effect that SIS have on the formation of clusters from metacommunities simulated with the GLV model (Figure 3), both nutrient concentration and diversity lead to the formation of enterotypes in metacommunities simulated with the consumer-resource model (Figure 4). Clusters are frequently observed when there is an increase in resource concentration in different growing environments (cultivating medium; i.e. communities 1 and 2; communities 3, 4 and 7; communities 5, 8 and 9; communities 6 and 10 in Figure 4). This suggests that, according to the consumer-resource model, enterotypes are expected if local communities encounter an environment where resources are abundant, similar to the environment that microbes encounter in the human gut. It is also, to our knowledge, the first time that enterotypes emerge in consumer-resource model simulations that mimic a move from a resource-rich to a resource-poor environment.

| Nutrient complexity impacts the composition and diversity of the microbiome
In addition to the total concentration of nutrients, another factor that could externally modulate the composition of the microbiome is the number of different types of resources that microbes encounter in their environment (Madigan et al., 2018). In contrast to the previous section, where we explored nutrient concentration, here we fixed the total concentrations and varied the number of resource types.
Qualitatively following the simulations from a previous study , we investigated the impacts of environmental complexity (numbers of available nutrients) for different community sizes.
First, we used a consumer-resource model in miaSim that follows Similarly to , we also observed a nonadditive relationship between environment complexity and growth yields ( Figure 5). Next, we explored a wider range of parameters than the ones used by  to better understand the key factors that impact the composition and diversity of the microbiome when increasing the number of alternative carbon sources.
First, we simulated a wider range of niche overlaps (ρ) by extending the average fraction of resources usable by each species (θ) to 0.25, 0.1, and 0.05. Interestingly, we found that in larger communities, the growth yield increased with the number of resources only when θ is relatively small, and the increase is negatively correlated with the value of θ (Supporting Information Figure S8). When θ is large, then the niche overlap is large, too, and most species can simultaneously use most of the resources; in contrast, when θ is relatively small, the resources are only available for their specialists and could be left unused in smaller communities.
To further explore the impact of θ on the diversity of microbial communities, we measured the average Euclidean distance between different initial communities in UMAP and found that larger communities (with 13 species) are less likely to be influenced by increasing the number of resources (Supporting Information Figure S9). The increase in the number of alternative carbon sources also leads to a decrease in the beta diversity, as the average amount of each resource is decreased, causing their consumption and growth to be more evenly distributed between populations of different species.

F I G U R E 4
Uniform Manifold Approximation and Projection (UMAP) of simulated communities in different media (both components and concentrations). In each row, the relative composition of nutrients in the medium is the same, but from left to right, the quantities of resources for microbial communities increase in a geometric sequence. In each subfigure, different symbols represent their different initial community compositions. Each subfigure in the last column presents a common clustering result: four community types (communities 1 and 2; communities 3, 4 and 7; communities 5, 8 and 9; communities 6 and 10) are consistently observed in different environments.

| DISCUSS ION
Unified theories or general concepts are important pursuits in community ecology. A number of patterns initially found in macroecology also apply in microbial ecology, including the species-area relationship, species-abundance distribution, distance-dissimilarity correlation, etc. (Lennon & Locey, 2017). Due to the challenges in handling and analysing large amounts of data acquired from microbiomes, much effort is devoted to reducing the gaps between ecological theories, observed patterns, available tools and accumulating data (Mascarenhas et al., 2020).
For example, the conceptual synthesis of different ecological processes (selection, dispersal, drift and speciation) in community ecology (Vellend, 2010) inspired the development of tools to quantify these processes (Stegen et al., 2013); the discussion of stochasticity in microbial ecology (Zhou & Ning, 2017) paved the way towards quantifying ecological stochasticity using null model algorithms (Ning et al., 2019). We carried out this work F I G U R E 5 Changes in community growth yield when more alternative carbon sources are added. This figure is a recomputation of the model from  using miaSim with a different random seed and slight model differences from the original publication, but qualitatively similar. Box plots contain the distribution of growth yields at the end of simulations of a complex community with 13 species (left) and simple communities with 3 species (middle) and 4 species (right) growing at the same concentration but different types of carbon sources. The central lines indicate the median values, and the top and bottom box edges are the 25th and 75th percentiles, respectively.

ACK N O WLE D G E M ENTS
We thank Didier Gonze for providing the MATLAB implementations of the GLV and Hubbell models that were compared to miaSim (Supporting Information File S1).

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no conflict of interest.

PE E R R E V I E W
The peer review history for this article is available at https:// www.webof scien ce.com/api/gatew ay/wos/peer-review/10.1111/ 2041-210X.14129.

DATA AVA I L A B I L I T Y S TAT E M E N T
The version of miaSim referenced by the manuscript as well as the code used to generate the simulations are available and archived in Zenodo (Şimşek et al., 2023), which includes the vignettes to replicate the case studies described in Section 3.