Ecological dynamics imposes fundamental challenges in community‐based microbial source tracking

Abstract Quantifying the contributions of possible environmental sources (“sources”) to a specific microbial community (“sink”) is a classical problem in microbiology known as microbial source tracking (MST). Solving the MST problem will not only help us understand how microbial communities were formed, but also have far‐reaching applications in pollution control, public health, and forensics. MST methods generally fall into two categories: target‐based methods (focusing on the detection of source‐specific indicator species or chemicals); and community‐based methods (using community structure to measure similarity between sink samples and potential source environments). As next‐generation sequencing becomes a standard community‐assessment method in microbiology, numerous community‐based computational methods, referred to as MST solvers hereafter have been developed and applied to various real datasets to demonstrate their utility across different contexts. Yet, those MST solvers do not consider microbial interactions and priority effects in microbial communities. Here, we revisit the performance of several representative MST solvers. We show compelling evidence that solving the MST problem using existing MST solvers is impractical when ecological dynamics plays a role in community assembly. In particular, we clearly demonstrate that the presence of either microbial interactions or priority effects will render the MST problem mathematically unsolvable for MST solvers. We further analyze data from fecal microbiota transplantation studies, finding that the state‐of‐the‐art MST solvers fail to identify donors for most of the recipients. Finally, we perform community coalescence experiments to demonstrate that the state‐of‐the‐art MST solvers fail to identify the sources for most of the sinks. Our findings suggest that ecological dynamics imposes fundamental challenges in MST. Interpretation of results of existing MST solvers should be done cautiously.

Here  !() represents the absolute abundance of species- at time  ≥ 0,  ! is its intrinsic growth rate, which is randomly drawn from a uniform distribution (0,1), if not specified otherwise.The inter-species interactions are encoded in the interaction matrix  = ( !" ) ∈ ℝ #×# , with  !" > 0 (< 0, or = 0) means that species-j promotes (inhibits or does not affect) the growth of species-i, respectively.To generate the matrix , we first generate the underlying ecological network () using an Erdős-Rényi random graph model 2 with  nodes (species) and connectivity  (the probability of randomly connecting two nodes).Then for each link ( → ) ∈ () with  ≠ , we draw  !" from a normal distribution ℕ(0,  ' ).The standard deviation  of this normal distribution represents the characteristic inter-species interaction strength.To ensure the stability of the system, the diagonal elements of  are set to be  !! = − in tuning  or  !! = − in tuning .Here  is a positive constant.All other entries of  are set to be zero.
We generated  source communities,  % ,  ' , ⋯ ,  ( , each with  ) species drawn from a pool of  = 90 species.To simplify the MST problem, the intrinsic growth rates of all species were set to be identical ( = 0.5).The composition vectors of  % ,  ' , ⋯ ,  ( (denoted as  (%) ,  (') , ⋯ ,  (() , respectively) were obtained by running the GLV model (i.e., numerically solving the ordinary differential equations (ODEs) in the GLV model) with initial species abundances randomly chosen from a uniform distribution (0,1), until a steady state was reached and then normalizing the steady-state abundance of each species by the total biomass of the community.
3) Run the GLV model until it reaches a steady state and normalize the steady-state abundance vector by the total biomass of the sink community to get the final composition of the sink community.
Consider a particular mixing order  among the total ! mixing orders.Let () denote the label of the -th source in the mixing order., () ∈ {1, ⋯ , }.The sink obtained by mixing the  sources in the order  was simulated as follows: 1) The mixing proportions of the  sources were set to be equal:  = % ( ,  = 1, ⋯ , . 2) The initial abundance vector of the sink community is determined by the composition of the first source in the order , i.e., (1), as  - (%) =   (.(%)) .Then we run the GLV model until it reaches a steady state.Denote the steady-state abundance vector as  )) (%) .
4) Repeat step-3 until all the  sources have been added to the sink.Note that right after the arrival of the -th source, the abundance vector of the sink community becomes  - (() =  )) ((/%) +   (.(()) .Then we run the GLV model until it reaches a steady state.Denote the steady-state abundance vector as  )) (() .
5) Normalize the final steady-state abundance vector  )) (() by the total biomass of the sink community to get the final composition of the sink community.
Since the input data of MST solvers is the OTU count table, for both sink and source communities, we converted the species relative abundances into counts by multiplying the absolute abundances and a fix number (1,000 in all the simulations) and rounding to the nearest integers as the synthetic count data generated by the GLV model.

2.
Microbial interactions affect the assembly of the sink community.
The deep reason why the existing MST solvers are almost doomed to fail in the presence of microbial interactions is that the true contributions of different sources are only reflected in the sink's initial composition, which will evolve to a final composition following complex ecological dynamics.In general, the final composition will be quite different from the initial one.Here we sketch a proof.
Let's assume the population dynamics of the sink community can be represented by a set of ordinary differential equations: where () = ( % (), … ,  # ()) represents the abundance vector at time ,  is an unspecified nonlinear function with  encoding all the ecological parameters, i.e., intrinsic growth rates, and intra-and inter-species interaction strengths.After a small time-step , the abundance vector of the sink can be approximated as () = (0) +  ((0); ).The ratio of relative abundance for any species pair (, ) in the initial community is 3 " (-) , while after  the ratio becomes: If  !(0) =  " (0) and  !((); ) =  " ((); ), then we have () = (0).But the condition !() =  " (), then we have () = (0).For a general population dynamics model, this requirement means that there are no inter-species interactions and the intrinsic growth rates of different species are identical, which is also too strong to be true.Hence, in general () ≠ (0), and the final composition of the sink community will be quite different from its initial composition.

3.
Priority effects affect the assembly of the sink community.).For each mixing order, we assume the arrival time of the three sources are 0, , 2, respectively, where  is a constant (and is large enough for the resulting sink community to reach a steady state).Suppose we use the composition of the final sink community taken at 3 to estimate the contribution of each source.We want to prove that at time  = 3 , the sink communities resulted from different mixing orders will have different compositions, even in the absence of any microbial interactions.To achieve that, let's compute the ratio between the relative abundance of species- and that of species- in the final sink community at time  = 3, i.e.,  !" (3) = .
Therefore, even in the absence of any microbial interactions, different mixing patterns will result in different final compositions of the sink community, which are also different from that obtained by simultaneous mixing.with  the desired mean growth rate (tuned from 10 /B to 0.5).For the species pool, we set the network connectivity  = 0.1.The diagonal elements of the interaction matrix  were set to be  !! = −5 to ensure the stability of the community, and the characteristic interaction strength was set to be  = 0.01.Each column (from left to right) represents a particular level of species overlap between 10 sources: None (each of the 10 sources consists of a unique collection of 9 species chosen from the species pool of size  = 90, so any two sources have no species overlap); Moderate overlap (for each source, we randomly drew 20 species from the species pool); High overlap (for each source, we randomly drew 50 species from the species pool).Among the 10 sources, we randomly selected 3 sources ("true sources") and mixed them with random proportions ( % ,  ' ,  4 = 1 −  % −  ' ) to form a sink.We repeated this 100 times to form 100 sinks.
Accuracy is defined as the fraction of those 100 sinks whose true sources were all correctly predicted (i.e., among all the 10 possible sources, the three sources with the highest predicted contributions are the true sources). ' is the coefficient of determination between the estimated source contributions and true source contributions.strength.Each of the 3 sources consists of a unique collection of 30 species chosen from the species pool of size  = 90, so any two sources have no species overlap.We first generated 100 initial sinks by mixing 3 sources with random proportion, then the community of each source will evolve along time.We used the sink communities at different time steps into the source tracking problem.
' is the coefficient of determination between the estimated source contributions and true source contributions.

Figure S1 :
Figure S1: Impact of positive species' mean growth rate on community-based MST.a-b, Performance of SourceTracker (red) and FEAST (blue) evaluated using Accuracy (a) and  ' (b)as a function of the species' mean growth rate  for a pool of  = 90 species.In all the simulations, we randomly drew the intrinsic growth rate  ! of species- from a uniform distribution (0, 2)

Figure S2 :
Figure S2: Impact of time scale on the FEAST.a-b, Performance of FEAST evaluated using  'as functions of time step from the initial sink community and the species' mean growth rate  for a pool of  = 90 species.In all the simulations, we randomly drew the intrinsic growth rate  ! of species- from a uniform distribution (0, 2) with  the desired mean growth rate (tuned from 10 /B to 0.5).For the species pool, we set the network connectivity  = 0.1 (a) and 0.3 (b).The diagonal elements of the interaction matrix  were set to be  !! = −5 to ensure the stability of the community.Each column (from left to right) represents a particular characteristic interaction

Figure S3 :
FigureS3: Evaluation of FEAST using FMT data from Staley et al.3 True donor (red cycle) vs. predicted donor (green square) of each recipient.For each post-FMT community (sink), among all the 7 donors, we referred to the one whose fecal sample has the highest contribution estimated by FEAST as the "predicted donor".In Fig.4c, we presented results of the first 65 sinks.Here, we showed the results of the remaining 194 sinks.Sources: microbiome samples of donors and the pre-FMT samples of recipients; Sinks: post-FMT samples of recipients.

Figure S6 :
FigureS6: Evaluation of SourceTracker using FMT data from Sharon et al.4 True donors (red cycle) vs. the predicted donor (green square) of each recipient sink given by SourceTracker using the source and sink compositions as the input.For each post-FMT community (sink), among all the 8 donors, we referred to the one whose fecal sample has the highest contribution estimated by SourceTracker as the "predicted donor".Sources: microbiome samples of donors and the pre-FMT samples of recipients; Sinks: post-FMT samples of recipients.In total, there are 106 sinks.

Figure S7 :
Figure S7: Principal Coordinate Analysis (PCoA) plot of the sinks and sources in the community coalescence experiments.a, Pairwise mixing.b, Quadruple-wise mixing.

Figure S9 :
FigureS9: Evaluation of community-based MST solvers in non-mixing dataset.The samples of 8 subjects were transferred in 10 days.We considered the 8 samples collected at day 0 as sources and the samples collected at each transfer as sinks.Accuracy is defined as the fraction of those sinks (number around each point) whose true sources (sample at day 0) were correctly predicted (i.e., among all the 8 possible sources, the source with the highest predicted contributions is the true sources).The number represents the total number of sinks.

Figure S11 :
Figure S11: Performance of SourceTracker in identifying sources in pairwise community coalescence experiments.True sources (red cycles) vs. predicted sources (green squares) of each sink.For each sink, among the 24 known sources, the two sources with the top-two largest contributions predicted by SourceTracker were referred to as the predicted sources.

Figure S12 :
Figure S12: Performance of FEAST in identifying sources in quadruple-wise community coalescence experiments.There are 24 source communities (stool samples from 24 healthy individuals).Each sink community is obtained by mixing four different source communities ex vivo.The final composition of each sink was obtained from metagenomic sequencing of samples collected after 11 days of the ex vivo mixing.True sources (red cycles) vs. predicted sources (green squares) of each sink.(Each row includes 75 sinks).For each sink, among the 24 known sources, the four sources with the top-four largest contributions predicted by FEAST were referred to as the predicted sources.

Figure S13 :
Figure S13: Performance of SourceTracker in identifying sources in quadruple-wise community coalescence experiments.There are 24 sources communities (stool samples from 24 healthy individuals).Each sink community is obtained by mixing four different source communities ex vivo.The final composition of each sink was obtained from metagenomics sequencing of samples collected after 11 days of the ex vivo mixing.True sources (red cycles) vs.predicted sources (green squares) of each sink.(Each row includes 75 sinks).For each sink, among the 24 known sources, the four sources with the top-four largest contributions predicted by SourceTracker were referred to as the predicted sources.

Figure S14 :
Figure S14: Relative abundances of common ASVs shared by sources and sinks.For each sink and source pair, we identified their common ASVs and calculated the total relative abundance of those common ASVs.Each boxplot represents the total relative abundance of common ASVs shared by this source and each of the 256 sinks in the pairwise community coalescence experiments (a); and 225 sinks in quadruple-wise community coalescence experiments (b).
Consider three source communities  % ,  ' and  4 .Let's assume species- is only present in source  % and species- is only present in source  ' .We mix the source communities in 6 different orders but with identical mixing proportions ( % 4