A Bayesian Zero-Inflated Dirichlet-Multinomial Regression Model for Multivariate Compositional Count Data

The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome data set are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other data sets.


Introduction
The human microbiome is the collection of microorganisms that live on and inside of our bodies.A major aim in human microbiome studies is investigating the feasibility of designing personalized dietary interventions that modulate and maintain the composition of the microbiome to diagnose and treat microbiome-associated diseases (Xu and Knight, 2015).
Despite recent technological and computational advances for human microbiome research, efficacious intervention strategies require a deeper understanding of the dietary factors associated with the composition and function of a healthy microbiome (Johnson et al., 2019).
The methodological developments proposed in this work were motivated by data collected in the Cross-sectional Study of Diet and Stool Microbiome Composition (COMBO), which was designed to explore dietary patterns linked to gut microbial enterotypes (Wu et al., 2011).Analyzing these data is challenged by the large number of potential associations between each dietary factor and each microbial taxon, as well as the compositional structure of the data, overdispersion, and zero-inflation, characteristic of microbiome samples.Our objective is to develop a novel Bayesian zero-inflated Dirichlet-multinomial model to estimate microbial relative abundances and explore the relation between exogenous and endogenous factors and microbial composition in the presence of excess zeros without sacrificing interpretability or imposing limiting assumptions that may bias inference.Our approach differs from existing methods as it simultaneously estimates individual-and population-level microbial abundances, quantifies parameter uncertainty, is able to accommodate and identify covariates associated with microbial abundances as well as potential zero-inflation, and is scalable to the large covariate and compositional spaces encountered in practice.

Related Work
The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application.Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by highthroughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data (i.e., the magnitude of a single component depends on the sum of all the components' counts) as well as overdispersion.A seemingly inconsequential characteristic of the DM distribution is that estimated probabilities for zero counts are strictly positive, even if the true probability of occurrence is zero.While oftentimes overlooked in practice, this limitation has profound implications on modeling and inference (see the Supporting Information for a toy example demonstrating the impacts of ignoring zero-inflation on inference).
Typically, zero-inflated models are constructed as a two-component mixture of a point mass at zero and a sampling distribution for the count data (e.g., Poisson or negative binomial distributions in the univariate setting) (Xu et al., 2015;Zhang and Yi, 2020;Jiang et al., 2021;Shuler et al., 2021).A corresponding latent indicator is introduced to differentiate between "structural" zeros which occur for events that have zero probability and "at-risk" zeros which occur for events that have positive probability but a zero count is still observed.Covariates may affect the sampling distribution of the counts as well as the probability of observing an at-risk observation (Neelon, 2019).In multivariate settings, researchers link, or jointly model, zero-inflated univariate count models via latent parameters which govern the dependence structure between counts (Aitchison and Ho, 1989;Chiquet et al., 2021).These approaches model multivariate counts unconditionally on the total count of a sample and are not fit for settings in which the count probabilities are defined on the simplex.As such, they are not suitable for the compositional count data collected in high-throughput sequencing settings when the total number of reads, or read depth, is fixed (Gloor et al., 2017).
There are few methods available for modeling zero-inflated multivariate compositional count data.Existing methods are limited as they make restrictive assumptions, fail to estimate parameter uncertainty, do not explicitly model zero-inflation indicators, only provide individual-level inference, and/or ignore potential covariates.
Recently, Tuyl (2018) proposed leveraging the neutrality of the Dirichlet distribution to allow compositional elements which have zero counts to potentially take on zero probabilities of occurrence.The task of determining which zero count observations are structural zeros is then cast as a model selection problem.The resulting mixture model is shown to reduce shrinkage when estimating the probability of categories with positive counts in the presence of zero count categories.A major limitation of this approach is that it only provides count probability estimates for a single observation and is therefore not designed to provide population-level inference given a sample of potentially heterogeneous observations.Tang and Chen (2019) introduced a zero-inflated generalized DM model to detect group-wise differential mean and dispersion levels of microbial composition.The generalized Dirichlet is a conjugate prior for the multinomial distribution and is constructed from mutually independent beta distributed variables.Tang and Chen (2019) leverage the construction's stick-breaking formulation to model excess zeros by replacing the beta distributed variables with zero-inflated beta distributed variables.They take an expectationmaximization (EM) approach for estimation, which provides a fast, parallelizable optimization procedure but lacks intrinsic uncertainty estimation.Further, their method is not designed for multiple regression settings.Zhou et al. (2021) similarly proposed zeroinflated DM and Dirichlet-tree multinomial (DTM) models for differential abundance analysis.Their methods rely on a data augmentation strategy that induces dependence on the ordering of the compositional elements, similar to the GDM (Wong, 1998) but unlike typical DM and DTM models.Zeng et al. (2022) recently proposed a zero-inflated probabilistic principal component analysis logistic normal multinomial (ZIPPCA-lnm) model.Their method imposes a lowrank structure on the compositional data that accounts for complex correlation structures among counts and can flexibly incorporate observed covariates.The authors take an empirical Bayes approach for estimation that approximates the likelihood via variational techniques and maximizes the resulting objective function to obtain parameter estimates.They consider a naive mean-field variational approximation which assumes independence among all latent factors and excess zero indicators.To further improve convergence and reduce computation time, the authors impose a hard threshold on the probability of a structural zero within their optimization routine.Subsequently, the method relies on model comparison or cross-validation techniques for selecting the threshold level for excessive zeros, the number of factors in the model, as well as which covariates to include in the model.As a result, their approach may underestimate model uncertainty.Additionally, the current R implementation of the method is only designed to adjust for one observed covariate, precluding its use in multiple regression settings.Ren et al. (2017) also incorporate dependence on latent factors among compositional counts by assuming a marginal, dependent Dirichlet process prior for each composition which is truncated at the total number of observed components.As a result, their approach is able to assign a zero probability of occurrence for zero count categories, but it does not explicitly model zero-inflation indicators to differentiate at-risk and structural zeros which limits inference.Ren et al. (2020) extend the work of Ren et al. (2017) for mixed effects regression models but the approach is only designed to handle small to medium sized covariate spaces as it also relies on model comparison techniques for model selection.
A major limitation of existing methods for modeling zero-inflated multivariate compositional count data in exploratory microbiome research settings is that they are unable to perform variable selection on covariates associated with relative abundances and the probability of being an at-risk zero.While not designed for zero-inflated multivariate compositional count data, there are numerous variable selection methods available to explore potential relations between a high-dimensional set of covariates and microbial abundances using DM regression modeling frameworks and others (Chen and Li, 2013;Wang and Zhao, 2017;Wadsworth et al., 2017;Koslovsky and Vannucci, 2020;Koslovsky et al., 2020a;Osborne et al., 2022;Miao et al., 2020).Recently, Jiang et al. (2021) proposed a Bayesian zero-inflated negative binomial regression model, which is able to identify subsets of taxa that are differentially abundant among subgroups in addition to performing variable selection.While their approach does not accommodate the compositional structure of the microbial abundance data and is not designed to identify covariates associated with potential zero-inflation, it has shown promising variable selection performance for covariates associated with multivariate compositional count data using discrete spike-and-slab prior formulations.
In this work, we propose a novel Bayesian zero-inflated Dirichlet-multinomial (ZIDM) model.While fully Bayesian methods are often criticized and even avoided in high-dimensional settings due to their computational demand, we take special care to devise a scalable approach which can handle the large model spaces encountered in omics research without sacrificing interpretability or imposing limiting assumptions.Specifically, we reparameterize the Dirichlet distribution via its relation to a set of normalized independent gamma random variables.We then replace the gamma distributions with zero-inflated gamma distributions to accommodate excess zeros.Further, we incorporate covariate dependence to model heterogeneity in compositional proportions as well as the probability of observing a structural zero.To increase the scalabilty of our model, we introduce sparsity-inducing priors for corresponding regression coefficients and leverage the Pólya-Gamma data augmentation technique for efficient sampling and interpretability (Polson et al., 2013).Additionally for posterior inference, we propose a novel Metropolis-Hastings update for potentially zero-inflated individual-level relative abundances that can accommodate changes in the dimension of the model space across Markov chain Monte Carlo (MCMC) iterations.
We demonstrate the estimation and selection performance of our model in numerous simulation scenarios and apply our model to zero-inflated microbial abundance data collected in the COMBO study.Compared to existing methods, our approach achieved improved or comparable estimation and variable selection performance on simulated data and higher variable selection stability estimates in application.

Proposed Model
We first introduce the data augmentation technique used for efficient sampling of Bayesian DM models and propose our solution for accommodating zero-inflation.We then extend the model to regression settings and further embed sparsity inducing priors for regression coefficients to handle high-dimensional compositional and covariate model spaces, equipping the model for both confirmatory and exploratory research settings.

Augmented Dirichlet-Multinomial Model
Let z i = (z i1 , . . ., z iJ ) represent a J-dimensional vector of observed multivariate counts collected on the i th observation, i = 1, . . ., N .We assume the counts z i follow a multinomial distribution where żi = J j=1 z ij is fixed, and ψ i = (ψ i1 , . . ., ψ iJ ) with ψ ij ≥ 0 and J j=1 ψ ij = 1.To account for overdispersion in the multivariate count data, a common approach is to assume the compositional probabilities ψ i ∼ Dirichlet(γ i ) with the J-dimensional vector γ i = (γ ij > 0, ∀j ∈ J).Since the Dirichlet is a conjugate prior for the multinomial distribution, the posterior for ψ i also follows a Dirichlet distribution, with posterior mean estimates , for all j = 1, . . ., J. Thus, E p(ψ i |z i ) [ψ ij ] > 0, even if the true probability of occurrence for z ij is zero.This property of the Dirichlet distribution is central to the methodological contributions of this work.
While the conjugacy of the Dirichlet prior for a multinomial distribution can be exploited to help reduce computational demand for posterior inference, this is typically only the case in trivial settings.In practice, hierarchical DM modeling frameworks typically rely on sampling-based methods for inference which are computationally burdensome due to the compositional structure and high-dimensionality of the data and resulting parameter space.Instead of working directly with the Dirichlet distribution, we impose a data augmentation strategy inspired by techniques used in Bayesian nonparametrics (James et al., 2009;Argiento et al., 2015) and detailed in Koslovsky et al. (2020b).The advantages of this approach are two-fold.First, it reduces the computational demand of the resulting MCMC algorithm in DM regression settings.Second, it facilitates an opportunity to introduce a zero-inflation indicator for each compositional element that allows the model to differentiate between a structural and at-risk zero by letting ψ ij potentially take on zero values.
To implement the data augmentation approach, we first define latent variables c ij such that , where c i = (c i1 , . . ., c iJ ) and c ij ∼ Gamma(γ ij , 1).We then introduce auxiliary parameters u i |T i ∼ Gamma( żi , T i ) for i = 1 . . ., N .This approach greatly reduces computational demand and improves mixing by eliminating unnecessary calculations of T i when sampling the posterior distribution and providing Gibbs updates for c ij and u i .See the Supporting Information for more technical details.

Zero-Inflated Dirichlet Distribution
Leveraging the data augmentation technique presented in the previous section, we propose a zero-inflated Dirichlet distribution.Intuitively, we seek an approach that places a point mass at zero for ψ ij when z ij = 0 represents a structural zero.Since Therefore to model potential zero-inflation, we introduce an at-risk indicator variable η ij ∈ {0, 1} for all i = 1, . . ., N and j = 1, . . ., J where η ij = 0 indicates a structural zero (i.e., c ij = 0) and η ij = 1 indicates c ij > 0. Specifically, we assume a zero-inflated Gamma distribution for c ij , c is the probability of a non-zero c ij (structural zero) for the j th compositional element.The superscript (η) reflects the dependence of c i and subsequently the sampling distribution of z i on the at-risk indicator.By assigning zero values to a subset of the compositional probabilities, the dimension of c (η) i and corresponding z i is reduced to J j=1 η ij .Note that when η ij = 1, z ij may still equal zero, but when η ij = 0, z ij = 0.

ZIDM Regression Model
We present a general framework for the proposed ZIDM model in which count probabilities and at-risk indicators depend on covariates.To generate inference on the relation between each compositional element and each covariate, we set λ ij = log(γ ij ) and assume λ ij = x i β γj , where x i = (1, x i1 , . . ., x i,P −1 ) represents a P -dimensional vector of covari-ates for the i th observation including an intercept term and β γj = (β γj0 , β γj1 , . . ., β γj,P −1 ) represents the corresponding covariates' relation with the j th compositional element.By exponentiating λ ij , we ensure positive hyperparameters for the zero-inflated Dirichlet distribution.Here, the exponentiation of a regression coefficient is interpreted as the multiplicative factor of change in the proportion of a compositional element with a one unit change in the corresponding standardized covariate while holding all else constant (Chen and Li, 2013).

Sparsity-Inducing Priors
For high-dimensional covariate spaces, we propose embedding multivariate variable selection spike-and-slab priors for β γj to encourage sparsity in the relation between covariates and the multivariate count data, similar to Wadsworth et al. (2017); Koslovsky et al. (2020b); Koslovsky and Vannucci (2020); Osborne et al. (2022).We assume the covariates' inclusion in the model is characterized by a latent J × P -dimensional inclusion vector ϕ.With this formulation, ϕ jp = 1 indicates that covariate p is associated with compositional element j and 0 otherwise.The prior for β γjp given ϕ jp follows a mixture of a normal distribution and a Dirac-delta function at zero, δ 0 , and is commonly referred to as the spike-and-slab prior (George and McCulloch, 1997;Brown et al., 1998).Specifically, , where σ 2 βγ is a diffuse variance.We assume each ϕ jp follows a Bernoulli prior, p(ϕ jp ) ∼ Bernoulli(w jp ), where w jp ∼ Beta(a ϕ , b ϕ ).
Integrating out w jp leads to p(ϕ jp ) = Beta(ϕ jp + a ϕ , 1 − ϕ jp + b ϕ )/Beta(a ϕ , b ϕ ).Hyperparameters a ϕ and b ϕ can be set to impose various levels of sparsity in the model.Note that covariates, including the intercept term, can be forced into the model by fixing ϕ jp = 1 for implementation of a standard ZIDM regression model.
We incorporate a latent inclusion indicator ζ jp for β θjp to induce sparsity in the covariates associated with at-risk observations.Specifically, we assume ζ jp = 1 forces the corresponding covariate into the model.Note that the covariate set potentially associated with count probabilities may differ from those potentially associated with excessive zeros.

Posterior Inference
For posterior inference, we construct a Metropolis-Hastings within Gibbs sampler.The full joint distribution is defined as where auxiliary parameters ω ij are introduced for each η ij to provide efficient sampling and interpretability of β θj using the data augmentation technique of Polson et al. (2013).A graphical representation of the proposed model is presented in Figure (Fig.) 1.The MCMC sampler used to implement our model is outlined below in Algorithm 1.While similar to the two-component zero-inflated mixture models designed for univariate count data, our approach for multivariate compositional count data differs in that we assume a mixture distribution on the count probabilities as opposed to the sampling distribution of the counts.
As a result, the dimension of the parameter space changes as the MCMC algorithm iterates through various combinations of at-risk observations and structural zeros.Specifically, the dimension of the count probabilities ψ i grows or shrinks as η ij transitions from 1 to 0 or 0 to 1 iteration-to-iteration.To address this, we propose jointly updating η ij and c ij in what we
Jointly update c (η) ij and η ij with an Expand/Contract Step.Update c end for end for Jointly update β γ and ϕ with Between and Within Steps via Savitsky et al. (2011).Jointly update β θ and ζ with Between and Within Steps via Savitsky et al. (2011).end for refer to as an Expand or Contract Step (see Supporting Information for details).Note that our approach is reminiscent of the sampler proposed by Savitsky et al. (2011) to traverse a regression coefficient space whose complexity changes over MCMC iterations.Details of the MCMC algorithm and model identifiability are found in the Supporting Information.
After burn-in, the remaining samples obtained from running Algorithm 1 for M iterations are used for inference.To identify covariates associated with count probabilities and at-risk observations, their corresponding marginal posterior probabilities of inclusion (MPPIs) are empirically estimated by calculating the average of their respective inclusion indicator's MCMC samples (George and McCulloch, 1997).Typically, covariates are included in the model if their MPPI exceeds 0.50 (Barbieri et al., 2004) or a Bayesian false discovery rate threshold, which controls for multiplicity (Newton et al., 2004).
The per-iteration time and space complexity of the proposed Metropolis-Hastings within Gibbs algorithm are linear with sample size N but depend greatly on the sparsity at both levels of the model.For large compositional spaces, the overall time (space) complexity of each MCMC iteration is dominated by the Expand and Contract Step (Within Step for β θjp ), O(J 2 N P ) (O(N P + N J + JP + P 2 )).One of the benefits of the discrete spike-andslab prior is that for sparse models (i.e., P p=1 ζ jp << P and/or P p=1 ϕ jp << P ), the time and space complexities are greatly reduced.See the Supporting Information for details of the model's computational complexity calculations.

Simulations
In this section, we evaluate and compare the performance of the proposed ZIDM model using simulated data in three scenarios with various data generation settings.The first scenario examines the estimation performance of the ZIDM model with respect to populationlevel zero-inflation probabilities, population-level count probabilities, and individual-level count probabilities.Using the proposed model's notation, these quantities are Θ j = 1/(1 + exp(β θj0 )), Γ j = exp(β γj0 )/( J j=1 exp(β γj0 )), and ψ ij = c ij /T i for all i = 1, . . ., N and j = 1, . . ., J, respectively.Note that in this scenario, the ZIDM model only estimates an intercept term in both levels of the model and ignores any potential covariates.We compare the ZIDM model to a Bayesian DM model, Tuyl's approach (Tuyl, 2018), ZIPPCA-lnm (Zeng et al., 2022), and DirFactor (Ren et al., 2017) when applicable.
In the second and third scenarios, we incorporate covariates into both levels of the model and investigate variable selection performance, in addition to individual-level estimation of the zero-inflation probabilities, Θ ij , and ψ ij .To our knowledge, there are no other existing methods that perform variable selection in zero-inflated multivariate compositional regression models for direct comparison.Thus, we compare our model's variable selection performance to a DM regression model with spike-and-slab priors (DMbvs) presented in Wadsworth et al. (2017), the penalized DM approach of Chen and Li (2013) (DMpen), as well as a Bayesian variable selection method for zero-inflated negative binomial regression models recently proposed by Jiang et al. (2021) (ZINB).For clarity, we denote our proposed method as ZIDMbvs when it is used for variable selection.For comparison, we implemented the Bayesian DM model, DMbvs, and Tuyl's approach in Rcpp (Eddelbuettel and François, 2011), similar to our methods (i.e., ZIDM and ZIDMbvs).Implementation of ZIPPCAlnm, DirFactor, and ZINB is achieved via their corresponding R packages, ZIPPCA-lnm, DirFactor-fix, and IntegrativeBayes, respectively.
In each scenario, we simulated various numbers of individuals, N , compositional components, J, and covariates, P .Multivariate count data were sampled from a Multinomial( żi |ψ * i ), where the total number of counts żi was simulated from a uniform distribution with varying upper and lower bounds to induce different levels of zero counts.The individual-specific count probabilities ψ * i were assumed to follow a Dirichlet(γ * i ), where , where γ ij is defined above, η ij ∼ Bernoulli(θ ij ), and d serves as an overdispersion parameter which was set at 0.01, similar to Wadsworth et al. (2017).Thus, the data generating model differs from all methods compared in this study.
Each of the Bayesian methods were run for 20,000 iterations and thinned to every 10 th iteration.This resulted in 2,000 iterations, of which the first 1,000 iterations were treated as burn-in, and the remaining 1,000 used for inference.We assumed weakly-informative diffuse variances σ 2 βγ = 10 and σ 2 β θ = 10 for regression coefficients, unless otherwise specified.We assumed a non-informative prior probability of inclusion at both levels of the model with when necessary, and the intercept terms were forced into the model by fixing their latent inclusion indicators to one.The spike-and-slab prior specifications for ZINB were set similar to the other Bayesian methods for consistency.All regression coefficients were initiated at zero, with the exception of the intercept terms β γj0 and β θj0 , which were simulated from a standard normal.We initialized η ij |z ij = 0 ∼ Bernoulli(0.5)and ω ij = 1.The ZIPPCA-lnm and DirFactor models were run with default settings.Since the true number of factors is unknown, we fit the ZIPPCA-lnm model with 1 to 5 factors and report the results from the model with the lowest Bayesian information criterion, as recommended by Zeng et al. (2022).The ZIPPCAlnm package provides 95% confidence intervals for Θ j and latent factors using a sandwich estimator but does not provide direct uncertainty estimates for Γ j or ψ ij .DirFactor only provides point and uncertainty estimates for individual-level count probabilities, ψ ij .For Tuyl's approach, we obtained 95% confidence intervals using Monte Carlo sampling with 40,000 iterations.Note that DMbvs, DMpen, and ZINB only perform variable selection for covariates potentially associated with the compositional relative abundances.
To evaluate the estimation performance of the models, we calculated the average absolute value of the difference between the estimated and true probabilities (ABS), Frobenius , and 95% coverage probabilities (COV), where ρij and ρ 0ij , represent estimated and true probabilities, respectively.We adjusted these metrics for population-level parameters as necessary.For variable selection, the methods were assessed on the basis of sensitivity (1 -false negative rate), specificity (1 -false positive rate), Matthew's correlation coefficient (MCC), and F1 score (two measures of overall selection accuracy).These are defined as Sensitivity = , F 1 = 2T P 2T P +F P +F N , where TN, TP, FN, and FP represent the true negatives, true positives, false negatives, and false positives, respectively.Additionally, we compare the computation time of each method run on an Intel Xeon Bronze 3204 1.9 GHz processor with 16 GB RAM.Results we report below were obtained by averaging over 100 replicated data sets for each setting.

Scenario 1
In this section, we evaluate and compare the estimation performance of the proposed method.We first set N = 100 and J = 50, where żi ∼ Uniform(400, 500) with baseline zero-inflation parameters, β θj0 , set to range between 0 and 1, inducing 50% zeros of which 25% were at-risk on average.Baseline compositional count parameters β γj0 were randomly sampled from Uniform (-2.3, 2.3).Results of this setting are presented in Table 1.Overall, we found that ZIDM better estimated the population-level zero-inflation probabilities Θ j and count probabilities Γ j compared to the ZIPPCA-lnm and the DM models, respectively.
Note that the DM (ZIPPCA-lnm) model does not provide estimates for Θ j (Γ j ).All five methods performed relatively well estimating the individual-level count probabilities ψ ij , with DirFactor and Tuyl's approach demonstrating a slight advantage.Note that neither of these methods provide estimates for Θ j and Γ j .Our approach was able to obtain nominal coverage probabilities for Θ j and Γ j , but all methods obtained roughly a 30% coverage probability for ψ ij (excluding ZIPPCA-lnm which does not estimate individual-level count probability uncertainty).The DM, ZIPPCA-lnm, and ZIDM methods took roughly 1, 2, and 4 minutes to run, respectively.Tuyl's approach provided point estimates in less than a second but required around 7 minutes to generate the Monte Carlo samples for uncertainty estimation.DirFactor took over 25 minutes to generate the 20,000 MCMC samples.
We further examined the estimation performance of the model in a variety of settings which are detailed in the Supporting Information.Briefly, we explored the models' estimation performance with varying levels of at-risk zeros and sample sizes, as well as with different data generation processes including under the assumption of the ZIPPCA-lnm model and a negative multinomial distribution.Overall, the relative performance of the methods was quite similar as in the above settings, and the proposed model was fairly robust to model misspecification.For larger sample sizes (i.e., ≥ 500 compositional compo-Table 1: Simulation Results: Parameter estimation performance in Scenario 1 for N = 100 observations and J = 50 compositional components with 50% zero cells of which 25% are at-risk on average.ABS -absolute value of the difference between the estimated and true probabilities; FROB -Frobenius norm; SIMP -Simpson's index mean squared error; COV -95% coverage probabilities.Time is in seconds (s).Time for Tuyl's approach refers to point estimate runtime with Monte Carlo sampling runtime in parentheses.

Model
Parameter nents), Tuyl's approach and ZIPPCA-lnm often failed to provide results, due to memory constraints, numerical issues, and/or failed convergence.However, results were comparable to the proposed method, DM, and DirFactor, when available.

Scenario 2
In the second simulation scenario, the total number of counts, żi , were simulated from a Uniform(1000, 2000) with baseline zero-inflation parameters, β θj0 , and compositional count parameters, β γj0 , set similar to Scenario 1.In each of the 100 replicate data sets, we set 16 of the J * (P − 1) regression coefficients to be active in both levels of the model (in addition to the intercept terms).Corresponding regression coefficients were randomly sampled from ±[0.9, 1.5], and the covariates' correlation was specified with σ = 0.3.
The selection performance results of our simulation study with varying numbers of ob- and ZINB in terms of selection performance for covariates associated with the count data in the presence of zero-inflation.In results not shown, we found that the selection performance was similar with larger covariate spaces.Additionally when data were simulated without zero-inflation, we found that our proposed model maintained similar performance to DMbvs and outperformed DMpen and ZINB in terms of sensitivity and specificity (Web Table S1).For large J settings, we found that DMpen and ZINB often failed to provide results due to due to memory constraints, numerical issues, and/or failed convergence.

Scenario 3
In the third simulation scenario, we generated data similar to the application data (Section 4) with the total number of counts, żi , simulated from a Uniform(1100, 15000) and baseline zero-inflation parameters, β θj0 , set to induce varying levels of zero-inflation (including a scenario with 30% zeros, similar to the application data).The compositional count parameters, β γj0 , were set similar to Scenario 1.In each of the 100 replicate data sets, we set 16 of the J * (P −1) regression coefficients to be active in both levels of the model (in addition to the intercept terms).Corresponding regression coefficients were sampled from ±[1.0, 3.0], and the covariates' correlation was specified with σ = 0.8.
The selection performance results of our simulation study with data generated similar to the structure of the application data are presented in Table 3.We found that the overall variable selection performance (MCC and F1) of the proposed method remained consistent across varying levels of zeros and at-risk zeros and performed the best overall.While the ZINB method often obtained higher sensitivity than the proposed method in this setting, it underperformed with respect to specificity.We also observed that the sensitivity for all methods except DMpen increased as the number of zeros in the data decreased.
Additionally to assess the models' selection performance with misspecification, we generated data from a negative multinomial distribution with varying levels of random noise introduced for the covariate dependent count probabilities.We observed similar results as Scenario 3, in which ZIDMbvs obtained the best performance overall, but ZINB had the highest sensitivity.See the Supporting Information for details.
high carbohydrate and protein/fat/choline diets, respectively.Similar patterns were observed with ZIDMbvs and DMbvs.In contrast to DMbvs, ZIDMbvs identified numerous associations between dietary intake and genus Sutterella, which has been linked to gastrointestinal diseases (Kaakoush, 2020).Further, our proposed method identified 15 associations between dietary factors and the probability of an at-risk observation (Web Figure S5).Here, we found positive relations between carbohydrates (i.e., maltose and added germ from wheats) and an at-risk observation and a negative association between palmitelaidic trans fatty acid and genus Prevotella.Compared to the proposed method, ZINB identified a similar number of associations, and the highest concentration of associations were with Prevotella.On the other hand, DMpen suggested a much sparser model and identified no associations with Prevotella.DMpen identified numerous relations with Bacteroides including positive associations with animal and dairy protein as well as negative associations with maltose, sucrose, and added germ from wheats.
To evaluate the methods' variable selection stability, we applied each method to 100 bootstrapped data sets generated from the application data and calculated the stability estimate, Φ, proposed by Nogueira et al. (2017), which ranges (asymptotically) from 0 to 1 with 1 indicating identical selection patterns across the bootstrap samples.The proposed method obtained a relatively similar stability estimate ( ΦZIDM = 0.2399) compared to the DM-based models ( ΦDMpen = 0.1998, p-value = 0.11; and ΦDM = 0.1982, p-value = 0.09) and a higher stability estimate compared to ZINB ( ΦZINB = 0.0562, p-value < 0.001).
The corresponding p-values were obtained from a two-sided test comparing the variable selection stability of the proposed method and each of the competing methods following Nogueira et al. (2017).Notably all of the methods obtained fairly poor stability estimates (i.e., Φ ≤ 0.40 (Nogueira et al., 2017)), which may potentially reflect the large betweensubject variability typically observed in human microbiome research studies.Additionally in the Supporting Information, we provide plots of the proportion of bootstrapped samples in which the associations identified in the application study were selected using each method (Figures S6-S9).We provide a sensitivity analysis for the proposed method on the application data in the Supporting Information.

Discussion
In this work, we propose a zero-inflated Dirichlet-multinomial model for multivariate compositional count data with excess zeros that provides both individual-and populationlevel inference without making restrictive assumptions or relying on approximation techniques.We then extend our model to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces.In simulation, we demonstrate that our model is able to obtain similar estimation performance for populationlevel zero-inflation probabilities, population-level count probabilities, and individual-level count probabilities compared to existing methods.Notably, our approach is the only method to provide estimates for all of these measures, in addition to simultaneously estimating model uncertainty.While individual-level inference helps capture within-and betweensubject heterogeneity critical for designing and evaluating personalized intervention strategies, population-level inference may help characterize the core microbiome, or a common set of taxa in a given host species or environment, which is a major goal in microbiome research studies (Turnbaugh et al., 2007).Additionally, the proposed method is applicable to other settings that encounter zero-inflated multivariate compositional data in which population-and individual-level estimates may be of interest (e.g., biomedical and public health research, econometrics, and ecology).We show that our method obtains better selection performance than competing methods in various regression settings.Using a combination of multiple data augmentation techniques, our method is designed to scale to large compositional as well as covariate spaces while preserving inference.Further, our approach does not require burdensome tuning procedures for implementation and results were relatively robust to hyperparameter specification.We provide an R-package with a user-friendly vignette that implements the proposed ZIDM and ZIDMbvs models, in addition to the DM model, DMbvs model, and Tuyl's approach for parameter estimation and uncertainty quantification.The vignette contains a step-by-step tutorial demonstrating how to apply the proposed methods on simulated data as well as the gut microbiome data set analyzed in the application study.
The development of a zero-inflated Dirichlet-multinomial model creates ample opportunity for future extensions that will enable more robust analysis of multivariate compositional count data found within and beyond omics research.It is well known that one of the major limitations of the DM distribution is that it does not account for positive and negative correlation structures among counts.However, the DM distribution is essential to the construction of a Dirichlet-tree multinomial distribution, which is able to accommodate more complex correlation structures.Alternatively, the model could be developed with latent factors, similar to (Ren et al., 2017), to accommodate more complex correlation structures among compositional abundances and potentially improve estimation performance.While we showcased the proposed ZIDM distribution's flexibility to handle high-dimensional regression settings, future work could explore the use of the ZIDM distribution as a prior distribution to learn underlying latent structure in hierarchical models.ij for all i = 1, . . ., N and j = 1, . . ., J, respectively.Circular (square) nodes represent random (fixed) variables and shaded (white) nodes represent observed data (parameters).Plates denote replication.N -total observations; J -total compositional elements; P -number of covariates including the intercept term.

Figure 2 :
Figure 2: Application Results: Genus-level estimates of individual-level relative abundances for the application data with the proposed ZIDMbvs model.inc.sed.-incertae sedis.

Figure 3 :
Figure 3: Application Results: Dietary covariates identified as associated with relative taxa abundances using the ZIDMbvs model.inc.sed.-incertae sedis.

Table 2 :
Simulation Results: Variable selection performance in Scenario 2 for covariates in the zero-inflation and DM portions of the model with corresponding coefficients β θ and β γ , respectively.SENS -sensitivity; SPEC -specificity; MCC -Matthew's correlation coefficient; F1 -F1 score.components,andcovariates are presented in Table2.Note that only the proposed method is able to perform variable selection on covariates potentially associated with the at-risk indicators.Here, we observed relatively stable specificity levels across simulation settings, but the sensitivity of the proposed method improved with increased sample size, as expected.We found that ZIDMbvs outperformed DMbvs, DMpen,

Table 3 :
Simulation Results: Variable selection performance in Scenario 3 with varying levels of zeros and at-risk zeros and for covariates in the zero-inflation and DM portions of the model with corresponding coefficients β θ and β γ , respectively.SENS -sensitivity; SPEC -specificity; MCC -Matthew's correlation coefficient; F1 -F1 score.