Dirichlet composition distribution for compositional data with zero components: An application to fluorescence in situ hybridization (FISH) detection of chromosome

Abstract Zeros in compositional data are very common and can be classified into rounded and essential zeros. A rounded zero refers to a small proportion or a value below the detection limit, while an essential zero refers to the complete absence of the component in the composition. In this article, we propose a new framework for analyzing compositional data with zero entries by introducing a stochastic representation. In particular, a new distribution, namely the Dirichlet composition distribution, is developed to accommodate the possible essential-zero feature in compositional data. We derive its distributional properties (e.g., its moments). Maximum likelihood estimation is carried out via the Expectation-Maximization (EM) algorithm, and a regression model based on the new Dirichlet composition distribution is developed. Simulation studies are conducted to evaluate the performance of the proposed methodologies. Finally, our method is employed to analyze a dataset of fluorescence in situ hybridization (FISH) for chromosome detection.

geology, the budget share patterns of household expenditures in economics, and the proportion of normal cells in medical research. It is noteworthy that compositional data are subject to the following two intrinsic constraints: (a) (Bounded support constraint) each element in the component vector must lie between 0 and 1, inclusive; and (b) (Summation constraint) all the elements in the component vector must sum to 1. Mosimann (1962) proposed the Dirichlet-multinomial (DM) distribution, a family of discrete multivariate probability distributions on a finite support of nonnegative integers. The DM is a compound probability distribution, which models counts from a multinomial distribution whose probability vector is drawn from a Dirichlet distribution. As a result, DM-type models (e.g., the zero-inflated generalized DM [ZIGDM] of Tang & Chen, 2019) are designed for count data and fail to deal with the compositional data, that is, the proportions or percentages described in this article. Applying statistical methods designed for unconstrained data to such compositional data may result in invalid inferences. For instance, Pearson (1897) discussed the spurious correlation issue in compositional data analysis and concluded that the unit-sum constraint is often intentionally ignored and unconstrained statistical methods are misused, which may eventually lead to disastrous results. Aitchison (1982, 1986) first proposed a systematic statistical methodology for compositional data by introducing the logistic normal (LN) distribution as a framework for compositional data analyses. In particular, his technique assumes multivariate normality of additive log-ratio transformed data. Since then, various researchers have extended Aitchison's approach in both theoretical and practical respects.
For example, Zhang (2000) discussed various distributions for compositional data on the simplex (e.g., the generalized Dirichlet, additive logistic, and spherical distributions). Egozcue et al. (2003) introduced the isometric log-ratio transformation.
In compositional data analysis, the presence of zero components poses obstacles to the application of the aforementioned distributional approaches (e.g., zero cannot appear in the denominator when applying the additive logistic transformation). Aitchison (1986) classified the zeros in compositional data into rounded (or trace element) zeros and essential (or true) zeros. It is not uncommon for compositional data to contain zero components due to either the complete absence of certain component(s) (i.e., essential zeros) or a small proportion below the detection limit (i.e., rounded zeros). Aitchison (1982) pointed out that the log-ratio transformation fails to work when such zeros appear in denominators.
To deal with rounded zeros, the most popular method is to replace the rounded zero(s) by a small value (i.e., zero replacement). For example, Palarea-Albaladejo and Martín-Fernández (2008) proposed a modified EM algorithm to replace the rounded zeros in compositional data, and Hijazi (2011) developed an EM algorithm-based method to deal with rounded zeros. A nonparametric imputation approach was proposed by Martín-Fernández et al. (2003) to handle rounded zeros.
For essential zeros, three well-known approaches have been developed. The first, data amalgamation, proposed by Aitchison (1990), eliminates the components with zero elements by combining them with some other nonzero components. The second approach models the zeros separately (e.g., Aitchison, 1986; Bear & Billheimer, 2016; Zadora et al., 2010). For instance, Bear and Billheimer (2016) projected compositions with zeros onto lower-dimensional subspaces; as a result, they developed a mixture of logistic normals that successfully addresses the issues of division by zero and the log of zero. The third approach transforms compositions into directional data on the hypersphere and develops a regression model using the Kent distribution (e.g., Kent, 1982; Scealy & Welsh, 2011), which tolerates zeros. Other methods have also been investigated, such as mixture models to eliminate the essential zeros (Stewart & Field, 2011), the latent Gaussian model (Butler & Glasbey, 2008), and the Dirichlet regression model (Tsagris & Stewart, 2018).
In clinical research, compositional data with essential zeros are not uncommon. For instance, chromosome abnormalities are considered to be the most common cause of spontaneous abortion. Fluorescence in situ hybridization (FISH) is a cytogenetic technique developed in the early 1980s (see, e.g., Langer-Safer et al., 1982). It uses fluorescent DNA probes to target specific chromosomal locations within the nucleus, resulting in colored signals that can be detected using a fluorescent microscope. For spontaneous abortion, a damaged embryo is taken out from the gravida and the FISH technique is then employed to examine cells selected randomly from the damaged embryo. Finally, the respective proportions of diploidy, triploidy, and polyploidy at chromosome 22 for those randomly chosen and tested cells are recorded for each embryo. Obviously, the observations are compositional data (i.e., the total sum is equal to one). For example, an observation of (0.2, 0.3, 0.5) means 20%, 30%, and 50% of the selected cells are chromosome diploidy, chromosome triploidy, and chromosome polyploidy, respectively. The FISH data reported in the Supporting Information are the compositional observations of 51 embryos from the curettage operation in Zhongshan People's Hospital in Mainland China. The age of each gravida is also reported. It is noteworthy that nearly 80% (i.e., 40 out of 51) of the embryos demonstrate purely normal chromosomes (i.e., the compositional observation being (1, 0, 0)). Most importantly, none of the aforementioned approaches is suitable for our FISH data, which motivates the present article.
The rest of this paper is organized as follows. In Section 2, we introduce a new stochastic representation (SR) for compositional data with zero components, and the new Dirichlet composition distribution (DCD) is defined. Likelihood-based methods for parameter estimation and confidence interval construction without covariates are provided in Section 3. Regression model analysis based on the new distribution is considered in Section 4. Simulation studies are conducted to examine the performance of our proposed methods in Section 5. We revisit and analyze the FISH dataset in Section 6. A brief discussion is presented in Section 7. Some technical details are included in the Appendix.

NEW DEFINITION OF COMPOSITIONAL RANDOM VECTOR AND THE DCD
In this section, we introduce a new definition of a compositional random vector that can be adopted for modeling compositional data. The definition is based on an SR. We then introduce the DCD by assuming that the elements of the base vector follow independent Gamma distributions.

Definition of a compositional random vector
To model the zero elements in the compositional data, we employ the SR to establish the definition of a compositional random vector $\mathbf{x} = (x_1, \ldots, x_d)^\top$, namely
$$x_i = \frac{\alpha_i z_i}{\sum_{k=1}^{d} \alpha_k z_k}, \quad i = 1, \ldots, d, \tag{1}$$
where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)^\top$ is an indicator vector and $\mathbf{z} = (z_1, \ldots, z_d)^\top$ is a base vector. The indicator vector provides the possibility of zero entries in the distribution, with $\alpha_i = 0$ meaning the $i$th component of $\mathbf{x}$ being zero. The base vector carries the quantitative information: it can be any positive random vector and determines the nonzero components of $\mathbf{x}$.

Definition of DCD
In the compositional random vector, if we let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)^\top$ be $d$-dimensional independent Bernoulli random variables excluding the point $\mathbf{0} = (0, \ldots, 0)^\top$, and let $z_1, \ldots, z_d$ be independent Gamma random variables with different shape parameters $a_1, \ldots, a_d$ but an identical rate parameter $\beta$, we can define a new distribution called the DCD. Since the rate parameter $\beta$ is eliminated in the SR of the compositional random vector, $\beta$ is unidentifiable in the distribution. Without loss of generality, we assume $\beta = 1$. That is, we write $\mathbf{x} \sim \text{DCD}(\mathbf{p}, \mathbf{a})$, where $\mathbf{p} = (p_1, \ldots, p_d)^\top$ with $p_i = \Pr(\alpha_i = 0)$ and $\mathbf{a} = (a_1, \ldots, a_d)^\top$.
The probability density function of $\mathbf{x}$ is then given by
$$f(\mathbf{x} \mid \mathbf{p}, \mathbf{a}) = \frac{\prod_{i \in S} (1 - p_i) \prod_{i \notin S} p_i}{1 - \prod_{i=1}^{d} p_i} \cdot \frac{\Gamma(a_S^*)}{\prod_{i \in S} \Gamma(a_i)} \prod_{i \in S} x_i^{a_i - 1},$$
where $\alpha_i = I(x_i > 0)$, $i = 1, \ldots, d$, $a_S^* = \sum_{i \in S} a_i$, and $S$ is the subset of indices with $x_i$ being positive (i.e., $x_i > 0$ for any $i \in S$ and $x_i = 0$ for $i \notin S$). (For more details of the probability density function, refer to Appendix A.1.) Remark 1. It is clear that $p_i \neq 1$ for $i = 1, \ldots, d$. As $\Pr(\alpha_i = 0) = p_i = 1$ implies that the elements in the $i$th column of the data matrix must be 0, we can simply delete the $i$th column, and the remaining $d - 1$ columns still form a compositional random vector.
Remark 3. We here suppose $\boldsymbol{\alpha}$ follows the zero-truncated multivariate Bernoulli distribution. Due to the SR in (1), the denominator $\sum_{k=1}^{d} \alpha_k z_k$ must be nonzero; therefore, $\alpha_1, \ldots, \alpha_d$ are not independent, as they cannot all be 0 at the same time.
That is, $\{\alpha_i\}_{i=1}^{d}$ follow independent Bernoulli distributions but exclude the point $\mathbf{0} = (0, \ldots, 0)^\top$.
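The stochastic representation described above can be sketched directly in code. The following is a minimal illustration (not the authors' implementation), assuming $\Pr(\alpha_i = 0) = p_i$, $z_i \sim \text{Gamma}(a_i, 1)$, and a rejection step to enforce the zero truncation of the indicator vector:

```python
import random

def sample_dcd(p, a, rng=None):
    """One draw from the stochastic representation:
    alpha_i = 0 with probability p[i] (zero-truncated so that the
    alpha_i are not all 0), z_i ~ Gamma(a[i], rate 1), and
    x_i = alpha_i * z_i / sum_k alpha_k * z_k."""
    rng = rng or random.Random(0)
    d = len(p)
    while True:  # rejection enforces the zero-truncation of alpha
        alpha = [0 if rng.random() < p[i] else 1 for i in range(d)]
        if any(alpha):
            break
    z = [rng.gammavariate(a[i], 1.0) for i in range(d)]
    total = sum(al * zi for al, zi in zip(alpha, z))
    return [al * zi / total for al, zi in zip(alpha, z)]

x = sample_dcd([0.1, 0.3, 0.2], [3.0, 2.0, 4.0])
```

Components flagged by a zero indicator are exactly zero, while the positive components form a Dirichlet-distributed subcomposition, mirroring the essential-zero mechanism of the DCD.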

Statistical inference without covariates
In this section, we present statistical inference based on data without covariates, which includes the calculation of maximum likelihood estimates (MLEs) in Section 3.1 and confidence interval construction in Section 3.2.

Maximum likelihood estimates for target parameters
Suppose the observations are $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_j = (x_{j1}, \ldots, x_{jd})^\top$ and $d$ is the number of dimensions. Without loss of generality, we assume that there are zero entries in the first $m$ observations; that is, $\min(\mathbf{x}_j) = 0$ for $j = 1, \ldots, m$, and $\min(\mathbf{x}_j) > 0$ for $j = m + 1, \ldots, n$, $0 \leq m \leq n$. Let $S_j$ denote the index set of the positive elements in each $\mathbf{x}_j$ (i.e., $x_{ji} > 0$ if $i \in S_j$); we have $S_j = \{1, 2, \ldots, d\}$ when $j > m$. The indicator variables can be recovered from the observations as $\alpha_{ji} = I(x_{ji} > 0)$; observing that $\alpha_{ji} = 1$ is equivalent to $i \in S_j$, the observed-data likelihood function is given by
$$L(\mathbf{p}, \mathbf{a} \mid \text{obs}) = \prod_{j=1}^{n} \frac{\prod_{i \in S_j} (1 - p_i) \prod_{i \notin S_j} p_i}{1 - \prod_{i=1}^{d} p_i} \cdot \frac{\Gamma(a_{S_j}^*)}{\prod_{i \in S_j} \Gamma(a_i)} \prod_{i \in S_j} x_{ji}^{a_i - 1}.$$
Instead of obtaining the MLEs of the parameters $\mathbf{p} = (p_1, \ldots, p_d)^\top$ and $\mathbf{a} = (a_1, \ldots, a_d)^\top$ by directly solving the score equations $\partial \ell(\mathbf{p}, \mathbf{a} \mid \text{obs}) / \partial \mathbf{p} = \mathbf{0}$ and $\partial \ell(\mathbf{p}, \mathbf{a} \mid \text{obs}) / \partial \mathbf{a} = \mathbf{0}$, we consider the EM algorithm. Motivated by the SR, we introduce the base vectors $\{\mathbf{z}_j\}_{j=1}^{n}$ and $K$ as missing data, where $K$ denotes the number of unobserved all-zero indicator vectors needed to make the components of $\boldsymbol{\alpha}$ independent; once the excluded outcome $\mathbf{0}$ is restored, the $\alpha_{ji}$ are independent Bernoulli variables. Therefore, $\{\mathbf{x}_1, \ldots, \mathbf{x}_n, \underbrace{\mathbf{0}, \ldots, \mathbf{0}}_{K}, \{\mathbf{z}_j\}_{j=1}^{n}\}$ constitute the complete data, and the complete-data log-likelihood function is, up to a constant,
$$\ell_{\text{com}} = \sum_{i=1}^{d} \Big\{ \Big[\sum_{j=1}^{n} (1 - \alpha_{ji}) + K\Big] \log p_i + n_i \log(1 - p_i) \Big\} + \sum_{j=1}^{n} \sum_{i \in S_j} \big[(a_i - 1) \log z_{ji} - z_{ji} - \log \Gamma(a_i)\big],$$
where $n_i = \#\{j : \alpha_{ji} = 1\}$. The M-step is to solve the following equations, for $i = 1, \ldots, d$:
$$\hat{p}_i = \frac{\sum_{j=1}^{n} (1 - \alpha_{ji}) + K}{n + K} \quad \text{and} \quad \sum_{j:\, \alpha_{ji} = 1} \log z_{ji} - n_i\, \psi(a_i) = 0,$$
where $\psi$ denotes the digamma function with $\psi(a) = \mathrm{d} \log \Gamma(a) / \mathrm{d} a$. However, there are no closed-form solutions for the $a_i$'s, and we use the following Newton-Raphson iterative algorithm to find the MLEs of the $a_i$'s:
$$a_i^{(t+1)} = a_i^{(t)} + \frac{\sum_{j:\, \alpha_{ji} = 1} \log z_{ji} - n_i\, \psi\big(a_i^{(t)}\big)}{n_i\, \psi'\big(a_i^{(t)}\big)}.$$
To obtain the E-step, we have the following theorem, whose proof is presented in Appendix A.2.
Theorem 1. The conditional expectation of $\log z_{ji}$ given $\mathbf{x}_j$ is as follows:
$$\mathbb{E}(\log z_{ji} \mid \mathbf{x}_j) = \log x_{ji} + \psi\big(a_{S_j}^*\big), \quad i \in S_j.$$
The E-step is to replace the missing data by the following conditional expectations:
$$\mathbb{E}(\log z_{ji} \mid \mathbf{x}_j, \mathbf{p}, \mathbf{a}) = \log x_{ji} + \psi\big(a_{S_j}^*\big) \quad \text{and} \quad \mathbb{E}(K \mid \text{obs}, \mathbf{p}) = \frac{n \prod_{i=1}^{d} p_i}{1 - \prod_{i=1}^{d} p_i}.$$
Here, we can take the initial values of the parameters to be $\mathbf{a}_0 = (1, \ldots, 1)^\top$ in the EM algorithm. The above E- and M-steps are repeated until a certain convergence condition is achieved. For instance, when the stopping rule requires the difference between two successive log-likelihood values to be less than the prespecified value 0.001, the algorithm typically stops after 100-150 iterations.
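The digamma equation in the M-step has no closed-form solution, as noted above. The following sketch (an illustration, not the authors' code) solves the one-dimensional score equation $\psi(a) = \bar{c}$ for a single shape parameter by Newton-Raphson, where $\bar{c}$ stands for the average of the E-step expectations of $\log z$; the digamma and trigamma functions are approximated by finite differences of `math.lgamma`, which is adequate for a sketch:

```python
import math

def digamma(x, h=1e-5):
    # finite-difference derivative of log-Gamma; adequate for a sketch
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    # finite-difference derivative of the digamma approximation
    return (digamma(x + h) - digamma(x - h)) / (2 * h)

def newton_shape(mean_log_z, a0=1.0, tol=1e-10, max_iter=200):
    """Solve psi(a) = mean_log_z by Newton-Raphson; this is the
    per-component M-step score equation divided through by n_i."""
    a = a0
    for _ in range(max_iter):
        a_new = a - (digamma(a) - mean_log_z) / trigamma(a)
        a_new = max(a_new, 1e-8)  # keep the shape parameter positive
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a

a_hat = newton_shape(digamma(2.5))  # should recover the shape a = 2.5
```

In the full algorithm this update is run for every component at each M-step, with `mean_log_z` refreshed by the E-step expectations.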

Confidence interval construction
In this section, we consider the construction of confidence intervals for the target parameters using the bootstrap method. Note that the value of $p_i$ must be restricted to the interval $[0, 1]$; however, Wald-type confidence intervals may produce an upper (or lower) limit larger (or smaller) than 1 (or 0). It is noteworthy that the MLE of $p_i$ obtained via our proposed EM algorithm always lies between 0 and 1. As a result, we apply the bootstrap method to construct the bootstrap confidence interval (CI) for an arbitrary function of $\boldsymbol{\theta} = (\mathbf{p}, \mathbf{a})$, denoted by $\vartheta = h(\boldsymbol{\theta})$. Briefly, based on the observations, we independently generate $\{\mathbf{x}_j^*\}_{j=1}^{n}$, with each $\mathbf{x}_j^*$ randomly selected from the observations with replacement. Having obtained $\text{obs}^* = \{\mathbf{x}_1^*, \ldots, \mathbf{x}_n^*\}$, we calculate the parameter estimates $\hat{\boldsymbol{\theta}}^*$ and obtain the bootstrap replication $\hat{\vartheta}^* = h(\hat{\boldsymbol{\theta}}^*)$. Independently repeating this process $B$ times, we obtain the replications $\{\hat{\vartheta}_b^*\}_{b=1}^{B}$. The bootstrap CI of $\vartheta$ can then be constructed as $[\hat{\vartheta}_L, \hat{\vartheta}_U]$, where $\hat{\vartheta}_L$ and $\hat{\vartheta}_U$ are the $100(\alpha/2)$ and $100(1 - \alpha/2)$ percentiles of $\{\hat{\vartheta}_b^*\}_{b=1}^{B}$, respectively.
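The percentile bootstrap just described is straightforward to implement. The sketch below is illustrative: the estimator plugged in is a hypothetical stand-in (the empirical frequency of a zero in one component) rather than the EM-based MLE, but the resample-estimate-take-percentiles skeleton is the same for any function of the parameters:

```python
import random

def percentile_bootstrap_ci(obs, estimator, B=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI: resample the observations with
    replacement, re-estimate, and take empirical percentiles."""
    rng = random.Random(seed)
    n = len(obs)
    reps = sorted(
        estimator([obs[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    lo = reps[int((alpha / 2) * B)]
    hi = reps[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi

# illustration: CI for the probability of a zero in component 2,
# estimated by its empirical frequency (a stand-in for the MLE)
data = [(1.0, 0.0, 0.0)] * 40 + [(0.2, 0.3, 0.5)] * 11
est = lambda d: sum(1 for x in d if x[1] == 0.0) / len(d)
lo, hi = percentile_bootstrap_ci(data, est)
```

In practice the `estimator` argument would run the full EM algorithm on each resampled dataset and return $h(\hat{\boldsymbol{\theta}}^*)$.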

STATISTICAL INFERENCE WITH COVARIATES
In this section, we show how to formulate the regression model for the target parameters and how to obtain the MLEs of the coefficients in the regression model. Let the covariates of the $j$th observation be denoted by $\{\mathbf{u}_j, \mathbf{v}_j\}$, $j = 1, \ldots, n$, and consider regression models that link $\mathbf{u}_j$ to the probability parameters $p_{ji}$ through coefficient vectors $\boldsymbol{\gamma}_i$ and $\mathbf{v}_j$ to the shape parameters $a_{ji}$ through coefficient vectors $\boldsymbol{\delta}_i$. Let $K_j$ denote the number of supplementary all-zero indicator vectors needed to make the elements of $\boldsymbol{\alpha}_j$ independent, where $j = 1, \ldots, n$. Obviously, $K_1, \ldots, K_n$ are missing data, with $w_j = 1$ being equivalent to $\alpha_{j1} = \cdots = \alpha_{jd} = 0$ and $\Pr(w_j = 1) = p_{j1} \cdots p_{jd}$. The complete-data likelihood function, and hence its logarithm, then follows in the same manner as in Section 3.
The MLEs of the regression coefficients are the solutions to the corresponding score equations, which obviously admit no closed form. Here, we use the Newton-Raphson algorithm to calculate the MLEs, with iterations
$$\boldsymbol{\gamma}^{(t+1)} = \boldsymbol{\gamma}^{(t)} + \big[J_{\text{com}}(\boldsymbol{\gamma}^{(t)})\big]^{-1} \nabla \ell_{\text{com}}(\boldsymbol{\gamma}^{(t)}) \quad \text{and} \quad \boldsymbol{\delta}^{(t+1)} = \boldsymbol{\delta}^{(t)} + \big[J_{\text{com}}(\boldsymbol{\delta}^{(t)})\big]^{-1} \nabla \ell_{\text{com}}(\boldsymbol{\delta}^{(t)}),$$
where $\nabla \ell_{\text{com}}$ and $J_{\text{com}}$ denote the first and negative second partial derivatives of the complete-data log-likelihood function. Here, $J_{\text{com}}(\boldsymbol{\gamma})$ and $J_{\text{com}}(\boldsymbol{\delta})$ are actually the complete-data Fisher information matrices associated with the parameter vectors $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$, respectively, which depend on neither the observed data nor the latent/missing data.
To obtain the MLEs of the parameter vectors $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ in the presence of missing data (i.e., $K_1, \ldots, K_n$ and the base vectors), we again introduce the EM algorithm. Briefly, the M-step separately calculates the MLEs of $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ via the Newton-Raphson algorithms above, and the E-step replaces the missing quantities by their conditional expectations, that is, $\mathbb{E}(\log z_{ji} \mid \text{obs}, \boldsymbol{\gamma}, \boldsymbol{\delta})$ and $\mathbb{E}(K_j \mid \text{obs}, \boldsymbol{\gamma})$. Remark 4. The calculation of the coefficients usually works well when the dimension is not large. However, the Newton-Raphson algorithm may fail when the dimension is high, owing to the Jacobian $J_{\text{com}}(\cdot)$ tending to 0 in some iterations. Therefore, studies with a large number of covariates should be handled carefully in order to obtain reliable estimates. This will be an interesting and practical topic for future research.
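Because the exact link functions were not recoverable from the text, the sketch below assumes logit and log links for the probability and shape parameters, respectively (a common choice for parameters on $(0,1)$ and $(0,\infty)$, but an assumption here, not the authors' stated specification). It shows only the covariate-to-parameter map, the building block evaluated inside each EM iteration:

```python
import math

def dcd_params(u, v, gamma, delta):
    """Map covariates to DCD parameters under the ASSUMED links
    logit(p_i) = u . gamma_i  and  log(a_i) = v . delta_i."""
    p = [1.0 / (1.0 + math.exp(-sum(ui * gi for ui, gi in zip(u, g))))
         for g in gamma]  # inverse logit keeps each p_i in (0, 1)
    a = [math.exp(sum(vi * di for vi, di in zip(v, dl)))
         for dl in delta]  # exp keeps each shape a_i positive
    return p, a

# hypothetical coefficients for a d = 2 composition, 2 covariates each
p, a = dcd_params(u=[1.0, 0.5], v=[1.0, 0.5],
                  gamma=[[0.0, 1.0], [-1.0, 0.0]],
                  delta=[[0.5, 0.0], [0.0, 2.0]])
```

Whatever the actual links, the key point is that they keep the fitted $p_{ji}$ and $a_{ji}$ inside their natural ranges for any real-valued linear predictor.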

HYPOTHESIS TEST
We are usually interested in whether some of the coefficients/parameters are equal to zero. In this section, we consider the likelihood ratio test (LRT) for hypotheses of the form
$$H_0:\ \gamma_{i_1} = \cdots = \gamma_{i_q} = 0 \ \text{and} \ \delta_{k_1} = \cdots = \delta_{k_r} = 0 \quad \text{versus} \quad H_1:\ H_0 \text{ is not true},$$
where $i_1, \ldots, i_q$ and $k_1, \ldots, k_r$ satisfy $1 \leq i_1 < \cdots < i_q \leq q_1$ and $1 \leq k_1 < \cdots < k_r \leq q_2$, and $q_1$ and $q_2$ are the numbers of covariates related to $\mathbf{p}$ and $\mathbf{a}$, respectively. The LRT statistic is then given by
$$\Lambda = -2\big\{\ell(\hat{\boldsymbol{\gamma}}_0, \hat{\boldsymbol{\delta}}_0) - \ell(\hat{\boldsymbol{\gamma}}, \hat{\boldsymbol{\delta}})\big\},$$
where $(\hat{\boldsymbol{\gamma}}_0, \hat{\boldsymbol{\delta}}_0)$ are the constrained MLEs of $(\boldsymbol{\gamma}, \boldsymbol{\delta})$ under $H_0$ and $(\hat{\boldsymbol{\gamma}}, \hat{\boldsymbol{\delta}})$ are the unconstrained MLEs. Under the null hypothesis $H_0$, the $p$-value is given by
$$p\text{-value} = \Pr\{\chi^2(\nu) \geq \lambda\},$$
where $\lambda$ is the observed value of $\Lambda$ and $\chi^2(\nu)$ is the chi-square distribution with $\nu = q + r$ degrees of freedom.

SIMULATION STUDIES
To evaluate the performance of the proposed statistical methods for the DCD, we first investigate the accuracy of point estimates and confidence interval estimates for different parameter settings via simulation studies. We then conduct simulation studies for the regression model. The MLEs of the parameters, standard deviations, and confidence intervals are presented. We compare the ZIGDM model proposed by Tang and Chen (2019) with our proposed DCD model. Finally, simulation results for hypothesis testing are presented. In this section, all statistical computations are implemented in R.

Accuracy of point and interval estimates
For $d$-dimensional compositional data, there are $2d$ parameters in the DCD (i.e., the $d$-dimensional parameter $\mathbf{p} = (p_1, \ldots, p_d)^\top$ and the $d$-dimensional parameter $\mathbf{a} = (a_1, \ldots, a_d)^\top$). We consider two cases, $d = 2$ and $d = 3$, to evaluate the accuracy of point and confidence interval estimates. When $d = 2$, we set $(\mathbf{p}, \mathbf{a}) = (0.1, 0.2, 3, 4)$ or $(0.1, 0.4, 2, 1)$. When $d = 3$, we set $(\mathbf{p}, \mathbf{a}) = (0.1, 0.3, 0.2, 3, 2, 4)$ or $(0.2, 0.2, 0.3, 2, 1, 3)$. For each parameter configuration, we generate $\{\mathbf{x}_j\}_{j=1}^{n} \sim \text{DCD}(\mathbf{p}, \mathbf{a})$ with $n = 100, 300, 500$, and calculate the MLEs via the EM algorithm and the 95% bootstrap CIs at significance level $\alpha = 0.05$ with $B = 1000$. The MLEs of the parameters and the width and coverage probability of the bootstrap confidence intervals are presented in Tables 1 and 2 for $d = 2$ and $d = 3$, respectively. From Tables 1 and 2, it is clear that the performance of the MLEs is satisfactory in the sense that (i) the bias of the estimates is negligible; (ii) the confidence width is acceptable; and (iii) the coverage probability ranges from 0.923 to 0.966, which is not far from the prespecified value $1 - 0.05 = 0.95$. Although 0.923 deviates somewhat from 0.95, the coverage proportion can be improved by increasing the sample size.

Numerical results for the regression model
In this subsection, we conduct simulations to evaluate the performance of the proposed regression model for the target parameters. Here, we set $d = 3$; the regression coefficient vector $(\gamma_1, \gamma_2, \gamma_3, \delta_1, \delta_2, \delta_3)$ is fixed at given true values, and we calculate the MLEs $(\hat{\gamma}_1, \hat{\gamma}_2, \hat{\gamma}_3, \hat{\delta}_1, \hat{\delta}_2, \hat{\delta}_3)$; this process is repeated 1000 times. The mean value, standard deviation, and bootstrap confidence interval are presented in Table 3. According to the results, the MLEs and bootstrap confidence intervals perform well.
As the data are generated from the DM model, the ZIGDM model is expected to fit better than the DCD model, and from Table 4 this is generally the case. Nevertheless, the DCD model sometimes fits better than the ZIGDM model, which suggests that our proposed DCD model is robust.
We generate data with the regression coefficients satisfying $\gamma_1 = \gamma_2 = \gamma_3 = \delta_1 = \delta_2 = \delta_3 = 0$. Applying the LRT in Section 5, we record the proportions of rejections for the above three cases with sample sizes $n = 50, 100, 200, 500, 1000$. The simulated type I error rates are reported in Table 6. Clearly, the performance of our LRT is fairly good even when the sample size is small; when the sample size is larger than 200, the simulated type I error rate is close to the prespecified significance level (i.e., 0.05). Next, we investigate the power of the LRT. We generate the data from $\{\mathbf{x}_j\}_{j=1}^{n} \sim \text{DCD}(\mathbf{p}, \mathbf{a})$ with the coefficient vector $(\gamma_1, \gamma_2', \gamma_3, \delta_1, \delta_2, \delta_3)$, where $\gamma_2'$ deviates from the null value. The numbers of rejections for the above three cases are recorded in Table 7. As expected, the simulated power increases with the sample size.

THE PERCENTAGE OF CHROMOSOME DATA BY THE FISH TEST
In this section, we revisit the FISH test dataset described in Section 1. We apply the DCD to analyze the FISH data of chromosome 16. First, we analyze the dataset with no covariate, and the results are reported in Table 8. As we can see from Table 8, the first component in the compositional dataset (i.e., the normal cell) is always nonzero: $\hat{p}_1 = 0$ means that the probability of a zero observation for the normal cell is 0. The probabilities of zero observations for the triploid and tetraploid cells are estimated to be 0.784 and 0.980, respectively; that is, for chromosome 16, triploid cells are found in roughly 20% of embryos while tetraploid cells are found in only about 2%. For the base part of the compositional data, the shape estimate corresponding to the triploid cell, 12.17, is larger than those of the normal and tetraploid cells.

Due to the assumption of a normal distribution for covariates in the regression model, we standardize the covariate age in the FISH dataset. Furthermore, we apply the regression model to investigate the relationship between the parameters $(\mathbf{p}, \mathbf{a})$ and the age of the gravida. The MLEs of the coefficients are reported in Table 9. We apply the LRT to the hypotheses listed in Section 5, and the null hypothesis is rejected at $\alpha = 0.05$. Therefore, we have reason to believe that age is a significant variable.

DISCUSSION
In this article, we consider a new framework for analyzing compositional data with zero entries based on an SR. In particular, a new distribution, namely the DCD, is developed to accommodate the possible essential-zero feature in compositional data (i.e., some components are completely absent). In our proposed distribution, the elements in the base vector are assumed to follow independent gamma distributions. In principle, any positive random variable can be adopted as an element of the base vector (e.g., the inverse Gaussian and chi-square random variables), and different base vectors will correspond to different relationships among elements. It is of research and practical interest to relax the assumption of independence among the components in the base vector. Besides, regression modeling for high-dimensional covariates is also an interesting and necessary topic, as the Jacobian tends to be zero when the dimension of the covariates is high.

ACKNOWLEDGMENTS