Empirical null estimation using zero‐inflated discrete mixture distributions and its application to protein domain data

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene‐centric approaches since the latter have limitations in considering the functional context that the position of the mutation provides. This presents a large‐scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero‐inflated model in order to account for the true zeros in the count model and the excess zeros. The class of models considered is the Zero‐inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assumed that there exists a cut‐off value such that smaller counts than this value are generated from the null distribution. We present several data‐dependent methods to determine the cut‐off value. We also consider a two‐stage procedure based on screening process so that the number of mutations exceeding a certain value should be considered as significant mutations. Simulated and protein domain data sets are used to illustrate this procedure in estimation of the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two‐stage testing procedure has superior empirical power.


Introduction
Interest towards multiple testing procedures has been growing rapidly in the advent of the so-called genomic age. With the breakthrough in large-scale methods to purify, identify, and characterize DNA, RNA, proteins, and other molecules, researchers are becoming increasingly reliant on statistical methods for determining the significance of biological findings (Pollard et al., 2005). Gene-based analyses of cancer data are classic examples of studies which present thousands of genes for simultaneous hypothesis testing. However, Nehrt et al. (2012) reported that gene-centric cancer studies are limited since the functional context that the position of the mutation provides is not considered. In lieu of this, Nehrt et al. (2012) have shown that protein domain level analyses of cancer somatic variants can identify functionally relevant somatic mutations where traditional gene-centric methods fail by focusing on protein domain regions within genes, leveraging the modularity and polyfunctionality of genes.
In protein domain-centric studies, somatic mutations from sequenced tumor samples are mapped from their genomic positions to positions within protein domains, enabling the comparison of distant genomic regions that share similar structure and amino acid composition (Peterson et al., 2013). In the analysis of sequenced tumor samples, it is assumed that the mutational distribution will consist of many "passenger" mutations, which are non-functional randomly distributed background mutations, in addition to rare functional "driver" mutations that reoccur at specific sites within the domain and contribute to the initiation or progression of cancer (Parmigiani et al., 2007). The major interest is in a single domain, how to identify the highly mutated positions compared to the background where the number of positions in a domain can be as large as several hundreds.
Motivated by the aforementioned domain-level analyses, we propose a methodology for identifying significant mutation counts while controlling the rate of false rejections. Dudoit et al. (2003) reported that much of the statistics microarray literature is focused on controlling the probability of a Type I error, a "false discovery." A traditional approach is to control the family-wise error rate (FWER), the probability of making at least one false discovery. However, with the collection of simultaneous hypothesis tests in the hundreds or thousands, trying to limit the probability of even a single false discovery leads to lack of power. Alternatively, in a seminal paper, Benjamini and Hochberg (1995) introduced a multiple hypothesis testing error measure called False Discovery Rate (FDR). This quantity is the expected proportion of false positive findings among all the rejected hypotheses. Among the FDR-controlling test methods, Efron et al. (2001) developed an empirical Bayes approach where they established a close connection between the estimated posterior probabilities and a local version of the FDR.
A key step in controlling the local false discoveries is to estimate the null distribution of the test statistics. Efron (2004) stated that the test statistics in large-scale testing may not accurately follow the theoretical null distribution. Instead, the density of the null distribution is estimated from the large number of genes. In these microarray experiments, Efron (2005) employed a normal mixture model and proposed maximum likelihood and mode matching to estimate the empirical null distribution. Park et al. (2011) proposed a local FDR estimation procedure based on modeling the null distribution with a mixture of normal distributions. However, these existing methods are based on the assumption that the null is a mixture of continuous distributions. In the case of domain-level analyses, the data is characterized as mutation counts among N positions in the domain. This indicates that the available methods in the estimation of the empirical null should be extended to a mixture of discrete distributions.
The rest of the article is organized as follows: In Section 2, we discuss the problem in detail and review two existing multiple testing procedures, namely Efron's Local FDR procedure and Storey's procedure. In Section 3, we introduce the estimation procedure for f 0 , f , and π 0 , where the null distribution is assumed to be a zero-inflated model. Also, a novel two-stage multiple testing procedure is presented in this section. In Section 4, the performance of the new procedure is studied via simulations and the results for real data sets are presented. Some concluding remarks will be presented in Section 5.

Multiple Testing Procedures Controlling FDR
In this section, we briefly discuss the motivating example and review the existing procedures for analysis. The collection of the original dataset is a = (a 1 , a 2 , . . . , a N ) , where a i is the number of mutations in the ith position of the specific domain with N positions. We define A as the set of the unique values of a, K = max(a), and L is the cardinality of A where L ≤ K + 1. Some relevant features of a follow. A large proportion of positions do not have any mutation, a i = 0. Also, L is relatively small compared to N, which means that the number of mutations in many positions are tied. Since our goal is to identify the positions with extra disease mutation counts, it is only reasonable to have the same conclusion for positions wherein the number of mutations are tied. Therefore, we transform the data into the observed "histograph" of positions over "mutation counts." We define n j =| {i : a i = j} |, as the number of positions with j mutations, j ∈ A, where A ≡ {j : j ≥ 0, n j > 0} and j≤K n j = N. The ordered data x N can be represented as a partition of the unique values of a, that is, where x j is the column vector containing n j of js.
For any single domain of interest, a total of L mutation counts can be decomposed into two groups, A 0 and A 1 , where A 0 is the collection of small number of mutation counts which is considered to be non-significant and A 1 is the set of large number of mutation counts which consists of significantly mutated positions. Let the prior probabilities of the two groups be π 0 or π 1 = 1 − π 0 , and assume corresponding densities, f 0 or f 1 . Define f 0 to be the null distribution and f 1 to be the alternative distribution. Therefore, we consider the problem of testing L null hypotheses simultaneously, H 0j is true for j ∈ A, |A| = L, where H 0j is stated as the number of mutations j is generated from f 0 .
For a given position, the number of mutations follow one of the two distributions f 0 or f 1 , so the probability density function of the mixture distribution can be represented as and our goal is to identify the positions which have significantly different patterns from the null. For continuous data, Efron (2005) introduced the idea of "zero assumption" where observations around the central peak of the distribution consists mainly of null cases. Using this assumption, f 0 is estimated using Gaussian quadrature which is based on derivative at the mode. However, such a procedure is not applicable to discrete data. In our problem on discrete data, we introduce the following assumption on the null distribution which plays a key role throughout this article.
Assumption on f 0 : From the assumption, a i ≤ C are guaranteed to be from f 0 and a i > C are generated from the mixture of f 0 and f 1 . We will discuss more details about how to choose the value of C in the next section. We do not have the issue of identifiability in the mixture model (1) since f 1 has a different support from that of f 0 from the assumption (2), so any parametric model for f 0 considered in our article is uniquely identified. Benjamini and Hochberg (1995) employed a sequential p-value method to determine r that tells us to reject p (1) , p (2) , . . . , p (r) , where p (1) , p (2) , . . . , p (K) are the ordered observed p-values. Storey (2002) improved the Benjamini-Hochberg (BH) procedure with the inclusion of the estimator of the null proportion,π 0 , which indicates that we reject The BH procedure and Storey's procedure are equivalent, that is r = , if we takeπ 0 = 1. The details about the estimation of π 0 is provided in the next section. Moreover, following Efron (2007) we define the local FDR at any mutation count, say t, as which indicates that fdr(t) is the posterior probability of a true null hypothesis at t. The interpretation of the local FDR value is analogous to the frequentist's p-value wherein local FDR values less than a specified level of significance provide stronger evidence against the null hypothesis.

Model Specification
Depending on the application, we assume that the mutation counts follow a zero-inflated model in order to account for the true zeros in the count model and the excess zeros. For example, the zero-inflation observed in protein domain data is due to the negative selection of mutations that can not occur in either a healthy cell or a cancer cell. Biologically, zero-inflation is expected due to of the many protein positions that are crucial for proper function, often leading to a cell that can not perform the functions necessary for life. The class of models considered is the Generalized Poisson (GP) distribution introduced by Consul and Jain (1970), with an additional zero-inflation parameter. Let T be a nonnegative integer-valued random variable where relative to Poisson model, it is overdispersed with variance to mean ratio exceeding 1. If T ∼ GP(λ, θ), then the probability mass function can be written as where 0 ≤ θ < 1 and λ > 0. If zero is observed with a significantly higher frequency, we can include a zero-inflation parameter in (4) to characterize the distribution. Then X ∼ ZIGP (η, λ, θ) and the probability that X = j, denoted by f 0 (j), is where j is a nonnegative integer, 0 ≤ η < 1, 0 ≤ θ < 1 and λ > 0. The indicator function I S (j) is equal to 1 if j ∈ S and 0 otherwise. Recently, ZIGP models have been found useful for the analysis of heavy-tailed count data with a large proportion of zeros (Famoye and Singh (2006)). The ZIGP model reduces to Zero-Inflated Poisson (ZIP) distribution when θ = 0, Generalized Poisson distribution (GP) when η = 0 and Poisson distribution when η = 0 and θ = 0. The ZIP model, first introduced by Lambert (1992), is applied when the count data possess the equality of mean and variance property while taking into consideration the structural zeros and zeros which exist by chance. Meanwhile, the Zero-Inflated Negative Binomial (ZINB) model is widely used for handling data with population heterogeneity which may be caused by the occurrence of excess zeros and the overdispersion due to unobserved heterogeneity (Phang and Loh, 2013). The rationale for choosing ZIGP as a model for f 0 was provided by Joe and Zhu (2005). They showed that ZIGP provides a better fit than ZINB when there is a large fraction of zeros and the data is heavily right-skewed. They compared the probabilistic properties of the zero-inflated variations of NB and GP distributions, such as probability mass and skewness, while keeping the first two moments fixed. Using this result, it is worthwhile to consider ZIGP rather than ZINB given that the mutation count data exhibited both features.
3.2. Estimation of f 0 , f , and π 0 From equation (3), the local FDR formulation consists of unknown quantities f 0 , f , and π 0 which must be estimated accordingly. We follow the idea of "zero assumption" in Efron (2005) where f 0 is assumed to be normally distributed and Park et al. (2011) which modeled f 0 as a mixture of normal distributions.
However, since f 0 is unknown in practice, four count models will be compared in order to come up with estimates for the parameters of the null distribution. These models belong to the class of ZIGP distribution, namely, (i) ZIGP (ii) ZIP (iii) Generalized Poisson, and (iv) Poisson distribution. If the true f 0 is ZIGP and the model used to estimate f 0 is ZIGP then we expect superior results compared to the other three distributions. Moreover, if the true null distribution is ZIP, then we expect better results for ZIP and ZIGP distribution compared to GP and Poisson distribution. This suggests that since ZIGP can characterize overdispersion, even if there is none such as the case of ZIP, it should still be able to capture the behavior of f 0 accurately.
To estimate the parameters of f 0 for any of these four count models, the EM Algorithm proposed by McLachlan and Jones (1988) will be utilized. For truncated data sets described in (2), fitting the model using EM algorithm is not straightforward as when all data points are available. The details of the EM Algorithm are provided in Web Appendix A.
Moreover, it is straightforward to estimate f (j) by using relative frequency given byf (j) = n j /N. Using the assumption on f 0 , for j ≤ C, f (j) from (1) reduces to π 0 f 0 (j).

Choice of the Cut-off C
In our model, we assume that we can identify a cut-off C, wherein positions with number of mutations greater than C contain more mutations than what would be expected in the null model. The choice of the cut-off C is of paramount importance since the estimation of f 0 and π 0 depend on C. It is more realistic to assume that C is unknown, so such a predetermined C may affect the result of local FDR procedure seriously. In particular, if C is predetermined and is chosen to be larger than the true value, the null distribution is estimated based on observations from alternative hypothesis as well as null hypothesis, so the estimated null distribution is contaminated by the alternative distribution. This will cause insensitivity of local FDR procedure in detecting the alternative hypothesis. On the other hand, if C is chosen to be smaller, then the null distribution is estimated only based on small values, so the estimation of the null distribution especially at the tail part is less reliable. Empirically, the FDR procedure yields liberal results in that there are too many rejections resulting in failure in controlling a given level of FDR.
The estimation of the cut-off C has been formulated using the likelihood function. Define the index sets We adopt the idea of sequential testing to detect the change point in which the observations are generated from the mixture distribution f . More specifically, suppose we observed (0, n 0 ), (1, n 2 ), . . . , (K, n K ) sequentially from Our goal is to detect the change point C based on assuming that we observe 0, 1, 2, . . . , K sequentially. For a given ν, we define S ν ( , f ) as . Since the parameters is estimated from EM algorithm andf (j) = n j N , our procedure iŝ whereˆ ν is the estimator from the EM algorithm with the value of C set to ν. One may consider the full likelihood of all observations and find out some connection between S ν and the full likelihood presented as follows: Sheetlin et al. (2011) offer an objective-change point method that can replace the subjective approaches performed by eye-balling the data. Their proposed method resembles the change-point regression and robust regression but it is tailored to estimate the change point from a transient to an asymptotic regime. Given a tuning parameter c and a criterion function ρ, depending on β, the estimator for the change point k is defined as where ρ(e i ) is the estimated least-squares normalized residual.
In (5), there is a tuning parameter c which should be given ahead. The value of c plays the role of penalty for adding terms ρ(e i ) in (5), so the predetermined value of c affects k arbitrarily. We see that our proposed estimation of C is related to the form (5). We estimate C viâ whereˆ ν = (π 0,ν ,η ν ,θ ν ,λ ν ) is obtained from the EM algorithm discussed in the previous section,f (j) = n j /N, In (5), c is a predetermined value, however we don't need to predetermine any parameter in (6). The proposed criterion (6) is related to the penalized model selection such as AIC and BIC. When we use the information that n = j≤C n j observed values are generated there is a compromise term c = − log π 0 for each observation to compensate adding additional terms. There is a total of N ν positions, so when we use the assumption ν = C, we consider N ν log π 0 penalty to the log likelihood function ν . Most of well known model selection criteria have similar forms where the penalty terms are related to penalize the complexity of models. In our context, the term − log π 0 gives penalty to using the information that j for j ≤ ν are generated from f 0 . For a small value of π 0 , the corresponding penalty (− log π 0 ) is large since a large penalty should be given to a low chance of f 0 . On the other hand, if π 0 is close to 1, there becomes small risk from assuming observations are from the null hypothesis.
For the second method, we consider the extension of the methodology proposed by Efron (2007) which explicitly uses the zero assumption. This stipulates that the non-null density f 1 is supported outside some set {0, 1, . . . , C}. The likelihood function for The cut-off can be computed aŝ

Modification of Local FDR by Truncation
In practice, if a given domain position has a large number of mutations, then these mutations are expected to be significant. In many cases, there are relatively few positions in a protein domain where large values of mutations can be observed. This indicates that for large values of j, estimation of f based on relative frequency is not accurate due to the sparse data in the tail part. Consequently, the estimated local FDR is not reliable since it depends on the estimator of f . Rather than testing significance based on inaccurate local FDRs from large mutation counts, we consider a screening process so that the number of mutations exceeding a certain value should be considered as significant mutations. Such a critical value will be decided depending on the estimated null distribution. When we have observations a i for 1 ≤ i ≤ N generated from the null distribution, we are interested in figuring out D N such that hardly observed under the null hypothesis, so the corresponding null hypothesis is rejected directly rather than making decision based on local FDR procedure. There are many choices of D N , but a smaller sequence of D N satisfying (8) is of our interest since any sequence B N satisfying B N > D N also satisfies the property. The details of the calculation of D N are provided in Web Appendix B. The calculation of D N when f 0 is modeled using ZIGP or Generalized Poisson is similar since the derivation will eventually yield leading terms which does not involve η. Likewise, the calculation of D N when f 0 is modeled using either ZIP or Poisson is the same. On the other hand, since the true values of the parameters λ and θ are unknown, we calculate D N using the estimates λ and θ. Hence, D N can be calculated as where x is the smallest integer greater than or equal to x(x > 0). D 1 and D 2 are presented in Web Supplementary Materials (Web Appendix B) where D 1 is a function ofλ and θ and D 2 is a function ofθ when f 0 is modeled using ZIGP or Generalized Poisson distribution. When f 0 is modeled using ZIP or Poisson distribution, D 1 is a function of λ only while D 2 does not depend on any parameter estimate. Finally, the proposed Two-Stage procedure can be summarized into two stages:

Two-Stage Procedure
(1) Stage 1(Screening Step): Based onĈ (eitherĈ 1 in equation (6) orĈ 2 in equation (7), estimate Ĉ and computeD N in (9). To incorporate this condition in the formulation of D N , we have We reject the H 0j if j ≥ D N (Ĉ) and do not reject if j ≤Ĉ.

Simulation Studies
To gain insights regarding the robustness of the proposed procedures in the presence of model misspecification, we perform some simulation studies. The comparison is based on four simulation boundaries: (i) model for the estimation of f 0 ; (ii) method used in the choice of the cut-off C; (iii) true null distribution; and (iv) non-null distribution used in data generation. There are four models compared for the estimation of f 0 as discussed in Section 3.2. Also, there are two methods presented in Section 3.3 for the choice of cut-off C. The true null distributions considered are Zero-Inflated Poisson (ZIP) and Zero-Inflated Generalized Poisson (ZIGP) distribution. Both distributions account for the excessive number of zeros which is a characteristic of the mutation count data. Following the key assumption on f 0 , the support of f 1 does not contain values in [0, C]. Hence, f 1 can be expressed as f 1 = C + 1 + W where W follows another count model. For the model specification of W, Geometric(p = 0.08) and Binomial(n = 250, p = 0.20) distribution are utilized. They exhibit the pattern of the mutation count observed in the real data set.
Using these model specifications in terms of the true f 0 and f 1 , there are 15 mixture models considered for data generation as presented in Web Table S1. For each of the specification of f 0 ,Ĉ is calculated and the corresponding set of parameter estimates Ĉ = ( ηĈ, λĈ, θĈ, π 0,Ĉ ) from EM Algorithm are obtained. Results show that regardless of the true null distribution, if f 0 is modeled using the ZIGP distribution, thenĈ produces the most accurate estimate for C in terms of the bias and standard error. This validates the robustness of ZIGP as a model for f 0 , that is, even if the true null distribution is ZIP, the most accurate estimate ofĈ can still be observed when f 0 is modeled using ZIGP. The bias of the parameter estimates for each of these specifications are presented in Web Tables S2-S5 in the Supplementary Materials.
A total of L hypotheses tests were performed for independent random variables n j over 1000 replications. For each replication, the proportion of n j from the null distribution is set to be π 0 and the total number of positions N is specified to be 1000. To calculate the False Discovery Rate, FDR, for the kth generated data, k = 1, 2, . . . , 1000, we compute the False Discovery Proportion (FDP) which is defined by where V k and R k are the number of falsely rejected hypotheses (false discoveries) and the total number of rejected hypotheses in the kth generated data, respectively. FDR is the expected value of the false discovery proportion and can be computed empirically as In our simulations, the decision rule is to reject the null H 0j if fdr(j) =π 0f0 (j)/f (j) ≤ α. Throughout the simulations, we consider the level of significance α = 0.05. The True Positive Rate, TPR is computed empirically as where S k and T k are the number of correctly rejected hypotheses (true discoveries) and the number of falsely accepted hypotheses (false non-discoveries) in the kth generated data, respectively. Three procedures are compared in terms of controlling FDR and the value of TPR, namely the one-stage local FDR procedure, the proposed two-stage procedure and Storey's procedure. The results which yields the superior TPR while controlling FDR are highlighted using bold face quantities.
As displayed in Figure 1a-c, the non-null distribution is Geometric, the proportion of null cases is 0.80 and the fraction of zeros is 0.80. The degree to which the null model is mixed with the non-null model is described using the three cases. ZIP 1 represents the well-separated case, ZIGP 1 is the moderately mixed case while ZIGP 2 can be described as the heavily mixed case. The corresponding numerical comparison is shown in Table 1.
Based upon the results of Table 1, since null and non-null distribution is moderately mixed for ZIGP 1 , the resulting TPR for all three procedures is substantially higher than the TPR for ZIGP 2 , regardless of the model used for the estimation of f 0 . Given that FDR is controlled in all procedures, if the model for f 0 is ZIGP, then the Two-Stage procedure yields the highest TPR.
This suggests that the proposed procedure is better than the other existing procedures. Meanwhile, due to the "wellseparation" if the true null is ZIP 1 , then the TPR for ZIP 1 is slightly higher than the TPR for ZIGP 1 . Moreover, the FDR for all three procedures for ZIP 1 are noticeably lower than the FDR for ZIGP 1 . This means that the number of rejections for ZIGP 1 and ZIP 1 are almost the same but there are more erroneous rejections for ZIGP 1 . This result can be explained by the presence of overdispersion in ZIGP 1 , thereby suggesting that the presence of overdispersion in the data could lead to erroneous rejections. Overall, there are more rejections usinĝ C 1 as a cut-off compared toĈ 2 . Also, even ifĈ 1 yields more rejections, it still controls the value of FDR demonstrating the superiority ofĈ 1 as a cut-off method.
Figure 1d-f also presents the histograms when the nonnull distribution is Binomial, the proportion of null cases is 0.35 and the fraction of zeros is 0.40. Unlike the parametrization of the Geometric non-null distribution which appears to be skewed to the right, this f 1 exhibits near symmetry. In terms of the mixing of the null and non-null distribution, ZIP 2 represents the well-separated case, ZIGP 3 is the moderately mixed case while ZIGP 4 can be described as the heavily mixed case. The numerical comparison is shown in Table 2.
According to Table 2, when f 0 is modeled using ZIGP, usingĈ 1 as a cut-off yielded many more rejections, regardless of the procedure used. This suggests that the extension of Efron's method is conservative and would miss significant positions. The difference betweenĈ 1 andĈ 2 is further highlighted for ZIGP 4 , where the true null distribution is heavily mixed with the non-null distribution and overdispersion is present. For scenarios where FDR is controlled for ZIP 2 and ZIGP 3 , using Two-Stage procedure leads to the highest TPR. However, for ZIGP 4 where overdispersion is evident and f 0 is heavily mixed with f 1 , the value of TPR is substantially higher using Storey's procedure, while keeping the FDR controlled.
Another scenario considered is when the true non-null distribution is Geometric, the proportion of null cases is 0.85 but the fraction of zeros is 0.40. Unlike the scenario presented in Table 1 and Figure 1, the specified proportion of zeros is reduced to half. The interest is to determine whether there would be a change in pattern should there be a significant decrease in the number of positions without a mutation. The histograms and the corresponding numerical comparison are presented in Web Figure S1 and Web Table S6 in the Supplementary section. It can be noted that regardless of the magnitude of the fraction of zeros, a similar pattern can be observed in terms of the superiority ofĈ 1 as a method for choosing C. Among the procedures, the TPR is consistently highest for the Two-Stage procedure, given that the FDR is

Table 2
Numerical Comparison when f 1 is shifted Binomial(n = 250, p = 0.20), π 0 = 0.35 and C = 5. ZIP 2 (η = 0.40, λ = 1.5) is the well-separated case, ZIGP 3 (η = 0.40, λ = 1, θ = 0.30) represents the moderately mixed case, and ZIGP 4 (η = 0.40, λ = 4, θ = 0.30) represents the heavily mixed case. The number in (·) represents the standard error. The results which yields the superior TPR while controlling FDR are highlighted using bold face quantities. Finally, the last scenario considered is when the non-null distribution is Binomial, the fraction of zeros is still 0.40 but the proportion of null cases is increased to 0.80. When the true f 0 exhibits near symmetry, the goal is to determine whether there would be a change in pattern should there be a significant increase in the number of positions without a mutation. The histograms and the corresponding numerical comparison are presented in Web Figure S2 and Web Table S7 in the Supplementary section. Results revealed that the difference betweenĈ 1 andĈ 2 is apparent when there is overdispersion and f 0 is heavily mixed with f 1 . If ZIGP is the model used for the estimation of f 0 , the value of TPR is substantially higher usingĈ 1 , while keeping the FDR controlled. Moreover, the resulting TPR for the moderately mixed case is substantially higher than the TPR for the heavily mixed case, regardless of the model used for the estimation of f 0 and the procedure employed. However, given that FDR is controlled by specifying either of the two models, using ZIGP leads to a higher TPR than when the true model ZIP is specified. This result implies using ZIGP would yield satisfactory results even under model misspecification.
Taking everything into account, for the well-separated and moderately mixed case, if the null model is correctly specified and FDR is controlled, using the Two-Stage procedure yields FDR closest to the nominal level α. Consequently, the Two-Stage procedure is superior in terms TPR in most cases. If the true null model is ZIGP and the null model is correctly specified, FDR is controlled in all procedures. However, the Two-Stage procedure is better than the local FDR procedure and Storey's procedure in terms of TPR. It can also be noted that if the true model is ZIP and ZIGP is used to model the null distribution, then the Two-Stage Procedure still yields the closest FDR to α and leads to higher TPR as compared to the other procedures. This implies using the Two-Stage Procedure when the null model is misspecified would still produce satisfactory results. Moreover, regardless of the shape of the non-null distribution, the Two-Stage Procedure yields better results than the other procedures.
Additionally, the condition (2) can be weakened so that mutation counts belonging in [0, C] can be contaminated by f 1 with probability close to zero. In fact, if there are alternative observations below C, then the true C is changed to be a smaller value (say C ) where all observations below C are actually from f 0 . Intuitively, we can posit that this may affect the estimates of null parameters and the TPR. To address this concern, we perform additional simulation studies using C = 1 which represents that small values of data can be observed from the alternative. The bias and standard error of the parameter estimates are provided in Web Tables S8-S11. Results show that the estimates of null parameters tend to have large biases and standard errors compared to those for C = 5 in Web Tables S2-S7. Furthermore, the corresponding numerical comparisons of FDR and TPR were also provided in Web Tables S12-S15. When there are alternative observations for a small value of C, the situations become more heavily mixed case, so this results in lower TPR while FDRs are still controlled.

Application to Protein Domain Data
In the study of cancer at the molecular level, it is important to understand which somatic mutations contribute to tumor initiation or progression in order to develop new treatments or to identify patients for which a given treatment will be effective. In this field, some researchers focus on the patterns of somatic variants on protein domains due to the well-defined function and structure of these units, which can lead to a better understanding of how these functions are disrupted in cancer Peterson et al., 2010). Here, one interesting issue is identifying "protein domain hotspots", or positions within domains that are found to be mutated frequently . It is among a fixed number of positions in a single domain, which ones are significantly different from the majority.
In application to cancer, Peterson et al. (2012) discovered that known cancer variants tended to cluster at specific positions more than variants involved with unrelated diseases, suggesting that protein domain hotspots could be useful in understanding the molecular mechanisms involved with cancer. It is a novel solution for the application of protein domain hotspots to somatic variants in sequenced tumor samples, which has the potential to distinguish between driver variants that contribute to cancer and passenger variants that do not contribute and are assumed to be distributed randomly.
As an example, we analyze the mutation data obtained from the tumors of 5848 patients from The Cancer Genome Atlas (TCGA) data portal (http://tcga-data.nci.nih.gov/tcga/, Collins and Barker, 2007). These were mapped to specific positions within protein domain models to identify clusters. TCGA MAF files were obtained for 20 cancer types.
Among several hundreds of domains, we focus on six functionally well-known domains to identify the hotspots of somatic variants in TCGA sequenced tumor samples. We start with the hotspots on growth factors (cd00031), which are known to harbor reoccurring somatic mutations involved with clonal expansion, invasion across tissue barriers, and colonization of distant niches (Jeanes et al., 2008). Furthermore, protein kinases (cd00180) and RAS-Like GTPase family of genes (cd00882) are well-known for their role in regulating pathways important to cancer (Tsatsanis and Spandidos, 2000). Genes with kinases or RAS-like GTPases are expected to harbor driver mutations that reoccur at specific sites since they are classic examples of proto-oncogenes that mutate into oncogenes, contributing to cancer (Anderson et al., 1992). Additionally, we identify hotspots on ankyrin domains (cd00204), which play a role in mediating protein-protein interactions important in cancer (Imaoka et al., 2014). Also, we find hotspots on transmembrane domains of proteins that are known to be involved with signal transduction, which is relevant in controlling processes involved with cancer (Sever and Brugge, 2015) and experimental evidence confirms the important regulatory role played by membrane proteins in cancer Neuhaus et al. (2009).
Since the mutation counts are discrete, we apply our proposed method based on various discrete models, such as ZIGP, ZIP, Generalized Poisson and Poisson distribution for f 0 . The estimated parameters based on those models are reported in Web Table S16 in the Supplementary Section. Figure 2 shows the distribution of each protein domain and its total number of positions. In order to assess the general goodness-of-fit using these different parametric models, we estimate the probabilities for a i ≤Ĉ and compared these estimated values with relative frequencies. The results are presented in Web Figures S7-S12 in the Web Supplementary Materials.
The identified number of positions which are mutated differently from expected are in Table 3. For example, when assume that f 0 follows ZIGP, the results show that the identified hotspots on growth factor domain (cd00031) based on One-stage and Two-stage procedures are 143 positions among a total of 366 positions, usingĈ 2 . On the other hand, the local FDR withĈ 1 identifies more hotspots for Two-stage procedure (201) than One-stage (191) and Storey's procedure (200). The rest of the domains can be analyzed in the similar manner. Results from Table 3 revealed that usingĈ 1 yields more rejections. This suggests that the data analysis for the real data shows the same pattern as the simulation results presented previously. Moreover, two domains can be highlighted in terms of the difference in the number of rejections, namely, cd00180 and pfam00001. The number of rejections usingĈ 1 is almost four times higher if the model used for f 0 is ZIGP and the procedure employed is either local FDR or Two-stage method. For both of these data sets, the number of rejections usingĈ 1 is almost twice compared toĈ 2 given that the model for f 0 is ZIGP and using Storey's FDR. Overall, the results for the real data analysis is consistent with the simulation studies.
These results recapitulate much of what is known about how protein domain families contribute to the initiation or progression of cancer and are useful to biologists studying the molecular mechanisms underlying cancer. For instance, hotspots identified on the calcium-binding domain of epidermal growth factors (cd00054) are mapped to the 1EMN structure in Figure 3. Here, all three residues known to bind to calcium are also identified as hotspots, confirming the known importance of this binding pocket in cancer. In addition, we identify 42 other residues on the domain as hotspots that do not bind with calcium but tend to occur around the binding pocket and are present in many patients, suggesting their importance in cancer.

Conclusion
In this article, our primary interest is to select significant mutation counts for a specific protein domain, while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. For the ith position, the number of mutations a i follow one of the two distributions f 0 or f 1 , namely, the null or alternative distribution, respectively. With π 0 and 1 − π 0 representing the prior probabilities of the two groups, the probability density function of the mixture distribution can be represented as f (a i ) = π 0 f 0 (a i ) + (1 − π 0 )f 1 (a i ). We assume that if the number of mutations a i ≤ C, then a i is guaranteed to be from the null model, for some positive integer C, i = 1, 2, . . . , N. We propose a method for identify a cutoff C and show that this is superior to the cut-off developed by extending Efron's proposal. In addition, after the selection of this cut-off, we consider a screening process so that the number of mutations exceeding a certain value D N (Ĉ) (Ĉ + 1 ≤ D N (Ĉ)) should be considered as significant mutations.
The proposed two-stage procedure in the selection of C and D N yielded a testing procedure which is superior in terms TPR in most cases. For the well-separated and moderately mixed case, if the null model is correctly specified, then using the Two-Stage procedure yields FDR closest to the nominal level α and the highest TPR. It can be noted that if the true model is ZIP and ZIGP is used to model f 0 , then the Two-Stage Procedure still yields the closest FDR to α and leads to highest TPR. This means that even when the null model is misspecified, the Two-Stage procedure would still produce satisfactory results. Also, regardless of the shape of the alternative distribution, the Two-Stage Procedure yields better results than the other procedures. Furthermore, results from six chosen protein domain data sets revealed that using the proposed cut-off for C yielded more rejections. In general, the results for the real data analysis is consistent with the simulation studies. These results sum up how protein domain families contribute to the initiation or progression of cancer and are useful to biologists studying the molecular mechanisms underlying cancer.

Supplementary Materials
Web Appendix A, referenced in Section 3.2 and Web Appendix B which is referenced in 3.4, is available with this article at the Biometric website on Wiley Online Library. The R codes used in this article are available at the Biometrics website on Wiley Online Library.