Semi‐parametric analysis of overdispersed count and metric data with varying follow‐up times: Asymptotic theory and small sample approximations

Abstract Count data are common endpoints in clinical trials, for example magnetic resonance imaging lesion counts in multiple sclerosis. They often exhibit high levels of overdispersion, that is variances are larger than the means. Inference is regularly based on negative binomial regression along with maximum‐likelihood estimators. Although this approach can account for heterogeneity it postulates a common overdispersion parameter across groups. Such parametric assumptions are usually difficult to verify, especially in small trials. Therefore, novel procedures that are based on asymptotic results for newly developed rate and variance estimators are proposed in a general framework. Moreover, in case of small samples the procedures are carried out using permutation techniques. Here, the usual assumption of exchangeability under the null hypothesis is not met due to varying follow‐up times and unequal overdispersion parameters. This problem is solved by the use of studentized permutations leading to valid inference methods for situations with (i) varying follow‐up times, (ii) different overdispersion parameters, and (iii) small sample sizes.

groups. Such distributional assumptions, however, can hardly be verified; especially in case of small to moderate sample sizes (Aban, Cutter, & Mavinga, 2009). Even if the distribution is correctly specified the MLEs of the overdispersion parameters are biased (Link & Sauer, 1997;Lord, 2006;Paul & Islam, 1995;Saha, 2011;Saha & Paul, 2005) that may lead to wrong conclusions. Moreover, it is quite common that varying follow-up times occur, see for example, Chen et al. (2013), McCullagh and Nelder (1989). All of the above-mentioned characteristics may not only be shared by count data, but also by metric data measured on an arbitrary scale. Simultaneously accommodating all of these complications in an accurate statistical inference method in a unified way is a rather challenging task. To the best of our knowledge no suitable methods currently exist that can simultaneously handle heteroscedastic data (counts) with varying follow-up times.
It is the aim of the present paper to develop valid inference procedures for the analysis of such data in general models allowing for possibly time-varying follow-up times and different overdispersion parameters in a nonparametric way. This is accomplished by newly derived unbiased estimators (based on the methods of moments) for the (count) rates and their variances. The rigorous study of their large sample properties then leads to asymptotically correct tests and confidence intervals for treatment effects using critical values from the standard normal distribution.
With small samples the use of normal quantiles for inference can lead to liberal or conservative decisions whereas permutation tests offer an opportunity to derive quantiles from appropriate reference distributions. In particular, the application of studentized permutation procedures is tempting since they have been shown to control the type--error rate very accurately in various situations (Chung & Romano, 2013;Chung & Romano, 2016;Janssen, 1997;Konietschke & Pauly, 2014;Pauly, Brunner, & Konietschke, 2015). The problem in this particular situation is that with varying follow-up times and unequal overdispersion parameters the usual assumption of independently identically distributed (iid) observations in the groups is not met. This issue can be solved by applying more general theorems on permutation statistics by Janssen and Pauls (2003) and Janssen (2005) and Pauly (2011). Even though data may not be exchangeable under the null hypothesis, the derived permutation methods are asymptotically correct in that they control the type I error rate or the coverage probability for hypothesis tests and confidence intervals, respectively.
The paper is organized as follows: The statistical model and point estimates are given in Section 2. Unbiased variance estimators are provided in Section 3. In Section 4, test procedures and confidence intervals are derived. Permutation-based small sample size approximations and simulation results are presented in Section 5. Finally, two illustrative data examples are analyzed in Section 6. The paper closes with a discussion of the proposed methods in Section 7. All proofs are given in the supplement to this paper.
Here, the index represents the treatment groups ( = 1 control, and = 2 treatment), and the subject within treatment group with individual follow-up time , and > 0 the expectation of group . Note that the variance 2 may depend on , for example if follows a Negative Binomial distribution (in this special case 2 = + 2 2 ), a Poisson distribution ( 2 = ), or an Exponential distribution (here 2 = 2 2 ). We further assume that the fourth moments exist and are bounded, that is sup ≥1 ( 4 ) ≤ 0 < ∞ for a constant 0 > 0 and = 1, 2.
The design is allowed to be completely heteroscedastic, that is every observation might have a different expectation and variance. All statistical procedures for the analysis of iid observations are inappropriate for statistical inference in model (1).
denote the total sample size, = ∑ =1 the total follow-up times in group , = 1, 2, and let = ∑ 2 =1 denote the total follow-up times across both treatment groups. The unknown rate parameters can be estimated without bias bŷ and can be interpreted as a weighted mean of the data. The variance of̂is given by For the derivation of asymptotic results for the rate estimates (2), the following mild regularity conditions on sample sizes and follow-up times are required: Assumption (4) ensures that the follow-up times appear on a fixed time interval of interest, while Assumptions (5)-(7) guarantee the existence of limiting variances of the point estimates, see Theorem 2.1 below. In particular, it follows immediately, that the estimator̂is consistent as → ∞ and → ∞, respectively. However, the variance 2 defined in (3) represents an unknown weighted sequence of the quantities 2 , which depends on both the follow-up times and sample sizes. Thus, it cannot be represented by model constants. In order to derive inference methods for the general hypothesis 0 ∶ ℎ( 1 , 2 ) = 0 , however, the estimator needs to be multiplied by adequate known coefficients, such that 2 converges to a specific variance constant, which is, asymptotically, unaffected by the follow-up times and sample sizes. The result along with the multivariate normality of the estimator̂= (̂1,̂2) ′ of = ( 1 , 2 ) ′ are given in the next theorem.
Note that the diagonal covariance matrix neither depends on the sample sizes , nor on the time-varying coefficients . The matrix, is, however, unknown in practical applications, and needs to be estimated. An unbiased and  2 -consistent estimator is derived in the next section.

ESTIMATION OF THE VARIANCE
Moment-based estimators for variances denote, roughly speaking, the squared deviation from the mean. In model (1), however, no uniquely defined mean exists. In particular, the variance 2 is a sum of variances, and is not defined as a fixed variance constant. Therefore, the usual sample variance moment-based estimator is biased, a rather inappropriate characteristic of a variance estimator. Below, we derive an unbiased and consistent moment-based estimator of 2 .
Define the random variables̃= −̂, and note that (̃) = 0 for all = 1, 2, and = 1, … , . The variables describe the deviation of to its estimated expectation. An unbiased moment-based estimator can now be derived by considering the squared deviation from̃along with a bias correction. Define and consider̃2 The estimator̃2 is not a usual sample variance estimator, since it only involves sums of the follow-up times as weighting factors. However, it describes the mean squared deviation from the observations to their estimated mean̂. Further let denote the diagonal matrix with diagonal elements 1 2̃2 1 and 1 2̃2 2 , respectively. It is shown in the next theorem, that̃2 is an unbiased estimator of 2 and that̂is  2 -consistent.
Theorem 3.1. For each = 1, 2 the estimator̃2 is an unbiased estimator of 2 . Moreover, the estimator̂is  2 -consistent, that is A detailed proof is given in the supplementary material.
Remark. We note that the variance estimator̃2 may become negative in "severe" situations, that is if any is way larger than the others. In this case we suggest to use the asymptotically unbiased versioñ The asymptotic normality of the point estimates and the consistent variance estimates can now be used for the derivation of test procedures and confidence intervals.

SMALL SAMPLE APPROXIMATIONS AND SIMULATION RESULTS
Extensive simulations were conducted to investigate the accuracies of the test procedures derived in Section 4 for small sample sizes with regard to (i) controlling the type-1 error rate at the nominal significance level ( = 5%), (ii) their powers to detect certain alternatives 1 ∶ ℎ( 1 , 2 ) ≠ 0 , and (iii) the coverage probabilities of the corresponding confidence intervals in (16). All simulations were conducted with R environment, version 2.15.2. (R Development Core Team, 2010), each with = 10, 000 simulation runs.
In all simulations, we focus on testing the hypothesis corresponding to the function ℎ( 1 , 2 ) = ( 1 , 2 ) = log( 1 ∕ 2 ). The test statistic is given by which yield to asymptotically valid tests Moreover, confidence intervals can be derived from (16), respectively. Simulation studies indicate, however, that the statistic ( ) in (18) tends to result in rather liberal conclusions for small sample sizes ( ≤ 20). Therefore, we propose a studentized permutation approach to approximate its sampling distribution for small sample sizes. This will be explained in the next section.

A studentized permutation approach
Permutation tests are widely known to be robust and exact level tests when the data are exchangeable. Exchangeability implies, however, that variances across the groups are identical. As mentioned above, the data are allowed to be completely heteroscedastic in model (1). Roughly speaking, a usual permutation test would fail to test the null hypotheses formulated above. However, asymptotic permutation tests can be obtained, if appropriate studentized statistics are permuted, which will now be briefly explained: It turns out that the test statistic ( ) follows, asymptotically, a standard normal distribution under the null hypothesis. A permutation or resampling test would now lead to accurate results (at least asymptotically), if the conditional permutation distribution of the test statistic ( ) , say * , would generally mimick the null distribution of the test statistic. That is, both distributions should at least coincide asymptotically. If that is the case, critical values (or P-values) could be computed from the permutation distribution instead of the standard normal distribution for making inferences. Therefore, the goal of the following investigations is to show that the permutation distribution of ( ) , * , is indeed the standard normal distribution. In order to do so, some notations and ideas about the permutation schemes are necessary: Let = ( 11 , … , 1 1 , 21 , … , 2 2 ) ′ denote the pooled sample, and let = ( 11 , … , 1 1 , 21 , … , 2 2 ) ′ denote the corresponding vector of the pooled follow-up times . For a fixed, but random permutation of (1, … , ), let = ( 11 , … , 1 1 , 21 , … , 2 2 ) ′ and = ( 11 , … , 1 1 , 21 , … , 2 2 ) ′ denote the permuted data and corresponding follow-up times, respectively.
Permuting and using the same random permutation , the permuted values and are not necessarily independent, which is a rather (at least technically) undesirable property in this context. We therefore propose to permute and independently. This is similar to two sample problems with right-censored survival data, where it is also recommended that the permuted failure times do not occur in general with their corresponding censoring indicators, see Janssen and Mayer (2003) as well as Brendel, Janssen, Meyer, and Pauly (2014). To this end, we consider another random permutation ′ of (1, … , ) that is independent of and calculate the permuted estimatorŝ( , ′ ) =̂( , . Note that the possible number of random permutation is considerably increased when permuting both and independently. It turns out that the distribution of the test statistic ℎ(̂1,̂2) differs in the general model (1) from its permutation distribution, and a valid level test can not be achieved in this setup. Therefore, we consider the distribution of the test statistic (ℎ) defined in (15) and of the studentized quantity The conditional limiting distribution of ( , ′ ) (ℎ) given the data will be derived in the next theorem.
as given in (19) and denote by Φ( ) the standard normal distribution function. If 2 > 0, then we have convergence under the null as well as under the alternative with Theorem 5.1 states that the limiting standard normal distribution of ( , ′ ) ( ) does not depend on the distribution of the data, particularly, it is achieved for arbitrary ℎ( 1 , 2 ) = 0 , that is it even holds under the alternative.
∕2 denotes the ∕2-quantile from the studentized permutation distribution of ( ) . In the next theorem, we will show that both the conditional and unconditional tests are asymptotically equivalent, which means, that both tests have, asymptotically, the same power to detect certain alternatives.
Theorem 5.2. Suppose that the assumptions of Theorem 5.1 are fulfilled.

2.
The permutation test ( , ′ ) is consistent, that is we have convergence In particular, Theorem 5.1 states that the distributions of the pivotal quantity (ℎ) and of the studentized permutation statistic asymptotically coincide. Under the assumptions of Theorem 5.1, approximate (1 − )-confidence intervals for can be obtained from

Simulation results
In a negative binomial-( , )-model we investigate the empirical control of the preassigned type-1 error rate at the usual two-sided significance level = 5% of the statistic ( ) in (18) using the standard normal approximation as given in (15), and the permutation test using the quantiles of the conditional distribution of ( , ′ ) (ℎ) in (19) as critical values. As a further competing procedure, we estimate the variances 2 using maximum likelihood methods. In this ( , )-model the variance 2 is given by the weighted sequence of the quantities + 2 2 , respectively. An intuitive plug-in estimation approach is achieved by replacing the unknown parameter bŷfrom above and by a consistent maximum-likelihood estimator (ML) , for example by using̈2 T A B L E 1 Simulated designs, where ∈ {0, 5, 10, 20, 25} and 1 = (7, 7) ′ , 2 = (7, 15) ′  see, for example Schneider, Schmidli, and Friede (2013). This estimation approach, however, has the disadvantage that neither̂2 nor̂are unbiased estimators of 2 or , respectively, resulting in biased variance estimators. The variance estimatorŝ2 used in ( ) are finally replaced bÿ2, and the corresponding Wald-statistic, which is asymptotically equivalent to the Likelihood-ratio test, denoted by LRT.

Type-1 error rate simulations
We explore the behavior of the test statistics for smaller and larger effect rates 1 and 2 ∈ {1.5, 10} as well as smaller and larger overdispersion parameters 1 and 2 ∈ {0.3, 0.5, 3, 5}.
All simulation designs are motivated by the examples presented in Section 6. A major assessment criterion for the accuracy of the procedures is their behavior when increasing sample sizes are combined with increasing variance parameter constellations (positive pairing) or with decreasing variances (negative pairing). We investigate balanced situations with sample size vector 1 = ( 1 , 2 ) ′ = (7, 7) and unbalanced situations with sample size vector 2 = ( 1 , 2 ) = (7, 15) ′ . The sample sizes are increased by adding a constant to the components of the vectors or , respectively. The different simulation settings are displayed in Table 1. Each simulation setting = ( ) = ( 1 + , 2 + ) ′ represents a different design with an increasing sample size , where = 1, 2, see Table 1. Data were generated from ∼ ( , ), where denotes the realization from a uniformly distributed random variable ∼ (1, 2), respectively. For each simulation setting, the same generated follow-up times were used for the = 10, 000 simulation runs, but they were newly generated for each design. The number of random permutations was set to = 10, 000. The simulated type-1 error rates for a significance level = 5% assuming uniformly distributed follow-up times are displayed in Figure 1.
It turns out that in case of small effect rates ( 1 = 2 = 1.5) and small overdispersion parameters the statistics ( ) based on the normal approximation as well as the LRT statistics based on ML tend to be slightly liberal. It can be readily seen from Figure 1 that the permutation tests control the type-1 error rate best, even for extremely small sample sizes. In case of larger effect rates and overdispersion parameters the distribution of the data is much more skewed. In these situations the procedures ( ) based on the normal approximation and ML tend to considerably overreject the null hypothesis ( ) 0 . Remarkably, the estimated type-1 error rates are even larger than 20% and 10%, respectively in Settings 6-10 (see Figure 1). In comparison, the permutation technique greatly improves the finite sample performance of all asymptotic procedures, and is therefore recommended in practical applications.
In order to investigate the impact of the underlying distributions of the follow-up times, we resimulate the same designs with exponentially distributed follow-up times ∼ (2) + 1. The results are displayed in the supplementary material. It can be seen that the shape of the underlying follow-up times distributions slightly affect the behavior of the statistics in all scenarios. This is intuitively clear, since the different follow-up times particularly influence the variance of the effect estimators, and increase the variance with wider ranging follow-up times or certain amount of skewness. Therefore, all procedures tend to be slightly more liberal when wide ranging follow-up times and small sample sizes are apparent. This can be particularly seen by the permutation test. The liberality, disappears with increasing sample sizes. F I G U R E 1 Type-I error level ( = 5%) simulation results (y-axis) of the statistics ( ) in (18), permutation test ( , ′ ) (ℎ) in (19) and ML-based statistics for different distributions, sample size increments ∈ {0, 5, 10, 15, 20, 25} (x-axis), where denote the realizations from ∼ (1, 2). The simulation settings are described in Table 1

Power comparisons
The type-1 error rate simulation results presented in Section 5.2.1 indicate a quite liberal behavior of the methods ( ) and MLbased statistics under certain parameter constellations and small sample sizes. All methods tend to accurate conclusions with large sample sizes. The liberality of these methods increases the "power" of the methods to detect alternatives in small sample size settings. In an additional simulation study, not presented here, it turned out, that with large sample sizes, that is when all competing methods are accurate, their powers are all very similar.

Simulated coverage rates of the confidence intervals
Next we investigate the empirical coverage probabilities of the corresponding confidence intervals. Data were generated by 1 ∼ ( 1 1 , 1 ), = 1, … , 1 and 2 ∼ ( 1 (1 + ) 2 , 2 ) for varying ∈ {0, 0.1, 0.2, … , 2}, 1 , 2 ∈ {10, 20}, and different overdispersion parameters. For illustration purposes, we only display the results using uniformly distributed followup times, different overdispersion parameters 1 = 3 and 2 = 5 and rate 1 = 10. The results are displayed in Figure 2. It is readily seen that the competing procedures tend to be rather liberal, while the empirical coverage probabilities of the permutationbased confidence intervals are closer to the nominal level of 95%. The quality of the approximation depends on sample sizes and F I G U R E 2 Empirical coverage probabilities of nominal 95% confidence intervals of the corresponding confidence intervals given in (16), permutation-based confidence intervals given in (20)  in (19)  the actual levels of heteroscedasticity across the groups and their allocations. If the larger sample has a smaller variance than the smaller sample ( 1 = 20, 2 = 10), the confidence intervals tend to be slightly liberal for small samples. However, this issue vanishes with increasing sample sizes.

Simulation results for general metric data
As mentioned in the Introduction and in the description of model (1), data is not required to be count data and thus, numerical investigations of the behavior of the studentized permutation test are intriguing. We therefore investigate the empirical control of the type-1 error rate of the studentized permutation test ( , ′ ) ( ) in completely heteroscedastic designs with metric data following exponential or 2 -distributions. The method will be compared with ( ) using the standard normal approximation. Exponentially distributed variables were generated by ∼ ( ⋅ 1∕2), = 1, 2, = 1, … , , and 2 -variables were generated by ∼ 2 , respectively.
The results are displayed in Table 2 and show that the studentized permutation approach controls the nominal type-1 error rate very well and greatly improves the standard normal approximation.

T WO ILLUSTRATIVE EXAMPLES
Pediatric MS with disease onset under the age of 16 is uncommon and qualifies as a rare disease. Differences in clinical presentation before and after puberty have been reported (Huppke et al., 2014). Randomized controlled trials in pediatric MS have been very rare (Unkel et al., 2016), but are becomming more common now (Rose & Müller, 2016). We consider a randomized controlled trial assessing efficacy and safety of interferon beta-1a compared to no treatment in pediatric MS reported by Pakdaman, Fallah, Sahraian, Pakdaman, and Meysamie (2006). In this trial, 16 patients were randomized to verum or control. Relapse rates and new T2 lesions were both considered as endpoints. The estimated rates and overdispersion parameters are given in Table 3. As a second example, we consider the Acyclovir trial reported by Lycke et al. (1996). In this experiment, Acyclovir treatment was used in a randomized, double-blind, placebo-controlled clinical trial with parallel groups to test the hypothesis that herpes virus infections are involved in the pathogenesis of MS. In total, = 60 adult patients were recruited, whereas 1 = 2 = 30 were randomized to placebo or active treatment, respectively. The data (relapse counts) can be found in Figure 1 in the original publication (Lycke et al., 1996). As a secondary analysis of this trial, the relapse counts from patients that showed a progressive course during the trial were excluded from the statistical analysis. In this situation, patients have different follow-up times and estimators must be weighted accordingly.
The estimated rates and overdispersions being defined as variance-to-mean ratios are given in Table 3. It can be readily seen from Table 3 that the overdispersion parameters seem to differ between the treatment groups, and even underdispersed counts are apparent. The effect of the different overdispersion parameters on the behavior of the statistical methods has been analyzed in detail in extensive simulation studies in Section 5.2.
Both motivating examples discussed above used over-and underdispersed counts as outcomes. Here, we present the results based on standard methods including normal approximation and maximum-likelihood as well as the new developed methods. The test statistic being used is given by denotes the estimated variance of the effect estimator using a MLE estimator of the overdispersion parameter , which is assumed to be identical across both treatment groups. As competing methods, we also analyze the data using both a Negative Binomial Regression-and Poisson Regression using SAS PROC GENMOD. Thus, the illustrative examples include constant as well as varying follow-up times, and even the analyses with constant followup times still presents a challenge since the sample sizes are with 16 and 60 very and moderately small, and the overdispersion is fairly pronounced, in particular for the MRI lesion counts and relapses. The effect estimates, standard errors, test statistics, P-values as well as 95%-confidence intervals are displayed in Table 4.

T A B L E 3 Estimated rates and overdispersion parameters (Variance / Mean Ratio) for the two example studies
It can be readily seen from Table 4, that the estimated standard errors of the effect estimates for the T2 lesions are likely, and therefore all methods results in the same conclusion. Only the estimated standard error being computed via a Poisson-Regression tends to be smaller. This occurs because the Poisson-Regression sets the overdispersion to be zero, by default. A significant effect at 5% level can not be detected with any method (P > 0.05). The relapse rates are significantly different at 5%level of significance. It can be seen, however, that the estimates of the standard errors significantly differ from the moment-based unbiased variance estimators (SE = 0.216 vs. SE = 0.302 using ML). Therefore, the P-values based on ML estimates are larger than using the moments-based estimator and standard normal distribution (P = 0.003 vs. P = 0.034). However, since sample size is rather small, the permutation approach is the most robust method in this setup, and results in a P-value of P = 0.026. Since both over-and underdispersed counts were observed, the ML.Pool, the negative binomial, and poisson regression are tend to provide identical results.
The results obtained for the Acyclovir trial, however, differ significantly. First, both treatment groups show a different overdispersion. Therefore, the SE obtained by a Poisson-Regression is way smaller than with all other methods, and thus results in a significant treatment effect at 5% level of significance. Comparing the other estimation approaches it can be seen that the ML-based estimation approaches (assuming negative binomial distribution) of the SE tend to be larger than the unbiased methods-of-moments based methods. The largest SE is estimated via ML.Pool (which is identical to a NB-Regression). The estimated standard error based on the unbiased variance estimate is given by SE = 0.215. Therefore, the P-values range from 0.052 through 0.071. Due to the moderate sample size of = 60, both the normal and permutation approximation tend to provide similar P-values with = 0.052 and = 0.054, respectively. The secondary analysis of the the Acyclovir trial shows similar results to the above. This occurs because only the relapse counts from four of the 60 patients were excluded from the analysis. However, slightly different effect estimates coming from the Negative Binomial and Poisson Regression can be seen. This occurs, because in case of unequal follow-up times the rates are estimated using maximum likelihood estimation methods, which are not identical to moment (mean-based) methods.

DISCUSSION
In this paper, inference methods for testing hypotheses formulated in terms of the effect rates of overdispersed counts were developed without assuming a specific data distribution and/or different overdispersion parameters. They are based on the asymptotic properties of novel unbiased estimators of the count rates and their variances. In order to provide valid methods for small sample sizes, resampling methods have been derived. Although data is in general not exchangeable, following the ideas of Neuhaus (1993), Janssen (1997Janssen ( , 2005, and Chung and Romano (2013), studentized permutation techniques could be applied. Simulation studies indicate, however, that the procedures control the nominal level reasonably well even with ≈ 5.
Furthermore, in clinical trials, the computation of confidence intervals for the treatment effects is important, following the ICH E9 guideline for randomized clinical trials: "Estimates of treatments shall be accompanied by confidence intervals, whenever possible& (ICH E9 Guideline 1998, chap. 5.5, p. 25). For instance, Saha (2013) investigates different methods for the computation of confidence intervals for the mean difference in the analysis of overdispersed count data (assuming constant follow-up times ). In this paper, these procedures were generalized for possibly time-varying and overdispersed count data and equipped with the studentized permutation approach. Extensive simulation studies show that the new methods improve the existing methods in terms of coverage probability and type--error rate control. Furthermore, we only considered one possible unbiased estimator of the rates bŷ= 1 ∑ =1 , which is known as a weighted mean estimator. Another unbiased estimator is given by the unweighted mean̂( ) = 1 ∑ =1 , or least-square based estimatorŝ= ( ′ ) −1 ′ , where = ( 1 . … , ) ′ and = ( 1 , … , ) ′ denote the vectors of follow-up times and response per group , respectively. Investigating and comparing those estimators and generalizations thereof is tempting and will be subject to future research.
In future investigations, the results shall be extended to more general models allowing for covariates (e.g. for baseline adjustment) and several samples. Furthermore, investigating the overlap of range-preserving confidence intervals for the effects is an interesting attempt for making inferences (Noguchi & Marmolejo-Ramos, 2016).