Blinded sample size reestimation for negative binomial regression with baseline adjustment

In randomized clinical trials, it is standard to include baseline variables in the primary analysis as covariates, as recommended by international guidelines. For the study design to be consistent with the analysis, these variables should also be taken into account when calculating the sample size to appropriately power the trial. Because assumptions made in the sample size calculation are always subject to some degree of uncertainty, a blinded sample size reestimation (BSSR) is recommended to adjust the sample size when necessary. In this article, we introduce a BSSR approach for count data outcomes with baseline covariates. Count outcomes are common in clinical trials; examples include the number of exacerbations in asthma and chronic obstructive pulmonary disease, relapses and scan lesions in multiple sclerosis, and seizures in epilepsy. The introduced methods are based on Wald and likelihood ratio test statistics. The approaches are illustrated by a clinical trial in epilepsy. The BSSR procedures proposed are compared in a Monte Carlo simulation study and shown to yield power values close to the target while not inflating the type I error rate.

This is obvious for normally distributed outcomes, but also holds for nonnormally distributed outcomes such as binary endpoints. 3 Therefore, the sample size planning and the analysis should match to obtain a sample size which maintains a target power.
Depending on the scale of the outcome variable, different sample size estimation methods exist. For normally distributed outcomes, Frison and Pocock 4 proposed a sample size formula for the analysis of covariance (ANCOVA). For studies with dichotomous outcomes, Hernandez et al 3 state that an adjustment for a predictive variable in the analysis may lead to an increase in statistical power. In practice, sample size planning and analysis frequently do not match, as Lyles et al 5 note. One reason might be that it is often difficult to obtain reliable estimates of the correlations between the baseline variables and the outcome variable. In these situations, designs with an internal pilot study (IPS) 6 provide a framework to deal with uncertainties in nuisance parameters such as correlations between baseline and outcome variables. For instance, Friede and Kieser studied such designs considering ANCOVA models with tests for superiority 7 and noninferiority. 8 They proposed blinded procedures, as these have favorable characteristics from a regulatory point of view as compared to unblinded procedures. 9,10 Many clinical trials have endpoints modeled as count data, for example, the number of malignant melanoma lesions or the number of seizures in epilepsy patients. Recurrent event data, for example, the number of multiple sclerosis (MS) relapses during follow-up, can also be considered as count data. For the analysis of count data, in general, a Poisson regression or, in case of overdispersion, a negative binomial regression can be used to adjust for covariates. Under the assumption of constant event rates, these regression models can also be applied to analyze data with varying follow-up times by including the logarithm of the follow-up times as an offset in the regression model.
Several authors proposed sample size formulas for trials with count outcomes. While some assume specific distributions such as negative binomial distribution, 11,12 others consider more general situations. [13][14][15] In addition, some approaches consider covariates, with Signorini 16 being one of the early proposals. Lyles et al 5 review approaches for the period up to 2005 and propose a very flexible approach for the power calculation in generalized linear models (GLM). Since then, further approaches for the sample size planning of specific designs were proposed, for example, for clustered data, 17 variable follow-up times, 18 noninferiority trials, 19 and overdispersed data. 13 However, the general approach from Lyles et al 5 can also be applied to these designs.
Cook et al, 20 Schneider et al, 21,22 as well as Friede and Schmidli 23,24 proposed adaptive designs for nuisance parameter based sample size reestimation with count outcomes, but did not consider covariates. Therefore, the aim of this article is to propose blinded sample size reestimation (BSSR) procedures for clinical trials with count data considering covariates and to systematically investigate their operational characteristics. The proposed methods are based on Wald or likelihood ratio (LR) test statistics and use the expected Fisher information (FI) or the Lyles approach for the sample size estimation.
In the next section, the clinical evaluation of antiepileptic drugs is discussed and the example dataset is described. In Section 3, the statistical model, the applied test statistics, and the sample size calculation approaches are presented, whereas in Section 4, a procedure for sample size reestimation is proposed. The statistical properties of the approaches regarding type I error rate and statistical power are investigated in Section 5, and the approaches are applied to the example dataset in Section 6. We close with a brief discussion of the results in Section 7.

CLINICAL EVALUATION OF ANTIEPILEPTIC DRUGS
The Food and Drug Administration (FDA) and European Medicines Agency (EMA) generally define the efficacy of antiepileptic drugs in terms of the number of seizures. The EMA recommends the reduction in the frequency of seizures as a primary endpoint for exploratory studies. 25 In contrast, for therapeutic confirmatory studies, the responder rate is recommended as a primary endpoint, where a response is defined as a seizure reduction of 50%. 25 For late phase I and phase II, as well as for phase III studies, the FDA recommends the percentage change from baseline in seizure frequency as primary endpoint. This may lead to studies with different primary endpoints for different regions (see, eg, Rosenfeld et al 26 ). A crossover study design is mentioned in both guidelines as one possible design (in the EMA guideline only for exploratory studies). However, in the large randomized studies published over the past few years, a parallel group study design was used predominantly. [27][28][29][30] Regarding the analysis of seizure counts, the EMA guideline explains that "the distribution of seizure frequencies are usually heavily skewed." Therefore, in phase III studies, the outcome is usually transformed using a rank- or a log-transformation. Most randomized studies analyzed the seizure frequency as percent change from baseline using an ANCOVA with the baseline seizure frequency as covariate. 26,31,32 However, in the mentioned studies, the sample size calculation ignored the baseline covariate effect and therefore did not match the analysis approach.

[Figure 1: … weeks pretreatment and 8 weeks on treatment in the two treatment groups observed in the study by Leppik et al 34 (boxplot indicates mean +; median line; box: 25%-75% quantile; whisker: 5%-95% quantile)]
We found no phase III trial with a sample size reestimation, although it often would have been helpful, as the assumptions for the sample size calculation differed considerably from the study results in several cases, implying potentially underpowered or overpowered studies. 27,28,32 For example, in Reference 27, for the sample size calculation, the mean log-transformed 28-day seizure rate was estimated to be 2.59 for placebo, 2.32 for pregabalin (PGB) as intervention with dosage of 150 mg, and 2.01 for PGB 300 mg. By contrast, in the end, the adjusted mean values estimated using ANCOVA were 1.96, 1.95, and 1.82 for placebo, PGB 150, and PGB 300, respectively. For confirmatory studies, the EMA and FDA guidelines recommend at least 12 weeks as maintenance period, which is also standard in large-scale confirmatory trials. 28,29,32,33 For the baseline period, no specific duration is recommended, but mostly a 4-week interval is used. [27][28][29]32,33 The sample size in recently published phase III trials ranged from 100 to 300 patients per group. 27,28,30,32 The median weekly seizure rates in these trials ranged from 1.75 to 3.6. Because only the median number of seizures in the baseline period was reported in the considered articles, it was not possible to calculate the underlying overdispersion.
To motivate our work, we consider the double-blind randomized study for the comparison of progabide and placebo from Leppik et al 34 as an example. This study has already served as an example in other methodological articles. 35,36 The sample size of 59 epileptic patients was small compared to the phase III trials mentioned above. The primary endpoint in the study was the weekly seizure rate in an 8-week interval on treatment, and as a covariate the seizure count in the preceding 8-week interval was used. The median number of seizures per week decreased from 2.3 to 2 in the placebo arm and from 3 to 1.9 in the progabide arm. These seizure rates are comparable to those of the above mentioned trials and follow a skewed distribution, as shown in Figure 1. This example is revisited and analyzed by negative binomial regression in Section 6.

STATISTICAL MODEL, HYPOTHESIS TESTING, AND SAMPLE SIZE PLANNING
Prior to presenting methods for BSSR, the focus of this article, we recapitulate the negative binomial regression model, discuss two group comparisons in the negative binomial regression model, and present sample size planning for the discussed hypothesis tests.

Statistical model
Let the random variable Y_i ∈ ℕ_0 denote the number of events of subject i = 1, …, n after some subject-specific follow-up time t_i ∈ [0, T] and let X_i ∈ ℝ^p be the corresponding p-dimensional vector of baseline covariates and group affiliation, in the following referred to as explanatory variables. We model the number of events Y_i by a negative binomial regression model. 37 In detail, the number of events Y_i has the probability mass function

P(Y_i = y | X_i) = Γ(y + φ) / (Γ(φ) y!) · (φ / (φ + μ_i))^φ · (μ_i / (φ + μ_i))^y,  y ∈ ℕ_0,

where μ_i is the expected number of events conditioned on the explanatory variable, that is,

μ_i = E[Y_i | X_i] = t_i exp(X_i^⊤ β).

The conditional variance is given by

Var(Y_i | X_i) = μ_i + μ_i² / φ.

We refer to φ as the shape parameter. For increasing φ, the variance converges to μ_i, the variance of a Poisson distribution. In this manuscript, we focus on the case in which the subjects are allocated to two treatment arms: the experimental treatment group and the control group. The treatment effect can be modeled through the explanatory variable X_i by setting its first entry to one, that is, X_i1 = 1, and its second entry to an indicator variable which is zero if subject i is in the control group, that is, X_i2 = 0, or one if subject i is in the experimental treatment group, that is, X_i2 = 1. Thus, the first entry β_1 of the parameter vector β corresponds to the intercept and the second entry δ := β_2 is the treatment effect in terms of the log rate ratio. We assume that smaller rates μ_i are better and, therefore, the smaller δ, the more efficient is the experimental treatment compared to the control. The statistical hypothesis testing problem of interest is given by

H_0: δ = δ_0  versus  H_1: δ ≠ δ_0,

where δ_0 ∈ ℝ; commonly δ_0 = 0 is chosen. In the following subsection, we present several statistical tests for the hypothesis H_0.
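As a sketch of this model, the following Python snippet simulates counts via the standard Gamma-Poisson mixture representation of the negative binomial distribution; all parameter values are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta = np.array([0.5, -0.2, 0.3])  # illustrative: intercept, treatment effect delta, covariate effect
phi = 3.0                          # shape parameter: Var(Y|X) = mu + mu**2 / phi

# explanatory variables: intercept, treatment indicator X_i2, one baseline covariate
x = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.standard_normal(n)])
t = np.ones(n)                     # follow-up times; log(t) enters as an offset
mu = t * np.exp(x @ beta)          # conditional mean

# Gamma-Poisson mixture: Y | G ~ Poisson(mu * G) with G ~ Gamma(phi, 1/phi)
y = rng.poisson(mu * rng.gamma(shape=phi, scale=1.0 / phi, size=n))
```

The empirical moments of `y` reproduce the conditional mean μ_i and the conditional variance μ_i + μ_i²/φ up to simulation error.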

Hypothesis testing
In this subsection, we discuss two common likelihood-based tests, a Wald and the LR test, for the negative binomial regression model. Both tests are conditional tests, that is, even though we modeled the explanatory variables X i (i = 1, … , n) as random, the hypothesis tests are defined conditioned on the explanatory variables.

Wald test
A Wald statistic is defined as the ratio of an estimator for the parameter of interest and its standard error. Here, we focus on estimating the log rate ratio δ through the maximum likelihood approach. Let β̂ be the maximum likelihood estimator of the parameter vector β, let φ̂ be the maximum likelihood estimator of the shape parameter φ, and let I(β, φ) be the corresponding FI matrix. The conditional FI matrix I(β, φ|X), where X is a placeholder for X_1, …, X_n, can be split into diagonal blocks relating to β and φ and zeroes otherwise, that is,

I(β, φ|X) = ( I_β(β, φ|X)   0 ;  0   I_φ(β, φ|X) ).
According to Lawless, 37 the maximum likelihood estimator β̂ asymptotically follows a multivariate normal distribution, that is, β̂ ∼ N(β, I_β(β, φ|X)^{-1}) asymptotically. Since the treatment effect δ is the second entry of the parameter vector β, the maximum likelihood estimator δ̂ is the second entry of the maximum likelihood estimator β̂. The maximum likelihood estimator δ̂ asymptotically follows a normal distribution,

δ̂ ∼ N(δ, c^⊤ I(β, φ|X)^{-1} c),

where c is a vector that has a one as the second entry and zeros otherwise. With these preliminaries, we define the Wald statistic as

T_W = (δ̂ − δ_0)² / (c^⊤ I(β̂, φ̂|X)^{-1} c).   (1)

The maximum likelihood estimators β̂ and φ̂ are consistent and, with Slutsky's theorem, it follows that c^⊤ I(β̂, φ̂|X)^{-1} c is a consistent estimator of c^⊤ I(β, φ|X)^{-1} c. Furthermore, under the null hypothesis H_0, the Wald statistic T_W is asymptotically χ² distributed with one degree of freedom, and an asymptotic level-α test of H_0 is obtained by rejecting H_0 when the Wald statistic T_W is larger than the (1 − α)-quantile χ²_{1,1−α} of a χ² distribution with one degree of freedom. It is worthwhile noting that the asymptotic distributions of the maximum likelihood estimators and the Wald statistic hold only conditioned on the explanatory variables.
We defined the test statistic such that it is asymptotically χ² distributed. However, the test statistic can also be defined such that it is asymptotically standard normally distributed. Since a squared standard normal random variable follows a χ² distribution with one degree of freedom, the two definitions are equivalent for testing H_0. Furthermore, defining the test statistic such that it is standard normally distributed would also allow for one-sided testing of H_0, whereas the tests based on the χ² distribution are inherently two-sided.
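A minimal numerical sketch of the Wald test follows. For brevity it treats the shape parameter φ as known and fits only β by Fisher scoring, whereas the paper's test uses the joint maximum likelihood fit of β and φ; all data-generating values are illustrative assumptions:

```python
import numpy as np

def fit_nb_beta(x, y, phi, n_iter=50):
    """Fisher scoring for beta in a negative binomial regression; phi is
    treated as known here (a simplification of the paper's joint ML fit)."""
    beta = np.zeros(x.shape[1])
    info = np.eye(x.shape[1])
    for _ in range(n_iter):
        mu = np.exp(x @ beta)
        score = x.T @ (phi * (y - mu) / (phi + mu))            # score vector for beta
        info = x.T @ ((phi * mu / (phi + mu))[:, None] * x)    # conditional FI block for beta
        beta = beta + np.linalg.solve(info, score)
    return beta, info

rng = np.random.default_rng(2)
n, phi, delta0 = 4000, 3.0, 0.0
x = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.standard_normal(n)])
mu = np.exp(x @ np.array([0.5, -0.3, 0.3]))                    # true delta = -0.3
y = rng.poisson(mu * rng.gamma(phi, 1.0 / phi, n))

beta_hat, info = fit_nb_beta(x, y, phi)
delta_hat = beta_hat[1]                                        # c picks the second entry
t_wald = (delta_hat - delta0) ** 2 / np.linalg.inv(info)[1, 1]
```

With a true log rate ratio of −0.3 and 4000 subjects, `t_wald` far exceeds the critical value χ²_{1,0.95} ≈ 3.841.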

LR test
Let Ω = ℝ^p × ℝ_{>0} be the parameter space associated with the parameter vector (β, φ). The parameter space Ω_0 ⊂ Ω denotes the subset with the second entry of β fixed at δ_0, that is, it is obtained by imposing the null hypothesis δ = δ_0 on the parameter space. Then, the LR statistic is defined by

T_LR = −2 log( sup_{(β,φ)∈Ω_0} L(β, φ|Y_1, …, Y_n, X_1, …, X_n) / sup_{(β,φ)∈Ω} L(β, φ|Y_1, …, Y_n, X_1, …, X_n) ),

where L(⋅, ⋅|Y_1, …, Y_n, X_1, …, X_n) denotes the likelihood function of the negative binomial regression model. 37 Under the null hypothesis H_0 and conditional on the explanatory variables, the LR statistic T_LR is asymptotically χ² distributed with one degree of freedom. 38 Thus, an asymptotic level-α test of the null hypothesis H_0 is obtained by rejecting the null hypothesis when the LR statistic T_LR is larger than the (1 − α)-quantile χ²_{1,1−α}.
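A numerical sketch of the LR statistic follows; as an assumed simplification, φ is treated as known in both fits (the paper maximizes over φ as well), and the restriction δ = δ_0 is imposed by moving the treatment column into the offset:

```python
import numpy as np
from scipy.special import gammaln

def nb_loglik(y, mu, phi):
    """Negative binomial log-likelihood with mean mu and shape phi."""
    return np.sum(gammaln(y + phi) - gammaln(phi) - gammaln(y + 1)
                  + phi * np.log(phi / (phi + mu)) + y * np.log(mu / (phi + mu)))

def fit_beta(x, y, offset, phi, n_iter=50):
    """Fisher scoring for beta with phi treated as known (a simplification)."""
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = np.exp(offset + x @ beta)
        info = x.T @ ((phi * mu / (phi + mu))[:, None] * x)
        beta = beta + np.linalg.solve(info, x.T @ (phi * (y - mu) / (phi + mu)))
    return beta

rng = np.random.default_rng(4)
n, phi, delta0 = 4000, 3.0, 0.0
x = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.standard_normal(n)])
y = rng.poisson(np.exp(x @ np.array([0.5, -0.3, 0.3]))    # true delta = -0.3
                * rng.gamma(phi, 1.0 / phi, n))

# unrestricted fit, and fit restricted to H0: delta = delta0 (treatment
# column moved into the offset with its coefficient fixed at delta0)
b1 = fit_beta(x, y, np.zeros(n), phi)
b0 = fit_beta(x[:, [0, 2]], y, delta0 * x[:, 1], phi)
t_lr = 2 * (nb_loglik(y, np.exp(x @ b1), phi)
            - nb_loglik(y, np.exp(delta0 * x[:, 1] + x[:, [0, 2]] @ b0), phi))
```

`t_lr` is nonnegative by construction and, under this alternative, clearly exceeds χ²_{1,0.95} ≈ 3.841.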

Power and sample size calculation
An integral step in the planning of a clinical trial is the determination of the sample size. When covariates are to be incorporated in the analysis, it is important to consider them in the sample size determination, even though they are not known prior to the trial, since they can affect the statistical power of the trial. There are two common approaches to define the power and the respective sample size to account for covariates in sample size planning, a conditional and an unconditional approach. The conditional power is the power of the statistical test for a given set of explanatory variables, and the unconditional power is the expected value of the conditional power over the distribution of the explanatory variables. 39 In practice, the unconditional power is of greater interest than the conditional power. However, versions of the conditional power can be used to approximate the unconditional power, as we will see below.
For both the conditional and unconditional approaches, we need to approximate the distribution of the test statistic under the alternative. Therefore, we start with approximating the distribution of the Wald statistic T_W and the LR statistic T_LR for a parameter δ_1 ≠ δ_0, conditional on the covariates. Let β_{H_1} be the parameter vector with δ_1 as its second entry and let φ_{H_1} denote the corresponding shape parameter. Under the alternative hypothesis, the distribution of the Wald statistic T_W can be approximated by a noncentral χ² distribution with one degree of freedom and noncentrality parameter

λ_W = (δ_1 − δ_0)² / (c^⊤ I(β_{H_1}, φ_{H_1}|X)^{-1} c).

The distribution of the LR statistic T_LR under the alternative hypothesis can be approximated by a noncentral χ² distribution with one degree of freedom and noncentrality parameter

λ_LR = −2 log( sup_{(β,φ)∈Ω_0} L_0(β, φ) / L(β_{H_1}, φ_{H_1}) ),

where L_0 denotes the likelihood function restricted to the null hypothesis.
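Under these noncentral χ² approximations, the (conditional) power can be evaluated directly; a small Python sketch:

```python
from scipy.stats import chi2, ncx2, norm

def wald_power(lam, alpha=0.05):
    """Approximate power of the two-sided test: P(chi2_1(lam) > chi2_{1,1-alpha})."""
    return ncx2.sf(chi2.ppf(1 - alpha, df=1), df=1, nc=lam)

# the familiar normal-approximation noncentrality (z_{1-alpha/2} + z_{power})^2,
# which should yield approximately the target power
lam_80 = (norm.ppf(0.975) + norm.ppf(0.80)) ** 2   # roughly 7.85
```

Here `wald_power(lam_80)` is approximately 0.80, and the power is monotonically increasing in the noncentrality parameter.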
In the following, we present two approaches for approximating the unconditional power based on which the sample size can be calculated using iterative methods.

Expanded dataset
Lyles et al 5 proposed to approximate the unconditional power by a conditional power for a given set of explanatory variables x_1, …, x_n on the basis of an expanded dataset (ED). In the following, we briefly recapitulate this approach for the negative binomial regression model. We assume the parameters β and φ to have certain values under the planning alternative and denote these as β_{H_1} and φ_{H_1}. An ED is created with J entries for each fictitious explanatory variable, where for each outcome value y_j = j − 1, j = 1, …, J, the weights

w_ij = P(Y = y_j | x_i; β_{H_1}, φ_{H_1})

are calculated assuming β_{H_1} and φ_{H_1} are true. The parameter J is chosen sufficiently large such that the weights sum up to approximately one for each i = 1, …, n:

Σ_{j=1}^{J} w_ij ≈ 1.   (2)

For example, one might select J such that the above sum is no smaller than a fixed threshold τ, for example, τ = 0.999. The ED consists of n × J tuples (y_j, x_i, w_ij) with i = 1, …, n and j = 1, …, J.
After setting up the ED, the weighted log-likelihood is maximized with respect to β and φ. We denote the resulting maximum likelihood estimators by β̂_ED and φ̂_ED, and the effect size, equal to the second entry of β̂_ED, by δ̂_ED. As stated in section 2.2 in Lyles et al, 5 the maximum likelihood estimators are equal to the assumed parameters β_{H_1} and φ_{H_1} and, more importantly, the inverse observed FI I(β̂_ED, φ̂_ED|X)^{-1}, calculated from the ED, can be used as an estimator for the matrix Var(β̂_ED). Specifically of interest for us is the variance of the effect size, Var(δ̂_ED), which is estimated by taking the corresponding entry from the inverse observed FI, that is, Var̂(δ̂_ED) = c^⊤ I(β̂_ED, φ̂_ED|X)^{-1} c. Substituting this estimate for its true value, the approximated noncentrality parameter λ̂_W,ED for the distribution of the Wald statistic under the alternative is given by

λ̂_W,ED = (δ_1 − δ_0)² / Var̂(δ̂_ED),

and the approximated statistical power of the Wald test (WT) is given by

P( χ²_1(λ̂_W,ED) > χ²_{1,1−α} ),

where χ²_1(λ̂_W,ED) denotes a random variable following a noncentral χ² distribution with one degree of freedom and noncentrality parameter λ̂_W,ED. Since the ED and, consequently, the corresponding approximate statistical power depend on the sample size n used for creating the ED, the sample size required to obtain a prespecified target power for the WT must be determined iteratively, for example, by using the bisection method. When determining the sample size required for the LR test, one would proceed similarly by approximating the noncentrality parameter λ_LR through the −2 weighted log-likelihood values from the unrestricted fit and the fit restricted to the null hypothesis.
The performance of this procedure relies on a sufficiently large J being obtained. 5 However, in some situations, J must become very large to reach the threshold τ, resulting in a large ED and high computation times. To reduce the required computational resources, we replace w_ij by the aggregated weight w̃_ij = P(Y ∈ {y_{j_1}, …, y_{j_{c(i,j)}}} | x_i), where an individual value y_j is replaced by a set of c(i, j) values. For instance, c(i, j) is chosen such that w̃_ij is larger than a set value, for example, 0.001. In the ED, we replace y_j by ỹ_j, where ỹ_j is a representative of the set {y_{j_1}, …, y_{j_{c(i,j)}}}, for example, the median or the value associated with the largest w_ij. We refer to this procedure as the binning procedure. For a numerical implementation of this algorithm, we refer to the R-code submitted as supplementary material.
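The construction of the ED weights can be sketched as follows (Python; the planning parameters are illustrative assumptions; note that scipy's `nbinom(r, p)` matches our parameterization with r = φ and p = φ/(φ + μ)):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(3)
n, phi = 50, 3.0
beta_h1 = np.array([0.5, -0.2, 0.3])     # planning alternative (illustrative)
x = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.standard_normal(n)])
mu = np.exp(x @ beta_h1)
p = phi / (phi + mu)                     # scipy parameterization: nbinom(phi, p)

tau = 0.999                              # threshold for the summed weights
J = int(nbinom.ppf(tau, phi, p).max()) + 2
ys = np.arange(J)                        # fictitious outcome values y_j = j - 1
w = nbinom.pmf(ys[None, :], phi, p[:, None])   # w[i, j] = P(Y = y_j | x_i)
```

Each row of `w` then sums to at least τ, and the weighted first moment reproduces μ_i, which is the reason the weighted ML fit on the ED recovers the planning parameters.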

Expected information
For a given set of explanatory variables x_1, …, x_n, the sample size of the WT and the LR test in the negative binomial regression model can be expressed in dependency of the FI matrix I(β, φ|X). Fisher's information matrix I(β, φ|X) is defined conditional on the explanatory variables, which we will emphasize by referring to it as the conditional information. In the following, we provide the sample size formula for the negative binomial regression based on the conditional information. Then, we illustrate how the unconditional power can be attained by numerical approximation. For the conditional power, assuming given parameters β_{H_1} and φ_{H_1} under the alternative H_1 and a given set of explanatory variables x_1, …, x_n, under the normal approximation of the Wald statistic T_W from Equation (1), the total sample size of the WT in the negative binomial regression model to yield a power of 1 − β is approximately the smallest n for which

(δ_1 − δ_0)² / (c^⊤ I(β_{H_1}, φ_{H_1}|X)^{-1} c) ≥ (z_{1−α/2} + z_{1−β})²   (3)

holds. The explanatory variables x_1, …, x_n, which include baseline covariates, are not available when planning a clinical trial, and therefore, it is not possible to calculate I(β_{H_1}, φ_{H_1}|X)^{-1} at the planning stage of the clinical trial. Instead, we focus on calculating a sample size n for which the unconditional power holds. Using the normal approximation of T_W, it follows that the unconditional power is equal to the expectation of the conditional power over the distribution of the explanatory variables,

1 − β = E_X[ P( χ²_1(λ_W(X)) > χ²_{1,1−α} ) ].

As only the FI depends on X, we suggest to approximate the integral by replacing I(β_{H_1}, φ_{H_1}|X) by its expectation, that is, the expected FI of a single subject Ĩ(β_{H_1}, φ_{H_1}) = E[ I(β_{H_1}, φ_{H_1}|X_1) ]. Thus, the approximate total sample size required to yield an unconditional power of 1 − β with the WT in the negative binomial regression model is given by

n = ⌈ (z_{1−α/2} + z_{1−β})² · c^⊤ Ĩ(β_{H_1}, φ_{H_1})^{-1} c / (δ_1 − δ_0)² ⌉.   (4)
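A numerical sketch of this calculation follows; the 1:1 allocation, T = 1, a single standard-normal covariate, restriction to the β-block of the information, and φ treated as known are all illustrative assumptions made for brevity:

```python
import numpy as np
from scipy.stats import norm

def wt_fi_sample_size(beta_h1, phi, delta0=0.0, alpha=0.05, power=0.8,
                      n_mc=200_000, seed=0):
    """Total sample size based on the expected per-subject FI of the beta-block,
    approximated by Monte Carlo over the covariate distribution (illustrative
    assumptions: 1:1 allocation, T = 1, one standard-normal covariate,
    phi known)."""
    rng = np.random.default_rng(seed)
    x = np.column_stack([np.ones(n_mc),
                         rng.integers(0, 2, n_mc),      # treatment indicator, r = 1
                         rng.standard_normal(n_mc)])    # baseline covariate
    mu = np.exp(x @ beta_h1)
    w = phi * mu / (phi + mu)
    fi = (x * w[:, None]).T @ x / n_mc                  # expected FI of one subject
    v_delta = np.linalg.inv(fi)[1, 1]                   # c^T I^{-1} c
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(z ** 2 * v_delta / (beta_h1[1] - delta0) ** 2))

n_total = wt_fi_sample_size(np.array([0.5, -0.2, 0.3]), phi=3.0)
```

As expected, a larger assumed effect size reduces the required sample size, since the noncentrality grows with (δ_1 − δ_0)².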

PROPOSED PROCEDURE FOR BSSR
To test the null hypothesis H_0: δ = δ_0, we introduced two tests, namely, a WT (Section 3.2.1) and an LR test (Section 3.2.2). For approximating the unconditional power of these tests under a given alternative, we also introduced two approaches, using the ED (Section 3.3.1) and using the expected information (Section 3.3.2). The first possibility, described in Section 3.3.1, is given by creating an ED. This approach requires prior information on the nuisance parameters β and φ, as well as on the support of the covariates, to create an ED, from which the power of the WT or LR test can be calculated for a given sample size. Calculating the sample size for a specific power is then possible by iterating the sample size until the target power is met. We will refer to these sample size calculation methods for the WT and LR test, using the ED, as WT-ED and LR-ED, respectively. The second possibility for calculating the power or sample size, as described in Section 3.3.2, is given by calculating the expected FI matrix. Based on assumptions on β and φ, as well as on the distribution of the covariates, the expected FI can be calculated numerically and plugged into the sample size formula (4). This approach is applicable to the described WT. We will refer to this sample size calculation method as WT-FI. The aim of this section is to demonstrate how these three methods for sample size calculation can be applied to construct BSSR procedures. Assume we have carefully considered a suitable time point for the sample size review. At this time point, the gathered data are still blinded to maintain trial integrity. We therefore observe outcomes y_1, …, y_m with corresponding explanatory variables x_1, …, x_m. As the trial is still running, follow-up times t_1, …, t_m are likely to vary, as recruitment is usually gradual over time. It is likely that some patients have not completed follow-up, and therefore, only partial follow-up information is available for these patients.
However, these incomplete observations should be included in the sample size review, as the resulting sample size is expected to be less variable. 21,40 All three sample size estimation methods, WT-ED, LR-ED, and WT-FI, are based on assumptions on the nuisance parameters β and φ. Therefore, for a BSSR, we estimate these nuisance parameters from the blinded data. Further relevant information required for the estimation of the nuisance parameters is the intended length of the study T and the allocation ratio r = n_E/n_C, where n_E and n_C denote the initial sample sizes of the experimental treatment and control group, respectively. As the group allocation is unknown, an observation from the blinded dataset follows a mixture distribution of two negative binomial distributions. Assuming that the true treatment effect δ_1 holds, a blinded estimator of β(δ_1) = (β_1, δ_1, β_3, …, β_p)^⊤ (regression coefficients with fixed effect size δ_1) and of φ is given by maximizing the likelihood of a mixture distribution of two negative binomial distributions with known weight r,

L_blind(β(δ_1), φ) = ∏_{i=1}^{m} [ 1/(1+r) · f(y_i | x_i, X_i2 = 0; β(δ_1), φ) + r/(1+r) · f(y_i | x_i, X_i2 = 1; β(δ_1), φ) ],

where f denotes the probability mass function of the negative binomial regression model. By maximizing this blinded likelihood, blinded estimates β̂_blind and φ̂_blind for the nuisance parameters β(δ_1) and φ are obtained. These are then used to recalculate the sample size by plugging in β̂_blind for β and φ̂_blind for φ in the corresponding sample size calculation method. For the methods WT-ED and LR-ED, the required sample size to yield the target power, given the blinded estimates, is calculated iteratively. Using the bisection method, for example, the following steps are necessary to obtain the reestimated sample size.
1. Choose an initial maximum total sample size n_max and a minimal total sample size n_min.
2. Set n = (n_max + n_min)/2 and create an ED, plugging in β̂_blind and φ̂_blind for β and φ to calculate the weights w_ij.
3. From the ED, attain the maximum likelihood estimators β̂_ED, φ̂_ED, δ̂_ED, and the variance estimate Var̂(δ̂_ED) to calculate the noncentrality parameter λ̂_W,ED or λ̂_LR, depending on the desired test.
4. Calculate the power of the WT or LR test for n subjects, using the noncentrality parameter λ̂_W,ED for the WT or λ̂_LR for the LR test.
5. If the calculated power is higher than the desired power, set n_max := n, else set n_min := n.
6. Iterate steps 2-5 until the calculated power is approximately equal to the desired power.
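These bisection steps can be sketched generically over the total sample size; here the power function is an illustrative stand-in whose noncentrality grows linearly in n (in the actual procedure it would be the ED-based power calculation):

```python
from scipy.stats import chi2, ncx2

def reestimate_n(power_of_n, target=0.8, n_min=2, n_max=10_000):
    """Bisection over the total sample size; power_of_n must be
    nondecreasing in n. Returns the smallest n reaching the target power."""
    while n_max - n_min > 1:
        n = (n_max + n_min) // 2
        if power_of_n(n) >= target:
            n_max = n
        else:
            n_min = n
    return n_max

# illustrative power function: Wald-test power with noncentrality
# lam(n) = kappa * n, kappa = 0.02 (a stand-in, not the paper's values)
kappa = 0.02
crit = chi2.ppf(0.95, df=1)
n_star = reestimate_n(lambda n: ncx2.sf(crit, df=1, nc=kappa * n))
```

The returned `n_star` is the smallest sample size whose approximate power reaches the 80% target.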
For the method WT-FI, the calculation of the reestimated sample size is computationally less expensive and can be summarized in the following steps:

1. Calculate the expected FI Ĩ(β, φ), plugging in β̂_blind and φ̂_blind for β and φ.
2. Plug the calculated expected FI Ĩ(β̂_blind, φ̂_blind) in for the true FI Ĩ(β_{H_1}, φ_{H_1}) in the sample size formula (4) to attain n.
Having followed these steps, the total sample size in the study can now be adjusted to n, where the group-specific sample sizes are n_E = n·r/(1+r) for the treatment group and n_C = n/(1+r) for the control group. When calculating the expected FI, the covariate distribution is also required. However, because the covariate distribution is assumed to be equal over all treatment groups, a maximum likelihood estimation of the parameters of the covariate distribution on the blinded data is possible and sufficient. The reestimated sample size should maintain the power even when initial assumptions on nuisance parameters prove to be wrong. This is shown in the following section.

[Table 1: Simulation settings for the comparison of type I error rates between the fixed design and the blinded sample size reestimation procedures]
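The blinded nuisance-parameter estimation described above can be sketched as follows (Python; the data-generating values, 1:1 allocation, and single normal covariate are illustrative assumptions, and δ is fixed at the assumed value δ_1 rather than estimated):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

rng = np.random.default_rng(5)
m, phi_true, delta1, r = 4000, 3.0, -0.2, 1.0
z = rng.standard_normal(m)                 # baseline covariate (observed)
treat = rng.integers(0, 2, m)              # group labels, unknown to the blinded analyst
mu = np.exp(0.5 + delta1 * treat + 0.3 * z)
y = rng.poisson(mu * rng.gamma(phi_true, 1.0 / phi_true, m))

def neg_blind_loglik(theta):
    """Negative log-likelihood of the two-component NB mixture with known
    mixture weight r and the treatment effect fixed at delta1."""
    b1, b3, log_phi = theta
    phi = np.exp(log_phi)
    mu_c = np.exp(b1 + b3 * z)             # control-arm mean
    mu_e = np.exp(b1 + delta1 + b3 * z)    # treatment-arm mean
    mix = (nbinom.pmf(y, phi, phi / (phi + mu_c))
           + r * nbinom.pmf(y, phi, phi / (phi + mu_e))) / (1 + r)
    return -np.sum(np.log(mix))

res = minimize(neg_blind_loglik, x0=np.zeros(3), method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-8})
b1_hat, b3_hat, phi_hat = res.x[0], res.x[1], np.exp(res.x[2])
```

The blinded estimates recover the nuisance parameters without using the group labels; they would then be plugged into WT-FI, WT-ED, or LR-ED.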

OPERATING CHARACTERISTICS OF THE BSSR PROCEDURES
In this section, we investigate the influence of a BSSR on type I error rate, as a regulatory requirement of adaptive designs is the control of the type I error rate, and study the power to see if the BSSR methods can mitigate overpowering and underpowering when the nuisance parameters are misspecified. The binning procedure, described in Section 3.3.1, is only applied in selected settings, that is, Appendix A.1.

Type I error rate
Statistical analyses performed within a running trial, which may result in changes to the trial design, can impact the type I error rate. To investigate whether the BSSR procedures influence the type I error rate, we conducted a simulation study including all three reestimation methods introduced in Section 4, under a variety of simulation settings displayed in Table 1. All simulations are conducted with one normally distributed covariate, which is standardized for the BSSR using the sample mean and standard deviation (SD). Group sizes are assumed to be equal and the total observation time is one, that is, r = 1 and T = 1. For calculating the sample size of the fixed design, the assumed values from Table 1 were considered. The required sample size for rejecting the null hypothesis H_0: δ = 0 under the alternative δ = δ_1 = −0.2 with a power of 80% at level α = 5% for each of the three methods is n_WT-FI = 379, n_WT-ED = 381, and n_LR-ED = 383, per group. Data were then simulated with values differing from the ones assumed for the sample size planning. First, covariates of the first half of the planned sample were generated from a standard normal distribution. Then, 25% of the initially planned observations were generated as complete observations, before a further 25% of the initially planned observations were generated with random follow-up times t_i, uniformly distributed on [0, T], to simulate varying follow-up times. The generated data were then taken to calculate the reestimated sample sizes n̂_WT-FI, n̂_WT-ED, and n̂_LR-ED. Having calculated the reestimated sample size, the 25% of the data with incomplete follow-up times were completed and further observations were generated accordingly.
Inference was then performed on n_WT-FI, n_WT-ED, and n_LR-ED observations per group for the fixed design and on n̂_WT-FI, n̂_WT-ED, and n̂_LR-ED observations per group for the designs with BSSR. The results are visualized in Figure 2.
The mean and median type I error rates across the three methods, summarized over all settings depicted in Figure 2, were 0.046 and 0.050, respectively. The SD of the type I error rate was 0.015 for all three methods. These summary statistics agreed between the fixed design and the BSSR procedures up to the third decimal. Therefore, we conclude that the BSSR does not inflate the type I error rate and can be used without a type I error rate adjustment.

Power
To demonstrate the benefits of a BSSR, we conduct a simulation study that compares the power of the fixed design with the power of a design with sample size adjustment, under the settings displayed in Table 1, with the only difference being that the true log rate ratio is equal to the assumed log rate ratio. The observed covariate is assumed to be from a normal distribution and is standardized prior to the statistical analysis. Assumptions on the duration of the trial, sample size allocation, and covariate distribution are the same as in Section 5.1. Therefore, the sample sizes for the fixed design are also equal, that is, n_WT-FI = 379, n_WT-ED = 381, and n_LR-ED = 383 per group. The data of the simulation study are generated analogously to Section 5.1. The results of the comparison of the fixed design with the BSSR procedures are shown in Figure 3, which compares the sample size and implied power of the fixed design and BSSR for the WT-FI procedure. Apart from a small set of simulation settings, the results from WT-ED and LR-ED are very similar and are therefore displayed in Figures A1 and A2 in the Appendix. We can conclude that the BSSR corrects the sample size when the true parameters differ from the assumed ones: the reestimated sample size maintains 80% power, while the sample size from the fixed design does not achieve the desired power when the true parameters differ from the assumed values. In the regarded settings, these changes are visible for misspecifications of the intercept parameter β_1, the covariate parameter β_3, and the shape parameter φ. However, the impact of these parameters is specific to the setting and may be different for other parameter combinations. In addition to changes of the nuisance parameters, the effect size may also be misspecified in the BSSR.
While the presented BSSR procedure does not correct the sample size due to a misspecified effect size, a possible misspecification did not result in any relevant influence on the sample size. Simulation results are reported in Appendix A.3. Note that the relevant treatment effect δ is not estimated from the data, but defined by the user.

EXAMPLE STUDY REVISITED
The results of the negative binomial regression model applied to the double-blind randomized study by Leppik et al, 34 comparing progabide and placebo in patients with epilepsy, are displayed in Table 2. For illustration, we consider one normally and one binomially distributed covariate. Ignoring the covariate effects (on the left hand side of the table), the treatment effect is quite small (rate ratio equal to 0.93), resulting in a large corresponding p-value. When the baseline seizure counts on the log-scale and the age category (< 30 years vs ≥ 30 years) are included in the model, it becomes evident that the patient history has a large effect on outcomes. However, the treatment effect, which is now defined conditional on the covariates, is also larger (rate ratio equal to 0.77), and the corresponding p-value becomes smaller. Taking into account the standard errors of the treatment effect estimates, differences between the marginal and conditional treatment effects are not as large anymore, as the confidence intervals ([−0.569, 0.415] and [−0.558, 0.022], respectively) overlap. The included covariates explain some of the variance, which is reflected on the one hand in a smaller AIC, which decreases from 538.2 to 472.3, and on the other hand, in a larger shape parameter in the conditional model, which increases from 1.11 to 3.69. Leppik et al 34 concluded that there was no statistically significant difference between progabide and placebo regarding the seizure rates and that progabide is not a potent antiepileptic drug in the investigated population. For illustrative purposes, let us assume that the example study is an IPS of a phase III trial and the collected data are used for a BSSR. This means that we cannot conduct an unblinded analysis as the one presented in Table 2, as the data would still be blinded. We merely have the number of seizure counts and pretreatment seizure counts for each patient but not the treatment group allocations.
Assuming a treatment effect of a log rate ratio of −0.2 (corresponding to a rate ratio of 0.82), the nuisance parameters can be estimated in a blinded manner. Table 3 displays the estimated nuisance parameters for all methods.
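A minimal sketch of such a blinded (mixture) estimation step, assuming a 1:1 allocation and the gamma-mixture parameterization; all numerical values are illustrative and not the actual trial estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, logsumexp

rng = np.random.default_rng(7)

DELTA = -0.2  # assumed log rate ratio (held fixed, not estimated)

def nb_logpmf(y, mu, k):
    return (gammaln(y + k) - gammaln(k) - gammaln(y + 1)
            + k * np.log(k / (k + mu)) + y * np.log(mu / (k + mu)))

def blinded_negloglik(params, y, delta=DELTA, p_trt=0.5):
    """Mixture likelihood for blinded counts: each observation stems from
    control (rate mu_c) or treatment (rate mu_c * exp(delta)) with known
    allocation probabilities; delta is fixed at the assumed effect."""
    mu_c, k = np.exp(params)
    comps = np.stack([
        np.log(1 - p_trt) + nb_logpmf(y, mu_c, k),
        np.log(p_trt) + nb_logpmf(y, mu_c * np.exp(delta), k),
    ])
    return -np.sum(logsumexp(comps, axis=0))

# blinded data: two negative binomial samples pooled without group labels
n, mu_c, k_true = 300, 3.0, 2.0
lam = np.concatenate([
    rng.gamma(k_true, mu_c / k_true, n),
    rng.gamma(k_true, mu_c * np.exp(DELTA) / k_true, n),
])
y_blind = rng.poisson(lam)  # the ordering carries no group information

res = minimize(blinded_negloglik, x0=[0.0, 0.0], args=(y_blind,),
               method="Nelder-Mead")
mu_hat, k_hat = np.exp(res.x)
```

The fitted pair (mu_hat, k_hat) plays the role of the blinded nuisance parameter estimates entering a table such as Table 3; the assumed effect DELTA is held fixed throughout, mirroring the fact that the BSSR does not reestimate the treatment effect.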
With the nuisance parameter estimates from Table 3, a sample size of about 128 patients per group is obtained with the WT-FI approach to detect the assumed effect with a power of 80% at a two-sided significance level of α = 5%. Without considering covariates, using the method of Friede and Schmidli, 23 a sample size of about 353 patients per group would have been required.
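How a reestimated per-group sample size can be obtained from such blinded estimates may be sketched with a simplified Wald-type formula for the unadjusted two-group comparison, in the spirit of the Friede and Schmidli approach cited above; this is not the covariate-adjusted WT-FI computation, and all numerical inputs are illustrative:

```python
import math
from scipy.stats import norm

def nb_sample_size(rate_c, delta, shape, t=1.0, alpha=0.05, power=0.80, r=1.0):
    """Per-group (control) sample size for a two-sided Wald test of the
    log rate ratio delta, using the approximation
    Var(log rate-hat_i) ~ (1/n_i) * (1/(rate_i * t) + 1/shape),
    with common follow-up time t and allocation ratio r = n_trt / n_ctrl."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    rate_t = rate_c * math.exp(delta)
    var_unit = (1 / (rate_c * t) + 1 / shape) \
        + (1 / (rate_t * t) + 1 / shape) / r
    return math.ceil(z ** 2 * var_unit / delta ** 2)

n_per_group = nb_sample_size(rate_c=3.0, delta=-0.2, shape=2.0)
```

Including predictive baseline covariates reduces the residual variance (equivalently, increases the shape parameter of the conditional model), which is exactly why the covariate-adjusted sample size of 128 per group in the example is so much smaller than the unadjusted 353.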

DISCUSSION
Negative binomial regressions are popular for modeling count outcomes in clinical trials for various reasons, including the following. First, between-patient heterogeneity in the susceptibility to the event of interest is modeled through a Gamma mixture distribution of the individual rates. Second, varying follow-up times can be accommodated by including the logarithms of the follow-up times as an offset in the model. Third, baseline covariates can easily be included in the model by linking the rates through the natural logarithm to a linear predictor. Sample size calculations for trials with negative binomial regressions as primary analyses are often based on guesstimates of the nuisance parameters, that is, the control or overall rate, the shape parameter, and the regression coefficients of the covariates, as information on these parameters is often scarce in the design phase. This is partly due to poor reporting of some of the relevant nuisance parameters. 41,42 Here we suggested BSSR procedures based on different methodological approaches. It turned out that these approaches lead to very similar results in practice. They all have in common (a) that they did not inflate the type I error rate in any practically relevant way under the null hypothesis and (b) that their power was close to the target under the planning alternative.
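The gamma-mixture construction behind the first of these features can be checked numerically: mixing Poisson rates over a gamma distribution reproduces the negative binomial mean-variance relationship Var(Y) = μ + μ²/k. A minimal sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(11)
n, k, mu = 500_000, 2.0, 3.0

# between-patient heterogeneity: individual Poisson rates drawn from a
# gamma distribution with mean mu and variance mu^2 / k
lam = rng.gamma(shape=k, scale=mu / k, size=n)
y = rng.poisson(lam)

# with varying follow-up times t_i one would instead set
# mu_i = exp(beta' x_i + log t_i), i.e., include log t_i as an offset
mean_y, var_y = y.mean(), y.var()   # approx. mu and mu + mu**2 / k
```

With mu = 3 and k = 2, the empirical mean is close to 3 and the empirical variance close to 3 + 9/2 = 7.5, i.e., clearly overdispersed relative to a Poisson model.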
In the presented procedures, the blinded data are treated as observations from a mixture distribution, restricted to the alternative hypothesis β₂ = β₂*, and assuming that the allocation ratio r is known. This approach has been used before, for example, by Asendorf et al 40 and Schneider et al. 22 The approach is an alternative to the so-called lumping approach, 43 employed by Friede and Schmidli 23,24 as well as Schneider et al. 21 The latter treats the blinded data as if they were from one common negative binomial distribution, which is also feasible under the assumption that the expected efficacy parameter holds. Both approaches have been shown to differ only marginally. 40 A discussion of the mixture and lumping approaches applied to count data is also included in Friede et al. 44 There were no notable differences between the three procedures, apart from a small number of settings. For settings in which the covariate parameter was large (β₁ = 0, β₂ = −0.2, β₃ ≥ 2.5, and shape parameter equal to 3), the procedure by Lyles tended to overestimate the sample size. This is due to computational limitations encountered with this procedure. During the simulations, it became evident that, to adequately calculate the power within these settings, J in Equation (2) would need to be increased substantially, say from 1000 to 10 000 or even 50 000. This drawback was overcome by the proposed binning procedure, which reduces the computational effort considerably and at the same time improves the results. To further stabilize the numerical calculations, we also considered standardizing the normally distributed covariate. Leaving covariates on their original scale will in theory not change the results of the BSSR, but may lead to numerical instability when inverting the expected FI in Equation (4), in particular when the parameter values are of different orders of magnitude.
The considered model and the proposed BSSR procedures could be extended in various ways. For instance, time-varying rates are not uncommon; see Nicholas et al 45 for an example in relapsing MS and Mütze et al 46 for one in pediatric MS. Furthermore, the assumption of a common shape parameter across treatment groups is the default in standard software, but might need to be relaxed in some applications. Also, alternatives to including the pre-trial event rate as a covariate have been discussed. 47

A.3 Misspecification of the effect size
An additional simulation study was conducted to investigate whether a misspecified effect size influences the reestimated sample size. The BSSR does not reestimate the effect size; therefore, the BSSR procedure cannot be expected to adjust the sample size to meet the predefined power in this case. The results are given in Figure A3.
In Figure A3, we consider the same settings as described in Section 5.1. All nuisance parameters are equal to the assumed nuisance parameters. The effect size, however, differs from the assumed effect size and ranges from −0.3 to −0.1 in steps of 0.01. The simulation reveals that under a misspecified treatment effect, the BSSR performs similarly to the fixed design. All settings were simulated 10 000 times.

FIGURE A3: Sample size comparison of the fixed design and blinded sample size reestimation for the methods WT-FI, WT-ED, and LR-ED. The simulation settings correspond to Table 1, where the nuisance parameters are kept equal to the expected values but the effect size is altered sequentially.