Permutation‐based multiple testing corrections for P‐values and confidence intervals for cluster randomized trials

In this article, we derive and compare methods to derive P‐values and sets of confidence intervals with strong control of the family‐wise error rates and coverage for estimates of treatment effects in cluster randomized trials with multiple outcomes. There are few methods for P‐value corrections and deriving confidence intervals, limiting their application in this setting. We discuss the methods of Bonferroni, Holm, and Romano and Wolf and adapt them to cluster randomized trial inference using permutation‐based methods with different test statistics. We develop a novel search procedure for confidence set limits using permutation tests to produce a set of confidence intervals under each method of correction. We conduct a simulation‐based study to compare family‐wise error rates, coverage of confidence sets, and the efficiency of each procedure in comparison to no correction using both model‐based standard errors and permutation tests. We show that the Romano‐Wolf type procedure has nominal error rates and coverage under non‐independent correlation structures and is more efficient than the other methods in a simulation‐based study. We also compare results from the analysis of a real‐world trial.


INTRODUCTION
For a randomised controlled trial, the requirement to state a single primary outcome has become accepted, even required, practice.For example, the influential CONSORT statement on clinical trials requires the pre-specification of a single primary outcome, which they describe as the "outcome considered to be of greatest importance to relevant stakeholders", and recommends against multiple primary outcomes. 1The reason for this is to ensure appropriate control of the "false discovery rate" when using null hypothesis significance testing. 2 If there are multiple outcomes each with their own associated treatment effect being tested separately, then we are implicitly testing a family of null hypotheses against an alternative that at least one of them is false.Without correction, the type I error rate for this family of null hypotheses will be much greater than the nominal rate of any single test. 3Indeed, the CONSORT statement notes that that multiple primary outcomes are not recommended as it "incurs the problem of multiplicity of analyses". 4luster randomised trials are a widely used method to evaluate interventions applied to groups of people, such as clinics, schools, or villages.Often these interventions target 'higher level' processes and can be complex in nature. 5,6,7Recent examples from our own work include an incentive scheme to improve implementation of a broad package of education and activities designed to improve employee health in the workplace, 8 or a community health worker programme targeting multiple health conditions. 9The effects of such complex interventions cannot be adequately summarised by a single outcome.Creating a composite outcome is undesirable since it requires applications of arbitrary weights across outcomes and discards information by collapsing a multivariate outcome to a univariate one.The requirement for a single primary outcome therefore clashes with the needs of many cluster randomised trials.The solution is to ensure appropriate methods are used where there are multiple outcomes of interest rather than restricting the outcomes from which we can make inferences.However, the question of appropriate analysis for randomised trials, and particularly cluster randomised trials, with multiple outcomes can be contentious and complex.
The Food and Drug Administration (FDA), the main regulatory body for medicines in the United States, declares that "If the purpose of the trial is to demonstrate effects on all of the designated primary variables, then there is no need for adjustment of the type I error". 10They also identify a "gatekeeping" approach where "statistical significance" on a primary outcome is required before a second one can be analysed and state this does not need correction for multiple testing.Other authors differentiate aiming to declare "statistical significance" on at least one of a group of null hypotheses to requiring statistical significance for all tests in order to reject any individual test, and propose different solutions for both. 11,12here a correction for multiple testing is deemed necessary, we can divide solutions into: (i) multivariate methods that model the joint distribution of the outcomes, which is particularly favoured by Bayesian practitioners; 13 and (ii) univariate solutions that aim to ensure inferential statistics for a set of estimands collectively have the appropriate Frequentist properties. 14In this article, we focus on the latter approaches in a Frequentist setting.Despite the different approaches and guidance, Wason et al 2 estimated that only around half of all randomised trials with multiple outcomes or arms corrected for multiple testing.No evidence is available on the use of corrections for multiple testing in cluster randomised trials specifically, but there are few, if any, comprehensive discussions of methods in this area currently available.Furthermore, almost all discussion of multiple testing adjustment relates to corrections for p-values, with few, if any, solutions for confidence intervals.The FDA note that correcting confidence intervals is complex and beyond the scope of their advice.However, the duality between hypothesis testing and confidence intervals means that we should be able to identify the bounds of a 'confidence set' adjusted for multiple testing. 3,15he primary limiting factor to using corrected confidence intervals is that there are no proposed methods for determining these bounds efficiently.
In this article, we develop several methods for adjusting p-values for multiple testing for a cluster randomised trial setting using permutation-test based methods, by adapting existing methods of correction, and propose a novel method to derive corrected confidence sets.We then compare these methods in a simulation-based study to evaluate type I error rates and efficiency of the different procedures.Our analysis is based on generalised linear mixed models, which are frequently used in the analysis of cluster trials.We also focus on permutation-based methods, since these methods provide exact inference at all sample sizes.A small number of clusters, which is common to many cluster trials, can result in small sample biases in the standard error estimator and inflated type 1 errors, 16,17,18 which results in complication when it comes to considering additional corrections for multiple testing.Section 2 provides a review and discussion of methods for correcting for multiple testing and their adaptation to a cluster randomised trials setting, Section 3 presents a simulation-based comparison, Section 4 provides an applied example, and Section 5 concludes.

The multiple testing problem
We first suppose that data are generated from some probability distribution , which belongs to some family of probability distributions Ω.The family Ω could be a parametric, semi-parametric, or non-parametric model.The multiple testing problem arises when we have a set of hypotheses versus ′ for = 1, ..., , following the notation of Romano and Wolf. 3 These hypotheses in our context are typically estimates of the treatment effect of an intervention on multiple outcomes.Each of the hypotheses is a subset ⊂ Ω and is equivalent to testing ∈ against ∉ .So for any subset ⊂ 1, ..., , = ∩ ∈ is the hypothesis that ∈ ∩ ∈ .We assume each null hypothesis is based on a test statistic ; we denote the -quantile of the distribution of as ( , ).In a traditional null hypothesis testing framework we "reject" in favour of ′ at the level, if ≥ (1 − , ), which clearly has probability .Conversely, the p-value of the test is where = (1 − , ), so that the probability of observing ( > (1 − , )| ) = .The family-wise error rate (FWER) of this set of hypotheses is the probability of "rejecting" at least one true null hypothesis.That is, if = ( ) ⊂ 1, ..., are the indices of the true null hypotheses, so ∈ if and only if ∈ , then the FWER is the probability under of rejecting any ∈ , i.e. (∪ ∈ > (1 − , )), which should be .

Methods for correcting for multiple testing
Solutions to the multiple testing problem aim to ensure that ≤ .Control over the FWER is said to be strong if it holds for any combination of true and false null hypotheses, and weak if it only holds when all null hypotheses are true. 19Several approaches exist to control the FWER.The Bonferroni method is probably most well known, which sets the critical value for the test of the null hypothesis to be (1 − ∕ , ).Equivalently, p-values that maintain the FWER for the family of null hypotheses ensure that (∪ ∈ > (1 − , )) = , so a crude 'corrected' p-value for the null hypothesis using the Bonferroni method would be min( , 1).However, while this method exerts strong control over the FWER, it is highly conservative.Holm 20 proposed a less conservative 'stepdown' approach to multiple testing.One orders the test statistics from largest to smallest and then compares the largest statistic to the critical value (1 − ∕ , ).If the test statistic is larger than this value, then the null hypothesis is rejected, otherwise we do not reject any null hypothesis and stop.If we rejected, then the next largest test statistic is compared to (1 − ∕( − 1), ), and again it is either rejected, or we do not reject all remaining null hypotheses and stop, and so forth.A crude corrected p-value could therefore be obtained by multiplying the smallest to the largest p-values by , − 1, etc, respectively.The Holm method is less conservative than the Bonferroni method, 20 but it may still be inefficient as, like the Bonferroni method, it does not explicitly take into account the dependence structure in the data.Romano and Wolf 3,15 developed an efficient resampling based version of Holm's stepdown method, which can use permutation-based tests in the context of a cluster randomised trial.

Permutation-based corrections for multiple testing
An issue that complicates analyses of cluster randomised trials is that test statistics can fail to have the expected sampling distribution in a range of circumstances, but particularly when the number of clusters is small. 21,16,18,17This issue means determining the critical value of a hypothesis test, even in the absence of any multiple testing issue, can be difficult.While there exist several small sample corrections in the literature their performance often depends on the correlation structure, which is not known. 16,18n alternative approach is to use a permutation testing method based on the randomisation scheme for the trial.In particular, the null hypothesis implies that the distribution of the data is invariant under a set of transformations in , which has elements.So, and have the same distribution for all ∈ whenever has distribution ∈ . in the context of cluster randomised trials is the set of all transformations that could be generated by the randomisation mechanisms, for example, all ways of dividing the clusters into two groups for a parallel design.Our observed test statistics with our sample data are ( ).The test statistic generated by the th permutation is ( ) for ∈ and = 1, ..., .We can use this approach to estimate the critical values for the Bonferroni or Holm corrections.For example, for Bonferroni: where ,| (1− ∕ )| is the (1 − ∕ )th (or nearest integer) largest value from the permutations.And a crude, corrected two-sided p-value is: where is the indicator function and abs is the absolute value.The same approach can be used for the Holm method.Romano and Wolf 3,15 developed a modified stepdown approach to take advantage of resampling methods.Their process is optimal in a maximin sense.We describe the general stepdown procedure of Romano and Wolf firstly in terms of accepting or rejecting each null hypothesis at an -level.We let ( , ) denote an -quantile of the distribution of the statistic: for any subset of null hypotheses .We also denote For each permutation we can determine the test statistic as in Equation (3) as , = max ∈ ( ).As before we denote ,| | as the th largest of all the permutational test statistics { , ; = 1, ..., }.Then our estimator for the critical value is: We can see how this procedure produces p-values for a two-sided hypothesis that also maintains the FWER for a given 22 , in particular: For a one-sided test we would not use the absolute values of the test statistics.
Often the size of can be very large, and increases exponentially with the number of clusters.A Monte Carlo approach can be used that instead generates a random subset of of fixed sized in order to generate realisations of the test statistics.If we conduct such permutations then the estimator of the p-value for a given null hypothesis versus some alternative is Obtaining p-values in this way is described in detail by Romano. 22Values of = 1, 000 or greater are often used as this results in relatively small Monte Carlo error, although much larger values (e.g.10,000 or 100,000) may be preferred for formal or final analyses.
In subsequent sections, we develop and compare Bonferroni, Holm, and Romano-Wolf methods, however, we note there are several other multiple testing corrections in the literature, including Hochberg's 'step-up' procedure, 23 Hommel's 'stagewise' procedure, 24 and Šidák's procedures 25 (see also 14 for a discussion).More exhaustive comparisons of these methods in other settings, such as 26,27,28,29 , show that they all maintain a FWER ≤ , but that Holm's, Hommel's, and Hochberg's procedures generally are the most efficient and perform very similarly.However, these comparisons do not include the Romano-Wolf method, which purports to be at least as efficient as Holm's procedure. 3We note that Westfall and Young 30 propose an early version of a resampling based multiple testing correction similar to Romano-Wolf, which is included in the comparison by Alberton et al 29 in the context of modelling brain imaging data.We adapt only a subset of all methods, but believe the application of other methods in the context we describe below, including any developed after the publication of this article, should be clear from the discussion of these four key approaches.

Permutation test statistics for cluster trials
We next introduce a generalised linear mixed model commonly used in the analysis of cluster randomised trials (e.g. 31 ).We denote as the outcome of the th individual, = 1, ..., , in cluster = 1, ..., at time = 1, ..., .We include a temporal dimension in this discussion for generality, however, it can be ignored as required.Our simulation-base comparisons include both examples with and without a temporal dimension.We do not restrict the outcome, it could be continuous or discrete.We specify the linear predictor: where is an indicator for whether cluster has received the intervention at time and so is the parameter of interest, our "treatment effect".We also have a vector of individual and/or cluster-level covariates, , which may also contain temporal fixed effects.The parameter represents a general 'random-effect' term that captures the within cluster and cluster-time correlation, although we do not provide a specific structure here.The overall model is then where ℎ(.) is a link function.For example, could be a Binomial distribution and ℎ(.) the logistic link function.
Gail et al 32 provided the first extensive examination of permutation tests for cluster-based study designs.Their work principally used unweighted differences of cluster means as the basis of permutation tests (see also 33 ).Several other authors have also developed and evaluated permuatation-tests and test statistics in the context of cluster trials. 34,35,36,37,38,39Here, we build on the statistic proposed by Braun and Feng 40 .
Braun and Feng 40 examine optimal permutation tests for cluster randomised trials specifically.They derive a 'quasi-score' statistic using the marginal likelihood of the data modelled separately from the correlation structure of the data.The marginal mean of each observation, ignoring the cluster-effects , is The "quasi-score" statistic, which is weighted sum of generalised residuals, is then: where ) vector of modified intervention indicators equal to 1 if the intervention was present in cluster at time and -1 otherwise, and where = ∑ and is the number of individuals in cluster at time .is a (1 × ) vector with elements ( ℎ −1 ∕ ) −1 , and is an ( × ) covariance matrix for cluster with non-zero elements off its diagonal.As an example, if we assume the data are normally distributed with mean , identity link function, variance 2 , and ∼ (0, 2 ), then the diagonal elements of are 2 + 2 and the off-diagonal elements are 2 .More complex structures might include temporal decay in correlation, for example.We use Θ to represent the parameters of the variance-coviarance matrix.Finally [ − ] are generalised residuals: ] is a (1 × ) vector of outcomes and is a (1 × ) vector of means.For the permutation test to be valid the 'nuisance' parameters ( , , Θ), i.e. those other than , must be invariant to permutation. 40This means we cannot re-estimate them for each new permutation.In practice the maximum likelihood estimates of these parameters are used to construct the test statistic, so that we use the estimates: for the linear predictor under the null 0 ∶ = * .Estimating Θ is more difficult, however, particularly when the number of clusters is small. 21,16As an alternative to (11) we can replace −1 with a (1 × ) vector of ones: so that the sum of residuals is 'weighted' only by the size of each cluster or cluster-time period.One can see that under homoscedasticity the two test statistics will be approximately proportional.The weighted statistic weights the residuals in proportion to their variance, so in non-linear models with differing variances (e.g.different linear predictors over time) we may expect to see an improvement in efficiency.The quasi-score statistics are the motivation behind quasi-likelihood approaches, including GEE methods. 41,40Thus, the tests and corrections described here can be implemented within a GEE framework.However, our simulations in Section 3, model estimation, and the software we provide to implement the methods uses a more explicitly GLMM formulation.The quasi-score statistic is equivalent for full and marginal likelihoods using linear Gaussian models, or when using the 'unweighted' variant described below.For non-linear alternatives though, the quasi-score statistic is an approximation to the full likelihood.In terms of our implementation of the computation of (11), we use a GLMM formulation (Equation 8) and use the original estimates of the covariance parameters to generate an estimated inverse covariance matrix ̂ −1 , which is then re-used for each iteration.
For the purposes of correcting for multiple testing we use studentized versions of the two test statistics: where the terms on the right-hand side have been evaluated at = * .We describe as the "weighted test statistic" and as "unweighted".In the absence of studentization, the variances of the test statistics are not scale-free and depend on, among other things, the null hypothesis being tested so that different tests will have different power. 3The lack of balance is particularly consequential for the construction of confidence sets discussed in the next section.While confidence sets constructed on the basis of permutational methods will have joint coverage of 1 − , without balance the individual coverage probabilities of each interval will differ, perhaps substantially. 15

Confidence sets and multiple testing
The multiple testing problem extends to the construction of simultaneous confidence intervals or a "confidence set".Let the parameters of interest be with associated confidence intervals [ , ], so that forms a confidence set.Similar to the FWER, we want appropriate control of the coverage of the 100(1 − )% confidence set such that the process produces confidence sets with the property: we refer to this as 'family-wise coverage', which we use analogously to 'simulatanous coverage' used in other contexts.If we construct 100(1 − )% confidence intervals independently then the probability that at least one interval in the set excludes the true value can significantly exceed .For Bonferroni, an obvious modification is to instead estimate 100(1 − ∕ )% confidence intervals to acheive a family-wise coverage of 100(1 − )%.There have been some attempts to construct exact confidence sets for parameters analytically based on the stepdown procedure. 15For example, Guilbaud, 42 extending the proposal of Hayter and Hsu, 43 uses the acceptance/rejection of null hypotheses by the stepdown procedure as a basis of determining upper or lower limits of confidence intervals if we conclude they are strictly negative or postitive, respectively.However, these procedures can only provide information on the upper or lower bound respectively -the other end of the interval is infinity -so they provide little extra information on the extent of sampling variation beyond the p-value.
As an alternative, consider for a moment, a single parameter 1 .Its 100(1 − )% confidence interval is [ 1 , 1 ]: for any value * 1 inside this interval the null hypothesis 1 ∶ 1 = * 1 will not be rejected in favour of the two-sided alternative 1 ′ ∶ 1 ≠ * 1 at the level.The question is then how to find the values of 1 and 1 efficiently.One could iteratively perform a series of permutation tests to identify the limits as 1 = sup{ * 1 ∶ do not reject 1 = * 1 } and 1 = inf{ * 1 ∶ do not reject 1 = * 1 }.However, this procedure is inefficient, particularly when testing multiple parameters: if there are permutations per test and outcomes, then for each increment in we must calculate permutation test statistics and perform the desired correction.Moreover, since the test statistic and its permutational distribution depends on the values of the other null hypotheses being tested, a very large number of combinations of values of the parameters must be tested to ensure we have identified with reasonable certainty the limits of the confidence set.
Garthwaite and Buckland 44 developed a method for searching for confidence interval endpoints efficiently, which Garthwaite 45 later adapted for use with permutation tests.Their method is based on the search process devised by Robbins and Munro, 46 who developed a stochastic approximation procedure to find the -quantile of a particular distribution.Multivariate Robbins-Monro processes follow the same procedures as their univariate equivalents. 47For our multiple testing scenario the upper limits to the confidence set correspond to where all hypotheses ∶ = for = 1, ..., are all rejected in favour of the two-sided alternative with a FWER of but for any smaller values of not all hypotheses are rejected, and equivalently for the lower limits.Rabideau et al 48,49 have also independently proposed this method for confidence interval estimation for cluster randomised trials, although not in the context of multiple testing.
For each method, at the th step of steps total, we have estimates of the upper confidence interval limits of our parameters We generate the set of test statistics ( )| = , which correspond to the null hypotheses ∶ = .We then generate a single permutation of a permutation test for the same hypotheses abs( ( )) = .Each method then defines a procedure for determining whether to reject these hypotheses or not, which are described in the preceding sections.For example, with the Romano-Wolf stepdown procedure: reject hypothesis )) < abs( |2| ( )) otherwise do not reject any further hypotheses and stop, and so forth.
The estimates of the upper limits are updated based on the single permutation draw as (we drop the subscript = for ease of notation, but the test statistics are evaluated at this value for each iteration): where is the "step length constant".With no correction and with Romano-Wolf * = , for Bonferroni * = ∕ , and for Holm * = ∕ for |1| , * = ∕( − 1) for |2| , and so forth.Similarly for the lower limits, the updating rule is: The step length constants are = ( − ̂ ) and = ( ̂ − ) for the upper and lower limits, respectively, where ̂ is a point estimate of the parameter and: where is the -quantile of the standard normal distribution.The algorithm proceeds for a pre-selected number of iterations; in the simulations in the subsequent section we have used 2,000 iterations.A sensible starting value for this algorithm is the approximate uncorrected confidence interval limits, for example, for the upper limit , 0 = ̂ +2 where is the standard error of from the univariate model.

Computation
An R package developed by the authors to execute the analyses described in this paper is available from CRAN as crctStepdown (version 0.2.1 at the time of writing) including implementations of the Romano-Wolf, Holm, and Bonferroni methods for correcting p-values and confidence sets using permutation-based tests.

Methods
We conduct a simulation-based study to examine the FWER, family-wise coverage, and efficiency of the procedures outlined in the previous sections for cluster randomised trials.We compare the following procedures: 1.A 'naive' no correction approach using the reported standard errors and test statistics from the output of the lme4 package for R. 95% confidence intervals for each parameter were constructed as ̂ + ∕ − 1.96 × .
2. No correction with p-values and confidence sets derived from permutation based tests.
3. The Bonferroni method using permutation based tests.
4. The Holm method using permutation based tests.
5. The Romano-Wolf method using permutation based tests.
For methods 2-5 we use both the weighted and unweighted test statistic resulting in nine methods.For the Bonferroni and Holm methods we only use permutation-based inference rather than the perhaps more standard approach of adjusting p-values reported by mixed model fitting software.Model-based inference can fail to have nominal FWERs for reasons other than multiple testing, such as biases arising from small numbers of clusters, which would further complicate interpretation of the results.We include a comparison with methods 1 and 2 to illustrate this issue in our context.

Data generating processes
We use three different data generating processes of cluster randomised trials, described below.We opt for specific scenarios of rising complexity to examine the performance of the nine different methods (including both unweighted and weighed versions of the permutation-based methods).All outcomes are simulated and modelled using exponential-family models.In all simulations we set the number of individuals per cluster to 20 and simulate either seven or 14 clusters per arm.The choice of number of clusters is informed by two key considerations.First, the simulations take a very long time to run given the number of GLM models required to be estimated for the permutation tests and search procedures (for three outcomes and 10,000 iterations we require 90 million models), and so we aimed to choose the smallest number that would provide the desired inference.Second, we wanted to include scenarios where there was likely small sample bias in 'standard' non-permutation based estimators of standard errors due to the low number of clusters, and one where such biases were likely minimal.Previous literature on cluster trials suggests small sample biases are likely minimal at 14 clusters or more per arm, but present with seven clusters per arm [REF], although permutation-based methods provide exact inference at any sample size.We provide estimates of FWER without correction and with non-permutation based estimators to examine whether there are likely small sample biases.However, we recognise that 14 clusters per arm may still be considered 'small'.The treatment effect parameters for each simulation are a vector, , with length equal to the number of outcomes and with different combinations of either 0 or 1, allowing for when all treatment effects are zero and when only a subset are.
(1) Two-arm, parallel cRCT, two outcomes The first simulation data generating process ('model (1)') represents a two arm parallel cluster trial with two outcomes measured once in the post-intervention period.Both outcomes are continuous, Gaussian variables for = 1, 2. This model is intended to examine the effect of correlation, which we model at the individual and cluster levels.For individual in cluster : where are intercept parameters, is an indicator for whether the cluster is treated or not, and , are cluster level random effect modelled as: The parameters and are correlation parameters at the individual and cluster levels, respectively with and the standard deviation of the individual-level outcomes and cluster-random effect terms, respectively.Clusters are assigned in a 1:1 ratio with seven or 14 clusters per arm and 20 or 10 individuals per cluster.We set = 1 and consider both = (0, 0) and = (0, 0.5) to compare the FWER under different combinations of true null hypotheses.We set 2 = 1 and 2 = 0.05, which gives a marginal intraclass correlation coefficient (ICC) (ICC = ( , )∕ ( , )) of 0.05.We also set = and examine a range of values.We do not report outcomes using the weighted test statistic with this example as it is proportional the unweighted test statistic as both models are Gaussian with identity link, so there will be no difference in performance.
(2) Two-arm, parallel cRCT, two differently distributed outcomes For the next set of simulations ('model (2)') we consider a parallel cluster trial with two outcomes measured once in the postintervention period.Simulation parameters are as the previous example, unless stated below.The first outcome is specified as Poisson distributed: 1, ∼ Poisson(exp( 1 + 1 + 1, )) and the second outcome as Gaussian distributed: where the random effects are specified as in Equation ( 21) with = 0. We again set = 1 and consider both = (0, 0) and = (0, 0.5).The ICC for non-linear models depends on the realised values of the covariates and the parameter values and so will differ between simulations.We again choose 2 = 0.05, which gives a range of ICCs between approximately 0.01 and 0.2 for the Poisson model and 0.05 for the Gaussian model. (

3) Two-arm parallel cRCT with baseline measures, three outcomes
We finally extend the parallel cluster trial model ('model (3)') to include baseline measures, which incorporates a temporal dimension and hence more complex covariance structure.The trial includes seven clusters in each arm, with half receiving the intervention in the second time period.We simulate three outcomes, with index representing time period: ∼ Bernoulli(logit where, now, equals one if the cluster has the intervention in time period and zero otherwise and is a fixed effect for the second time period.We use an auto-regressive specification for , to facilitate incorporation of correlation between outcomes.In particular, for ≠ ′ .The random effects have a multivariate normal specification as before zero correlation.We maintain the same number of individuals per cluster.We set = −1 and = 1 for all = 1, 2, 3. We vary the choice of as either (0, 0, 0) or (0, 0.5, 0); as with the previous set of simulations we do not consider a completely exhaustive set of permutations of simulation parameters.We set = 0.7.

Simulation methods
Each set of simulations is run 10,000 times.We note the Monte Carlo error will be moderately higher than expected due to variation arising from the permutation tests, confidence set search procedure, and simulations.We use 1,000 iterations for the permutation test p-values and 2,000 steps for the search procedure as these produced stable values for these simulations (although we note that for more outcomes longer runs were often required for the confidence interval search procedure for it to reach a stable equilibrium).Point estimates of parameters were obtained from univariate generalised linear mixed models estimated with the R package lme4 for models (1) and ( 2), we similarly obtained estimates of variance parameters from these models for the weighted test statistics.For example (3) we obtained parameter estimates from a generalised linear model with no random effects given the lack of widely available software for estimating autoregressive random effects models; weighted test statistics were generated using a covariance matrix created with the values of , , and used in the data generating process.

Evaluation
We estimate the FWER for ≤ 0.05, which has a nominal rate of 5%, and also estimate coverage of 95% confidence sets.We also estimate the mean 95% confidence interval width for each parameter to compare the efficiency of the procedures.

Results
Figure 1 shows the family wise error rates and coverage from model (1) with the permutation-based methods for different levels of the correlation coefficient.We exclude the 'naive' approach from these plots as it has non-nominal marginal type I error and coverage without correction (see below).All three corrections ensured nominal error rates at lower levels of correlation ( ≤ 0.6), however at higher levels of correlation Bonferroni was conservative.Without correction, the FWER declined as the correlation increased but was still approximately 0.08 at = 0.8.Only the Romano-Wolf and Holm methods ensured nominal family wise coverage at any level of correlation.Figure 2 shows the 95% confidence interval width for the four methods for the same model.For the two methods with nominal or near nominal error rates (Romano-Wolf and Holm), Romano-Wolf was moderately more efficient with narrower confidence intervals.The other methods displayed approximately constant confidence interval widths, with their respective widths reflecting the coverage results.
Table 1 reports the results from model (2).Under all tested conditions the FWER was approximately nominal in all scenarios for all multiple testing corrections when both parameters were zero.However, when only one parameter was zero, Bonferroni was conservative as expected with a FWER ≈ 0.025 at = 0.05 which was also reflected in coverage being greater than the nominal rate.Romano-Wolf and Holm had nominal rates in all scenarios.Confidence interval width followed the same pattern as model (1) with Romano-Wolf generally being more efficient.Use of the weighted test statistic did not make much difference qualitatively with some confidence intervals larger and some smaller.Without correction, using a permutation test approach resulted in a FWER of ≈ 0.10 when there were two true null hypotheses, as expected.Using the naive output of lme4 resulted in even worse performance due to the small sample bias in the test statistics, also as expected 17,16 , with FWER around 30-50% higher.Table 2 reports the results from the two outcome trial simulations with a larger 20 clusters per arm.The same pattern is observed as the smaller two-arm experiments, but the small sample bias using the naive method is reduced.To illustate the computational efficiency of the procedure, a single run of the function to derive p-values and confidence sets took between 1 and 10 seconds depending on the number of outcomes and size and number of the clusters.
Table 3 shows the results from the three outcome simulations with baseline measures.Despite the more complex covariance structure and imbalance in the number of observations between control and treatment conditions, Holm and Romano-Wolf maintained nominal FWER and coverage.Again, Romano-Wolf was the most efficient correction.Its confidence intervals were between 10 and 50% larger than the uncorrected results.We also note that the uncorrected approach maintain marginally nominal rates for each univariate outcome in all scenarios.

APPLIED EXAMPLE
To provide a real-world example of the the use of the methods proposed in this article, we re-analyse a cluster randomised trial of a financial incentive to improve workplace health and wellbeing in small and medium sized enterprises (SME) in the United Kingdom.The original trial was relatively complex and included four trial arms with pre-and post-intervention observations comprising a standard control condition (no incentive), two treatment conditions (high and low incentive), and a second control arm with no baseline measures also with no incentive.The trial enrolled 152 clusters (SMEs), which were randomly allocated

Method
Test statistic FWER Coverage CI width in an equal ratio to each of the trial arms; 100 SMEs completed the trial.Up to 15 employees were sampled and interviewed from each cluster.The full protocol is published elsewhere 50 (at the time of writing the results from the trial are under review).

Outcomes
A single primary outcome was specified in the protocol, which was the question "Does your employer take positive action on health and wellbeing?".However, given the potential lack of insight it might provide into the functioning of the intervention, several secondary outcomes were specified to capture the "causal chain" between intervention and employee health and wellbeing.
For each of three separate health categories (mental, musculoskeletal, and lifestyle health) employees were asked: 1. whether the employer provided information in this area; 2. whether the employer had provided activities and services in this area; 3. whether the employee had made a conscious effort to improve in this area; 4. whether the employee had attended any groups or activities in this area at work; 5. whether the employee had attended any groups or activities in this area outside of work.
for a total of 15 outcomes.

Re-analysis
The original analysis of the trial took a Bayesian approach.The Frequentist re-analysis we conduct here is principally for illustrative purposes, and so we only take a subset of the data and simplify some of the outcomes.In particular, we take only the main  control arm and the high incentive intervention arm to estimate the effect of the high incentive.We focus on the set of secondary outcomes listed above, which we collapse into five separate outcomes; whether the employer provided information across all three health areas, and then whether there was a positive response for any of the health areas for the remaining outcomes, for a total of five outcomes.All outcomes are modelled using a Bernoulli-logistic regression model, following the notation above, with = 0 for baseline and = 1 for post-intervention: We used 4,000 permutation test iterations and 10,000 steps in the confidence interval search procedure.For illustration, this re-analysis took eight minutes on a desktop PC with Intel Core i7-9700K with 16GB RAM and Windows 10.

Results
Table 4 shows the results of an analysis using the naive method (a model-based analysis using lme4 with no multiple testing correction), alongside 'corrected' results using the Bonferroni, Holm, and Romano-Wolf methods.We first note that the convergence of the confidence interval search procedure was highly sensitive to the starting values.The algorithm could take a long time to find the right part of the parameter space, particularly since the search distance decays with the number of iterations.

FIGURE 3
Example of the confidence interval search for the lower confidence interval limit using the Holm correction for the cluster trial example.
Convergence can be assessed graphically; the chain 'osciallates' around a value at convergence compared to continuous gradual climbing or descending, Figure 3 shows an example.We make several observations about the results.The uncorrected analysis would suggest there is likely good evidence that the intervention improved employer provision of information and activities and services, and increased employee taking part at work.However, this conclusion might contradict our understanding of the causal processes since it would seem contradictory for employees to make more effort but not report making more effort.The results corrected for multiple testing using Romano-Wolf appear to be more consistent in that employers appeared to make more effort but the employees did not take up the new services with small and negative effects now shown to be compatible with the data for the latter three outcomes.The effect of the intervention is also more uncertain than suggested by the uncorrected confidence intervals.In particular, the confidence intervals under the corrected methods, which are based on exact permutation tests, are not symmetric for several outcomes, unlike under the uncorrected approach.So, smaller effect sizes, particularly for the first two outcomes, are more plausible than the uncorrected method would suggest.

DISCUSSION
We have proposed how one can estimate Frequentist statistics for cluster randomised trials with multiple outcomes that control for the FWER and coverage of simultaneous confidence intervals.These methods also apply generally in any scenario where multiple tests from GLMMs are used.Where a correction for multiple testing is desired in a cluster trial setting, the Romano-Wolf approach would be recommended as it maintains nominal rates in a variety of scenarios including with differing levels of between-outcome correlation, cluster and individual sample sizes, and covariance structures, it is also more efficient than the alternatives.Where a multiple testing correction is not desired, permutation-based methods are likely to provide marginally nominal error rates and so are also recommended when other methods may exhibit biases.We also compared a weighted test statistic based on the score statistic proposed by Romano and Wolf 3 , but did not find this provided any obvious benefit over an unweighted sum of generalised residuals.We do note, however, that while these methods do provide the desired properties, many regulatory agencies, including the FDA, do not (yet) accept statistics derived from re-sampling based methods, which may limit their application.Researchers may also consider other methods if multiple testing corrections are required such as 'intersection-union' testing. 51here have been no previous comparisons of multiple testing corrections in the context of cluster randomised trials as far as we are aware, but our results generally reflect those from other settings.For example, Ozenne et al 28 compared several multiple testing corrections for linear latent variable models, including a resampling-based procedure, although not Romano-Wolf.They showed this method maintained strong control of the FWER and was more efficient than Bonferroni.Vickerstaff et al 27 considered the question for individual level randomised trials with a linear model, and suggested that Hommel's 24 and Hochberg's 23 methods were marginally more efficient than Bonferroni or Holm, but they did not include a permutation-based procedure, not non-linear models.Alberton et al 29 also shows permuation-based methods to outperform other corrections in the context of analysing brain imaging data.
We have examined methods from a range of previous work including: permutation tests for cluster trials, 32,52 univariate methods for corrections for multiple testing that use permutation tests, 3,15,22 and procedures for estimating confidence interval limits based on permutation tests. 44,45,48,49Altogether the proposed methods can deal with several issues that are common to cluster randomised trials as they allow for multiple outcomes, they can incorporate other features such as restricted randomisation methods, which are often used in trials with a small number of clusters.Watson et al, 16 Li et al, 37,18 and Zhou et al 35 discuss permutation tests with restricted randomisation methods.Permutation-based methods provide exact inference when there are a small number of clusters, which can lead to non-nominal error rates of standard test procedures and hence confidence intervals with non-nominal coverage.Several small-sample corrections exist that can provide nominal error rates with a small number of clusters, 16,17 however there is no obvious way these would be incorporated efficiently into a multiple testing procedure.After conducting the analyses presented in this article, an updated and more efficient version of the confidence interval search procedure was brought to our attention 53 .This method improves the efficiency of the search procedure, and requires fewer steps by making larger steps on average, although would not affect the results presented here.We aim to incorporate the algorithm in our R package implementing these methods (crctStepdown).
The tools developed for this article can be incorporated at the design stage of a cluster trial to determine power using simulation-based approaches.These methods are useful for the analysis of cluster trials with multiple outcomes and the treatment effect parameters from the linear predictors of multiple univariate models, however, it is not clear how or if they could be applied to cluster trials with multiple arms.In multi-arm trials there may be one or more outcomes, but clusters may receive different 'doses' or variants of the treatment.There are a variety of treatment effects and null hypotheses of interest including pairwise comparisons between arms and a global joint null, which can be estimated from a single univariate model with indicators for each arm. 16,35Pairwise null hypotheses in these models do not make statements about the value of the treatment effects in arms outside the pair under comparison as it is left unspecified, so it is not obvious then how a permutation test could be conducted for the pairwise comparison that is invariant to randomised allocation.The multiple treatment effects of interest in a multi-arm study clearly fall in the realm of multiple testing.Nevertheless, we believe the methods proposed in this article will be a useful tool for the analysis of cluster randomised trials in many cases.

FIGURE 1
FIGURE 1 Family wise error rate and coverage under model (1) for four methods with different levels of the correlation coefficient .The dashed line shows the nominal rates and the dotted lines approximate Monte Carlo confidence intervals.'None' refers to no correction.

FIGURE 2 95%
FIGURE 2 95% Confidence interval width model (1) for four methods with different levels of the correlation coefficient .'None' refers to no correction.

TABLE 1
Results of simulation experiments with two outcomes, seven clusters per arm, and with 10,000 iterations each.Each iteration used 1,000 permutations for the permutation test and 2,000 iterations in each of the lower and upper confidence interval search processes.underlined results for FWER and coverage show those within approximated 95% Monte Carlo confidence interval of the nominal value.

TABLE 4
Results from re-analysis of the workplace wellbeing trial.Results are log odds-ratios, 95% confidence intervals, and p-values.Permutation test p-values used 4,000 iterations, and the confidence interval search procedure used 10,000 steps for Bonferroni, Holm, and Romano-Wolf (RW) methods.The 'None (Naive)' method refers to a model-based analysis using lme4 with no multiple testing correction.