Sample Size Estimation using a Latent Variable Model for Mixed Outcome Co-Primary, Multiple Primary and Composite Endpoints

Mixed outcome endpoints that combine multiple continuous and discrete components to form co-primary, multiple primary or composite endpoints are often employed as primary outcome measures in clinical trials. There are many advantages to jointly modelling the individual outcomes using a latent variable framework; however, to make use of the model in practice we require techniques for sample size estimation. In this paper we show how the latent variable model can be applied to the three types of joint endpoint and propose appropriate hypotheses, power and sample size estimation methods for each. We illustrate the techniques using a numerical example based on the four dimensional endpoint in the MUSE trial and find that the sample size required for the co-primary endpoint is larger than that required for the individual endpoint with the smallest effect size. Conversely, the sample size required for the multiple primary endpoint is smaller than that required for the individual outcome with the largest effect size. We show that the analytical technique agrees with the empirical power from simulation studies. We further illustrate the reduction in required sample size that may be achieved in trials of mixed outcome composite endpoints through a simulation study, and find that the sample size primarily depends on the components driving response and the correlation structure, and much less so on the treatment effect structure in the individual endpoints.


Introduction
Sample size estimation plays an integral role in the design of a study. The objective is to determine the minimum sample size that is large enough to detect, with a specified power, a clinically meaningful treatment effect. Although it is crucial that investigators have enough patients enrolled to detect this effect, overestimating the sample size also has ethical and practical implications. Namely, in a placebo-controlled trial, more patients are subjected to a placebo arm than is necessary, thereby withholding access to potentially beneficial drugs from them and delaying access for future patients. [1][2][3] Furthermore, it results in longer, more expensive trials, using resources that could be allocated elsewhere.
One vital aspect of sample size determination is the primary endpoint. Typically this is a single outcome, however in some instances there may be multiple outcomes of interest and so various combinations of these outcomes can be selected as the primary endpoint, depending on the hypothesis of interest. Assuming we have three outcomes of interest ν_1, ν_2 and ν_3, one option is a co-primary endpoint, which takes the form of the multivariate endpoint ν_1 ∩ ν_2 ∩ ν_3. This means that an intervention is deemed to be effective overall if it is shown to be effective in each of ν_1, ν_2, and ν_3. Alternatively multiple primary endpoints may be of interest, which take the multivariate form ν_1 ∪ ν_2 ∪ ν_3, where an intervention is deemed effective if it is shown to be effective in at least one of ν_1, ν_2, or ν_3. Another possibility is a composite endpoint, involving some function that maps the multivariate outcome to a univariate outcome for inference, for example ν_1 + ν_2 + ν_3. In this case the outcomes within the composite may be assigned equal or differing degrees of relevance depending on clinical importance. 4 Alternatively, the composite endpoint may combine outcomes by labeling patients as 'responders' or 'non-responders' based on whether they exceed predefined thresholds in each of the outcomes. For instance, we let a response indicator S = 1 if ν_1 ≤ η_1, ν_2 ≤ η_2, and ν_3 ≤ η_3, where η = (η_1, η_2, η_3) denotes the response cut-points.
Note that the composite case is distinct from the others in that it combines the parameters and hence test statistics for each outcome into one, rather than these remaining separate for each outcome. This will have implications for sample size estimation.
For each of these endpoints, the individual outcomes may be a mix of multiple continuous, ordinal, and binary measures. One possible way to jointly model the outcomes is using a latent variable framework, arising in the graphical modeling literature, in which discrete outcomes are assumed to be latent continuous variables subject to estimable thresholds and modeled using a multivariate normal distribution. 5,6 By employing this framework we can take account of the correlation between the outcomes, improve the handling of missing data in individual components and potentially increase efficiency. Furthermore, in the case of multiple primary outcomes, it may reduce the severity of multiple testing corrections required by accounting for correlation between endpoints.
A barrier to adopting these techniques is a lack of consensus on sample size determination. A recent and comprehensive overview of the existing literature for sample size calculation in clinical trials with co-primary and multiple primary endpoints is provided by Sozu et al. 7 The review found many proposals for power and sample size calculations for multiple continuous outcomes. In the co-primary case, some of these were based on assuming that the endpoints were bivariate normally distributed, 8,9 and extended for the case of more than two endpoints. 10,11 Other work focused on testing procedures 12,13 and controlling the type I error rate. 14-17 Similar ideas were investigated for multiple primary endpoints. 14,[17][18][19] Approaches to sample size estimation for composite endpoints have focused primarily on the case of multiple binary components. [20][21][22][23][24][25] In the case of binary co-primary endpoints, five methods of power and sample size calculation based on three association measures have been introduced. 26 Additionally, sample size calculation for trials using multiple risk ratios and odds ratios for treatment effect estimation is discussed by Hamasaki et al, 27 and Song 28 explores co-primary endpoints in non-inferiority clinical trials. Consideration has also been given to the case where two co-primary endpoints are both time-to-event measures where effects are required in both endpoints, [29][30][31] and at least one of the endpoints. 32,33 Furthermore, composites comprised of time-to-event measures are common, in which the composite reflects a time-to-first-event variable. 34 Sample size estimation in this case has been considered by Sugimoto et al. 35 The mixed outcome setting has received substantially less consideration. One paper considers overall power functions and sample size determinations for multiple co-primary endpoints that consist of mixed continuous and binary variables. 36 They assume that response variables follow a multivariate normal distribution, where binary variables are observed in a dichotomized normal distribution, and use Pearson's correlations for association. A modification was suggested by Wu and de Leon, 37 which involved using latent-level tests and pairwise correlations, and provided increased power. These methods focus on the co-primary endpoint case, where effects are required in all outcomes. The case of multiple primary or composite endpoints where the components are measured on different scales has not been considered, each of which will require distinct hypotheses. In practice, if a mixed outcome composite is selected as the primary endpoint in a trial then the sample size calculation may be based on an overall binary endpoint or collapsed to form multiple binary endpoints; however, this will result in a large loss in efficiency. 38 In this article we build on the existing work for co-primary continuous and binary endpoints to include any combination of continuous, ordinal, and binary outcomes for co-primary, multiple primary, and composite endpoints. We propose a framework based on the same latent variable model and show how it may be tailored to each of the three endpoints to facilitate sample size estimation. The article will proceed as follows: in Section 2 we introduce the latent variable model, detailing how it can be used in each context, and specify hypothesis tests for each of the three combinations of mixed outcomes; in Section 3 we propose power calculations and sample size estimation techniques in each case; in Section 4 we illustrate the methods on a four dimensional endpoint consisting of two continuous, one ordinal and one binary outcome using a numerical example based on the MUSE trial; 39 and in Section 5 we simulate the empirical power for each test and the FWER for the union-intersection test.
We conclude with a discussion and recommendations for practice in Section 6, and introduce user-friendly software and documentation for implementation in Section 7.

Latent variable framework
Let n_T and n_C represent the number of patients in the treatment group and the control group respectively, and let K be the number of outcomes measured for each patient. Let Y_Ti = (Y_Ti1, …, Y_TiK)^T, i = 1, …, n_T, be the vector of K responses for patient i on the treatment arm and Y_Ci = (Y_Ci1, …, Y_CiK)^T, i = 1, …, n_C, the vector of K responses for patient i on the control arm. Without loss of generality, the first 1 ≤ k ≤ k_m elements of Y_Ti and Y_Ci are observed as continuous variables, the next k_m < k ≤ k_o are observed as ordinal and the remaining k_o < k ≤ K are observed as binary. For instance, for a three dimensional endpoint with one continuous, one ordinal and one binary measure, k_m = 1, k_o = 2, and K = 3. We use the biserial model of association by Tate, 40 which is based on latent continuous measures manifesting as discrete variables. Formally, we say that Y_Ti and Y_Ci have latent variables Y*_Ti and Y*_Ci respectively, where Y*_Ti ∼ N_K(μ_T, Σ_T) and Y*_Ci ∼ N_K(μ_C, Σ_C), with μ_T = (μ_1T, …, μ_KT), μ_kT = μ_kT0 + μ_kT1 x_kT1 + … + μ_kTp x_kTp, where x_kT1, …, x_kTp denote the p covariates included in the model for outcome k. Likewise μ_C = (μ_1C, …, μ_KC) are the corresponding quantities for the control arm. Then for k ≠ k′, 1 ≤ k < k′ ≤ k_m, let Var(Y_Tik) = σ²_Tk, Var(Y_Cik) = σ²_Ck and Corr(Y_Tik, Y_Tik′) = ρ_Tkk′, Corr(Y_Cik, Y_Cik′) = ρ_Ckk′, where ρ_Tkk′ and ρ_Ckk′ are the association measures between the endpoints. For k_m < k ≤ K, Var(Y*_Tik) = Var(Y*_Cik) = 1. The latent variables can be related to the observed variables by

Y_ik = w_k  if  τ_kw_k < Y*_ik ≤ τ_k(w_k+1),

for k_m < k ≤ K, where the τ_k are the cut-points for outcome k. We set τ_k0 = −∞, τ_k(w_k+1) = ∞ and the intercepts μ_kT0 and μ_kC0 equal to zero for k_m < k ≤ k_o in order to estimate the cut-points. Additionally, τ_k0 = −∞, τ_k1 = 0, τ_k2 = ∞ for k_o < k ≤ K so that the intercepts can be estimated for the outcomes observed as binary. The mixed outcomes are then combined as follows.
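As a concrete illustration of this latent structure, the following sketch simulates one arm of the model with one continuous, one ordinal and one binary outcome. It is written in Python rather than the R used for the paper's computations, and every parameter value is an illustrative assumption, not a quantity from the paper:

```python
# Sketch of the latent variable structure (illustrative parameters, one arm):
# K = 3 outcomes with k_m = 1 continuous, one ordinal (3 levels), one binary.
import numpy as np

rng = np.random.default_rng(1)
n = 500
mu = np.array([0.5, 0.3, 0.2])                   # latent means; unit variance for discrete
Sigma = np.array([[1.0, 0.4, 0.3],
                  [0.4, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
y_star = rng.multivariate_normal(mu, Sigma, size=n)  # latent Y*

y_cont = y_star[:, 0]                            # 1 <= k <= k_m: observed directly
tau = np.array([-0.5, 0.5])                      # estimable cut-points for the ordinal outcome
y_ord = np.searchsorted(tau, y_star[:, 1])       # observed levels 0, 1, 2
y_bin = (y_star[:, 2] > 0).astype(int)           # binary: cut-point fixed at 0
```

In an analysis the thresholds and latent means would be estimated jointly by maximum likelihood; here they are fixed purely to show how the discrete outcomes arise from the latent Gaussian.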

Co-primary endpoint
In this case, a treatment must be shown to be effective as measured by each of the outcomes in order to be deemed effective overall. We generalize previous work for mixed continuous and binary outcomes to include ordinal outcomes, as shown below. 36,37 In many clinical trials the hypothesis of interest is based on superiority, namely that the proposed treatment will perform better than the control treatment. The null hypothesis is that the difference in treatment effects between the treatment arm and control arm is less than or equal to zero. This is straightforward to formalize in the case of one endpoint but less so when there are multiple co-primary endpoints, particularly when they are measured on different scales. The hypothesis of interest is

H_0: ∃ k s.t. π_Tk − π_Ck ≤ 0  vs.  H_1: π_Tk − π_Ck > 0 ∀ k, (1)

where π_Tk and π_Ck are the effects of the intervention in the treatment and control arm respectively. For k_o < k ≤ K we can specify π_Tik = P(Y_Tik = 0) = P(Y*_Tik < 0) and π_Cik = P(Y_Cik = 0) = P(Y*_Cik < 0) for the treatment and control group.
We can generalize this assumption to account for the ordinal endpoints based on the fact that, for k_m < k ≤ k_o, π_Tik = P(Y_Tik = w_k) = P(τ_kw_k < Y*_Tik < τ_k(w_k+1)). The definition of treatment effect for ordinal outcomes may be modified to include multiple ordinal levels by selecting the appropriate τ thresholds. As the latent means are estimable by maximum likelihood, μ*_Ti1 = Φ^−1(π_Ti1), …, μ*_TiK = Φ^−1(π_TiK) in the treatment group and μ*_Ci1 = Φ^−1(π_Ci1), …, μ*_CiK = Φ^−1(π_CiK) in the control group.
We can proceed by specifying that the hypothesis in (1) holds if and only if the hypothesis

H*_0: ∃ k s.t. δ*_k ≤ 0  vs.  H*_1: δ*_k > 0 ∀ k, (2)

holds, where δ*_k = μ̄*_Tk − μ̄*_Ck, μ̄*_Tk = (1/n_T) Σ_{i=1}^{n_T} μ*_Tik and μ̄*_Ck = (1/n_C) Σ_{i=1}^{n_C} μ*_Cik. The maximum likelihood estimates of μ̄*_Tk and μ̄*_Ck can be used for a test of H*_0, and the variance of this test statistic can be obtained using the inverse of the Fisher information matrix.

Multiple primary endpoint
Multiple primary endpoints conclude that a treatment is effective if it is shown to work in at least one of the outcomes. We would expect the required sample size to be reduced compared with the co-primary endpoint case, which requires power to detect treatment effects in all outcomes. We can allow for sample size estimation for multiple primary endpoints as follows.
The hypothesis of interest, accounting for the fact that a significant effect in only one outcome is required, is shown below.

H*_0: δ*_k ≤ 0 ∀ k  vs.  H*_1: ∃ k s.t. δ*_k > 0

The differences in latent means δ*_k = μ*_Tk − μ*_Ck and their variances are estimated using the maximum likelihood estimates and Fisher information matrix, as before.

Composite endpoint
A review conducted by Wason et al 41 showed that composite responder endpoints are widely used and identified many clinical areas in which they are common, such as oncology, rheumatology, and cardiovascular and circulatory disease. The latent variable framework may be used to model the underlying structure of these mixed outcome composite endpoints to greatly improve efficiency. 38 The joint distribution of the continuous, ordinal, and binary outcomes is modeled using the latent variable structure as before. However, in this case the endpoint of interest is a composite responder endpoint, and so the required quantity is some function of the probability of response in the treatment group, p_T, and in the control group, p_C.
For instance, an overall responder index S_i can be formed for patient i, where S_i = 1 if Y_i1 ≤ η_1, …, Y*_iK ≤ η_K and S_i = 0 otherwise, where the quantities (η_1, …, η_K) are predefined responder thresholds. Generalizations where response only requires a certain number of the components to meet the thresholds are possible, but involve more complex sums. Note that this definition of response is distinct from that commonly found in composites formed from survival endpoints or the binary composites typical in cardiovascular studies. We can specify p_iT and p_iC, the probabilities of response for patient i in the treatment arm and control arm respectively, as

p_iT = P(Y_Ti1 ≤ η_1, …, Y*_TiK ≤ η_K | θ),  p_iC = P(Y_Ci1 ≤ η_1, …, Y*_CiK ≤ η_K | θ), (5)

where θ is the vector of model parameters, and we assume that p̂_T ∼ N(δ_T, σ²_δT) and p̂_C ∼ N(δ_C, σ²_δC). As in the case of co-primary and multiple primary endpoints, the assumptions allow us to estimate the latent means μ*_{k_m+1}, …, μ*_K for the observed discrete components using the model parameters.
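The response probabilities in (5) are multivariate normal probabilities of the form Φ_K(η; μ, Σ). A minimal sketch of their evaluation, in Python with scipy rather than the paper's R implementation, with purely illustrative thresholds and parameters:

```python
# Sketch: evaluating a response probability P(Y* <= eta componentwise)
# as a K-dimensional multivariate normal CDF. All inputs are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def response_prob(eta, mu, Sigma):
    """Phi_K(eta; mu, Sigma) = P(Y*_1 <= eta_1, ..., Y*_K <= eta_K)."""
    mvn = multivariate_normal(mean=np.asarray(mu), cov=np.asarray(Sigma))
    return float(mvn.cdf(np.asarray(eta)))

# Two-outcome example: lower latent means in the treatment arm (improvement)
# increase the probability of meeting both responder thresholds.
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
p_T = response_prob([0.0, 0.0], [-0.5, -0.5], Sigma)
p_C = response_prob([0.0, 0.0], [0.0, 0.0], Sigma)
delta = p_T - p_C
```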

In the mixed outcome composite endpoint setting, note that although we are exploiting the latent multivariate Gaussian structure for efficiency gains, we are ultimately still interested in a one dimensional endpoint, such as the difference in response probabilities between the treatment and control arms of the trial. This is distinct from the co-primary and multiple primary endpoint cases, where the overall hypothesis test must be based on some union or intersection of the hypotheses for the individual outcomes. For the composite endpoint we can formulate the hypothesis in terms of p_T and p_C as in (5). For sample size estimation, we require the distribution of δ = p_T − p_C under H_1, which we can assume to be δ̂ ∼ N(δ_T − δ_C, σ²_δ). The hypothesis can therefore be stated as

H_0: δ* ≤ 0  vs.  H_1: δ* > 0, (6)

where δ* = δ*_T − δ*_C, δ*_T = Φ_K(η_1, …, η_K; μ*_T, Σ_T), δ*_C = Φ_K(η_1, …, η_K; μ*_C, Σ_C) and Φ_K(·; μ, Σ) is the K-dimensional multivariate normal distribution function with mean vector μ and covariance matrix Σ. Estimates of these quantities can be obtained using the maximum likelihood estimates of the model parameters, as in the co-primary and multiple primary endpoint settings, so that δ̂*_T = Φ_K(η_1, …, η_K; μ̂*_T, Σ̂_T) and δ̂*_C = Φ_K(η_1, …, η_K; μ̂*_C, Σ̂_C), where μ*_T is the K-dimensional vector of mean values in the treatment arm and μ*_C is the corresponding vector for the control arm. Using a Taylor series expansion, we can obtain the quantity σ²_δ using the fact that

Var(δ̂*) ≈ δ′^T Cov(θ̂) δ′,

where δ′ is the vector of partial derivatives of δ* with respect to each of the parameter estimates. We can obtain θ̂ and the covariance matrix Cov(θ̂) by fitting the model to pilot trial data.
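The Taylor series (delta method) step can be sketched numerically with a finite-difference gradient. This is not the authors' implementation: the parameterisation below (latent means only, correlations held fixed), the thresholds and the pilot covariance matrix are all illustrative assumptions:

```python
# Sketch of the delta-method variance Var(delta*) ~= grad' Cov(theta) grad,
# with the gradient obtained by central differences. Inputs are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def delta_star(theta, eta, Sigma_T, Sigma_C):
    """delta* = Phi_K(eta; mu_T, Sigma_T) - Phi_K(eta; mu_C, Sigma_C),
    with theta = (mu_T, mu_C) stacked and correlations held fixed."""
    K = len(eta)
    mu_T, mu_C = theta[:K], theta[K:]
    cdf = lambda mu, S: multivariate_normal(mean=mu, cov=S).cdf(np.asarray(eta))
    return cdf(mu_T, Sigma_T) - cdf(mu_C, Sigma_C)

def taylor_var(f, theta_hat, cov_theta, h=1e-3):
    """First-order (delta method) variance of f(theta_hat)."""
    p = len(theta_hat)
    grad = np.zeros(p)
    for j in range(p):
        e = np.zeros(p)
        e[j] = h
        grad[j] = (f(theta_hat + e) - f(theta_hat - e)) / (2 * h)
    return grad @ cov_theta @ grad

eta = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
theta_hat = np.array([-0.5, -0.5, 0.0, 0.0])   # (mu_T, mu_C), illustrative
cov_theta = 0.01 * np.eye(4)                   # stands in for a pilot-fit Cov(theta)
var_delta = taylor_var(lambda t: delta_star(t, eta, Sigma, Sigma), theta_hat, cov_theta)
```

In practice Cov(θ̂) would come from the inverse Fisher information of the fitted latent variable model, and the gradient would include the threshold, variance and correlation parameters as well.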

Power and sample size estimation

Co-primary endpoints
To construct the power function, we define the required quantities as follows. Let Ȳ_Tk − Ȳ_Ck and μ̂*_Tk − μ̂*_Ck denote the difference in sample means for the continuous and discrete outcomes respectively. We write δ_k = μ_Tk − μ_Ck, δ*_k = μ*_Tk − μ*_Ck, κ = n_C/n_T, and let z_α denote the (1 − α)100th standard normal percentile, where α is the prespecified significance level. We define the z score for the continuous outcomes as

Z_k = (Ȳ_Tk − Ȳ_Ck) / √(σ²_Tk/n_T + σ²_Ck/n_C),  1 ≤ k ≤ k_m,

and for the discrete outcomes as Z_k = (μ̂*_Tk − μ̂*_Ck)/se(μ̂*_Tk − μ̂*_Ck), k_m < k ≤ K, with the standard errors obtained from the Fisher information matrix. Letting Z†_k denote Z_k centred at its mean under H_1, a useful property of Z† = (Z†_1, …, Z†_K)^T is that it is asymptotically multivariate normal under regularity conditions. 11 The power function for the joint co-primary endpoints is as shown in (10) and hence can be approximated by (11).
For δ = (δ_1, …, δ_k_m, …, δ_k_o, …, δ_K), the overall power is

1 − β = P(Z_1 > z_α, …, Z_K > z_α | H_1), (10)

which, using the asymptotic multivariate normality of Z†, can be approximated by the K-dimensional multivariate normal probability

1 − β ≈ P(Z†_k ≤ δ_k/√(σ²_Tk/n_T + σ²_Ck/n_C) − z_α, k = 1, …, K), (11)

where for k_m < k ≤ K the standardized effect is the latent effect δ*_k with unit variances. Assuming n_T = n_C = n, and that the standardized effect sizes are equal across outcomes, it is possible to rearrange (11) to obtain a sample size formula in terms of n as shown below, 7

n = 2σ²(z_α + C_K)²/δ², (12)

where the sample size depends on the number of outcomes through C_K, the solution of

Φ_K(C_K, …, C_K; Γ) = 1 − β, (13)

with Γ the correlation matrix of Z†. Alternatively, we can input different values for n in (11) to achieve the required power.
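The power approximation and the search over n can be sketched as follows. The paper's own computations use R (pnorm, pmvnorm); this sketch uses Python with scipy instead, and the standardized effect sizes, correlation matrix Gamma and α are illustrative assumptions:

```python
# Sketch of the co-primary power approximation and sample size search,
# assuming 1:1 allocation and standardized effects; inputs are illustrative.
import numpy as np
from scipy.stats import multivariate_normal, norm

def copri_power(n, std_effects, Gamma, alpha=0.025):
    """Approximate overall power: Phi_K at sqrt(n/2)*delta_k/sigma_k - z_alpha,
    with Gamma the (assumed) correlation matrix of the test statistics."""
    z_a = norm.ppf(1 - alpha)
    upper = np.sqrt(n / 2) * np.asarray(std_effects) - z_a
    mvn = multivariate_normal(mean=np.zeros(len(std_effects)), cov=Gamma)
    return float(mvn.cdf(upper))

def copri_n(std_effects, Gamma, power=0.80, alpha=0.025):
    """Smallest per-arm n achieving the target overall power."""
    n = 2
    while copri_power(n, std_effects, Gamma, alpha) < power:
        n += 1
    return n

Gamma = np.array([[1.0, 0.5], [0.5, 1.0]])
n_req = copri_n([0.40, 0.35], Gamma)   # illustrative standardized effects
```

Because both components must cross their critical values, n_req exceeds the sample size that the weaker component would need on its own, mirroring the behaviour reported for the MUSE example.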

Multiple primary endpoints
Using the Z_k and Z†_k defined for co-primary endpoints and assuming n_T = n_C = n, we can define the overall power as

1 − β = P(Z_1 > z_α ∪ … ∪ Z_K > z_α | H_1). (14)

In order to obtain an appropriate power function we rely on the inclusion-exclusion principle, P(∪_k A_k) = Σ_k P(A_k) − Σ_{k<k′} P(A_k ∩ A_k′) + … + (−1)^{K+1} P(A_1 ∩ … ∩ A_K), applied with A_k = {Z_k > z_α}.
A closed form expression for the overall power is then

1 − β = Σ_k P(Z_k > z_α | H_1) − Σ_{k<k′} P(Z_k > z_α, Z_k′ > z_α | H_1) + … + (−1)^{K+1} P(Z_1 > z_α, …, Z_K > z_α | H_1), (15)

where each term is evaluated using the asymptotic multivariate normality of Z†. We then input different values for n to achieve the required power. Note that when using the union-intersection test for multiple primary endpoints, a correction must be applied to control the family-wise error rate (FWER). Approaches used for multiple primary continuous endpoints, such as the Bonferroni and Holm corrections, may also be implemented in this setting.
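The union power is equivalently one minus the probability that no test rejects, which can be evaluated as a single multivariate normal probability. A sketch with a Bonferroni-adjusted critical value, in Python with scipy rather than the paper's R code and with illustrative effect sizes and correlation:

```python
# Sketch of the multiple primary (union) power via the complement form,
# with an optional Bonferroni adjustment; all inputs are illustrative.
import numpy as np
from scipy.stats import multivariate_normal, norm

def multi_primary_power(n, std_effects, Gamma, alpha=0.025, bonferroni=True):
    """1 - P(no Z_k exceeds its critical value | H_1), as one K-dim probability."""
    K = len(std_effects)
    a = alpha / K if bonferroni else alpha
    z_a = norm.ppf(1 - a)
    upper = z_a - np.sqrt(n / 2) * np.asarray(std_effects)
    none_reject = multivariate_normal(mean=np.zeros(K), cov=Gamma).cdf(upper)
    return float(1 - none_reject)

Gamma = np.array([[1.0, 0.5], [0.5, 1.0]])
p_100 = multi_primary_power(100, [0.40, 0.35], Gamma)
p_200 = multi_primary_power(200, [0.40, 0.35], Gamma)
```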

Composite endpoints
As the endpoint of interest is specified in terms of the overall one dimensional composite endpoint, we can use the formula assumed when employing the standard test of proportions. As σ²_δ = σ²_δT/n_T + σ²_δC/n_C, we can assume that σ_δT = σ_δC = σ and n_T = n_C = n, so that δ̂ ∼ N(δ_T − δ_C, 2σ²/n). The power is deduced in the standard way, as demonstrated below.

1 − β = P(p̂_T − p̂_C > z_α√(2σ²/n) | H_1) = P(Z > z_α − δ*/√(2σ²/n) | H_1) = Φ(δ*/√(2σ²/n) − z_α). (16)

Note that σ²_δ = 2σ²/n; however, to obtain a formula in terms of the required sample size we need to separate n from the variance estimate. By fitting the model to pilot trial data we can obtain an estimate for σ², as the value of n is known in that instance, and the required sample size can be obtained using

n = 2σ²(z_α + z_β)²/δ*². (17)

This is similar to the sample size equation used for the binary method; however, σ is not derived in the standard way and δ* is obtained using latent means as opposed to being provided directly.
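The sample size formula is straightforward to compute once δ* and σ² are available from a pilot model fit. A small sketch with purely illustrative inputs (not the MUSE trial estimates):

```python
# Sketch of the composite sample size formula n = 2*sigma^2*(z_alpha + z_beta)^2 / delta*^2.
# delta* and sigma^2 would come from a pilot fit; values below are illustrative.
from math import ceil
from statistics import NormalDist

def composite_n(delta_star, sigma2, power=0.80, alpha=0.025):
    """Per-arm sample size, rounded up to the next integer."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * sigma2 * (z_a + z_b) ** 2 / delta_star ** 2)

n_req = composite_n(delta_star=0.14, sigma2=0.05)
```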

MUSE trial
We illustrate the technique for sample size determination using the MUSE trial. 39 The trial was a phase IIb, randomized, double-blind, placebo-controlled study investigating the efficacy and safety of anifrolumab in adults with moderate to severe systemic lupus erythematosus (SLE). Patients (n=305) were randomized (1:1:1) to receive anifrolumab (300 or 1000 mg) or placebo, in addition to standard therapy, every 4 weeks for 48 weeks. The primary endpoint in the study was the percentage of patients achieving an SLE Responder Index (SRI) response at week 24, with sustained reduction of oral corticosteroids (<10 mg/day and less than or equal to the dose at week 1, from week 12 through 24). SRI comprises a continuous Physician's Global Assessment (PGA) measure, a continuous SLE Disease Activity Index (SLEDAI) measure and an ordinal British Isles Lupus Assessment Group (BILAG) measure. 42 The study had a target sample size of 100 patients per group, based on providing 88% power at the two-sided 0.10 alpha level to detect at least a 20% absolute improvement in SRI(4) response rate at week 24 for anifrolumab relative to placebo. The investigators assumed a 40% placebo response rate.

Computation
We conducted the computations in R version 4.0.2, defining functions that evaluate the power for each of the endpoints using a combination of the pnorm and pmvnorm functions.

Results
The power is largest for the multiple primary endpoint, where 80% is achieved for n=37 patients in each arm. The power for the composite endpoint is similar to that of PGA, the component with the highest effect size. As we would expect, the power is considerably lower for the co-primary endpoints, which would require n=325 for 80% power (Figure 1). Table 1 shows the sample sizes required in each group for the co-primary and multiple primary endpoints to obtain an overall power of at least 80% to detect a difference of 0.88 in SLEDAI, 0.38 in PGA, 0.24 in BILAG and 0.40 in the taper outcome, based on the values observed in the trial. We allow for uncertainty in the variance of the continuous measures by setting σ²_1 = (18, 19, 20) and σ²_2 = (0.35, 0.45, 0.55, 0.65). The sample sizes required for each individual endpoint are also shown, based on achieving a power of at least 80%. Allowing for uncertainty in the variance of the SLEDAI outcome varies the required sample size for the co-primary endpoint but not the multiple primary endpoint. The opposite is true when the assumed variance of the PGA outcome is changed, namely affecting the sample size required for the multiple primary endpoint but not the co-primary. This is intuitive given that the treatment effect observed in the SLEDAI outcome is the smallest and that in the PGA outcome the largest. For the co-primary and composite endpoints the power is largest when the correlation between the endpoints is high, whereas for multiple primary endpoints the power is largest for zero correlation between endpoints (Figure 2).
We assume that a future trial in SLE is to be conducted using the composite responder endpoint, allowing for uncertainty in σ. The estimated variance for the risk difference from the trial dataset is σ²_δ = 0.048, with correlation parameters ρ_12 = 0.448, ρ_13 = 0.521, ρ_14 = 0.003, ρ_23 = 0.448, ρ_24 = −0.031, ρ_34 = 0.066. For a risk difference of 0.14, the required sample size per group is 50, compared to 135 for 88% power with the standard binary method. If the method were employed for increased power, rather than a decrease in required sample size, the estimated power of the latent variable method is over 99.99% for sample sizes giving 88% power at the 0.05 one-sided alpha level in the binary method. The empirical power of the latent variable method in 1000 simulated datasets is approximately 88% for each sample size, as required. Note that the sample size for composite endpoints is highly dependent on the responder thresholds chosen, which will be predefined by clinicians.

Empirical Performance Of Sample Sizes
The behavior of the sample sizes obtained for each of the endpoints can be shown empirically. Assuming the four dimensional SLE endpoint, we calculate the empirical power by simulating 100 000 datasets from the multivariate normal distribution and applying the corresponding tests for both the observed and latent continuous outcomes. The key concern for the co-primary endpoints is that the method gives the appropriate power whereas for multiple primary endpoints we must ensure the family-wise error rate is controlled.
The sample sizes required for each of the three endpoints and the corresponding empirical power are shown for effect sizes observed in the MUSE trial, with low, medium, and high correlation assumed between endpoints (Table 2). The empirical power derived is approximately equal to the desired power of 80% for all endpoints. As is well recognized in the multiple testing literature, the type I error rate must be controlled when multiple primary endpoints are tested using the union-intersection test. The degree to which the type I error rate is inflated depends on the number of outcomes and the correlation between outcomes, where lower correlation between outcomes and a larger number of outcomes result in larger inflations (Figure 3). The performance of the Bonferroni correction in this setting is shown, where it is conservative in the case of high correlation between endpoints. As the maximum correlation between outcomes in the MUSE trial endpoint used in the numerical example is 0.5, we expect the sample sizes shown for this application to be a good estimate. If very large positive correlations between the endpoints are expected, the required sample size from this approach may be overestimated. The code to obtain these empirical results is provided in the 'Software' section.

Discussion
The work in this article demonstrated the various ways in which a latent variable framework may be employed for mixed continuous, ordinal, and binary outcomes. We illustrated sample size determination in the case of mixed continuous, ordinal, and binary co-primary outcomes. We extended this to allow for sample size determination in the case of mixed multiple primary endpoints and proposed a technique to estimate the sample size when using a latent variable model for the underlying structure of a mixed composite endpoint. For co-primary and multiple primary endpoints the resulting hypothesis is based on an intersection or union of the hypotheses for the individual outcomes and so is multivariate in nature. However, for composite responder endpoints the hypothesis of interest is stated in relation to the overall responder endpoint and so is univariate. Sample size estimation in this case can make use of the standard power and sample size functions but requires the distribution of the test statistic under the alternative hypothesis which we approximate using latent-level means and a Taylor series expansion.
We applied the methods to a numerical example based on a phase IIb study. For the correlation structure observed in the MUSE trial, the sample size required for the co-primary endpoint was greater than that required for the individual endpoint with the lowest effect size. Alternatively, the sample size required for the multiple primary endpoint changes based on the variance assumed for the outcome with the largest treatment effect, however is similar to that required by the individual endpoint. The sample size required for the composite endpoint was between that required for the individual outcome with the largest and second largest effect size. Given that in the composite case we are concerned with the overall binary response endpoint, we compared the sample sizes required for the endpoint using the latent variable model with the standard binary method which we showed offered a large gain in efficiency. Results of the simulated scenarios agree with previous findings that the inclusion of the ordinal component with five levels is only responsible for a very small proportion of the precision gains. Given that the inclusion of the ordinal component substantially increases complexity and computational demand, it may be sufficient to combine any ordinal components with the binary outcome if necessary. Detailed simulation results for the composite endpoint are shown in the Supplementary Material.
One practical consideration when calculating the sample size for a trial using the latent variable model is the need to specify a large number of parameters, even in the case of only a few outcomes. Estimates for the parameters could be obtained by fitting the model to pilot data; however, this is potentially challenging and restrictive for a number of reasons. First, it requires that a pilot or earlier phase trial has already taken place. Furthermore, the pilot data could be fundamentally different from the future trial and the observed effects may be imprecise. Therefore, placing too much emphasis on the existing data may lead to problems in the main trial. In theory, it is possible to specify the required covariance parameters without data; however, this would be difficult in practice. Additionally, in the case of composite endpoints, we cannot define the variance in terms of the model parameters only, as the treatment effect is defined for the one-dimensional composite and so is a function of the parameters. This means that the full covariance matrix of the estimated parameters is required for the Taylor series derivation. An alternative when no data are available is to apply the method using the sample size required to achieve 80% power for the binary method, and avail of the large increase in power. Alternatively, we can directly specify σ_δ based on expert elicitation, as is sometimes the case in practice for standard one-dimensional endpoints. Allowing for uncertainty in these quantities and choosing conservative values should provide an appropriate sample size estimate.
It is possible to extend this approach to use adaptive sample size re-estimation, or an internal pilot to allow for reductions in the required sample size in the trial as we collect more information about the treatment effect variability.

Software
The code to obtain the results in this article is available at https://github.com/martinamcm/mcmenamin_2021_multsamp. A Shiny application for implementing the method is available at https://martinamcm.shinyapps.io/multsampsize/. Documentation and example data are available at https://github.com/martinamcm/MultSampSize.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure 1: Power functions for the individual SLEDAI (continuous), PGA (continuous), BILAG (ordinal), and Taper (binary) outcomes, and the power functions when they are treated as co-primary, multiple primary, and composite endpoints, using data from the MUSE trial.

Figure 3: Family-wise error rate (FWER) of the multiple primary endpoints, shown both unadjusted and adjusted using the Bonferroni correction. FWERs are shown for K = (2, 3, 4) outcomes, with correlations constrained to be equal between all outcomes.

Table 1: Sample sizes n = n_C = n_T for the co-primary and multiple primary endpoints for overall power 1 − β ≈ 0.80, α = 0.025, k_m = 2, K = 4, using the MUSE trial data.

Table 2: Sample sizes and empirical power (%) for n = n_C = n_T for the co-primary, multiple primary, and composite endpoints for overall power 1 − β ≈ 0.80, α = 0.025, k_m = 2, K = 4, with observed and latent effect sizes.