A comparison of methods for analysing multiple outcome measures in randomised controlled trials using a simulation study

Abstract Multiple primary outcomes are sometimes collected and analysed in randomised controlled trials (RCTs), and are used in favour of a single outcome. By collecting multiple primary outcomes, it is possible to fully evaluate the effect that an intervention has for a given disease process. A simple approach to analysing multiple outcomes is to consider each outcome separately, however, this approach does not account for any pairwise correlations between the outcomes. Any cases with missing values must be ignored, unless an additional imputation step is performed. Alternatively, multivariate methods that explicitly model the pairwise correlations between the outcomes may be more efficient when some of the outcomes have missing values. In this paper, we present an overview of relevant methods that can be used to analyse multiple outcome measures in RCTs, including methods based on multivariate multilevel (MM) models. We perform simulation studies to evaluate the bias in the estimates of the intervention effects and the power of detecting true intervention effects observed when using selected methods. Different simulation scenarios were constructed by varying the number of outcomes, the type of outcomes, the degree of correlations between the outcomes and the proportions and mechanisms of missing data. We compare multivariate methods to univariate methods with and without multiple imputation. When there are strong correlations between the outcome measures (ρ > .4), our simulation studies suggest that there are small power gains when using the MM model when compared to analysing the outcome measures separately. In contrast, when there are weak correlations (ρ < .4), the power is reduced when using univariate methods with multiple imputation when compared to analysing the outcome measures separately.


INTRODUCTION
In most clinical trials, a single primary outcome is specified to investigate the effect of a health intervention and this is often sufficient to determine whether the intervention is effective. However, for many diseases and disorders, a patient's health status cannot be adequately quantified using a single primary outcome. Examples include mental health disorders, stroke (Mayo & Scott, 2011) and chronic obstructive pulmonary disease (COPD) (Agusti & Vestbo, 2011;De Los Reyes, Kundey, & Wang, 2011;Teixeira-Pinto, Siddique, Gibbons, & Normand, 2009). Therefore, in these disease areas, multiple primary outcomes may be required to provide a comprehensive understanding of the effects of an intervention. These multiple outcomes may be of the same data type. For example, several continuous outcomes may be measured to quantify cognitive and behavioural components in order to evaluate the effect of cognitive behavioural therapy on patients with a depressive disorder. Alternatively, the outcomes may be of different data types. For example, researchers might measure a continuous quality of life outcome and a binary outcome to indicate symptom relapse when evaluating the effect of an antipsychotic drug on people with schizophrenia.
However, missing outcome data are a common problem for randomised controlled trials (RCTs) since it is not always possible to measure all specified primary outcomes for all participants. In fact, a review of published trials showed that outcome data were missing in the majority of trials (Bell, Fiero, Horton, & Hsu, 2014). Missing outcome data will generally result in a loss of power and may lead to biased estimates of the effect of the intervention. For example, patients in a smoking cessation trial may be more likely to drop out if they continue to smoke, and therefore the patients with observed outcome data may not be a representative sample.
Several approaches have been used to analyse trials with multiple outcomes in the presence of missing data. A common approach, which is appealing due its simplicity, has been to analyse the outcomes separately within a univariate framework (Vickerstaff, Ambler, King, Nazareth, & Omar, 2015). Patients are typically omitted from any analysis for which they have missing outcome data. However, this approach does not account for the correlation between the outcomes and consequently the precision of the estimates and the power may be lower than that achieved by other approaches . A variation on this approach is to use multiple imputation to impute missing outcome data prior to univariate analysis of the outcomes (White, Royston, & Wood, 2011). An advantage of this approach is that all outcomes may be included in the imputation model and hence the correlation between the outcomes may be accounted for (White et al., 2011).
More advanced approaches include the use of multivariate methods such as the multivariate multilevel (MM) model and multivariate regression. These methods have been used to analyse examination results in schools (Goldstein et al., 1993;Yang, Goldstein, Browne, & Woodhouse, 2002), crime trends (Mohan, Twigg, & Taylor, 2011;Tseloni & Zarafonitou, 2008) and health-related behaviour (Maas, Verheij, Spreeuwenberg, & Groenewegen, 2008). However, the use of these methods in trials has been limited despite their potential to increase power (Snijders & Bosker, 2012). For example, the MM model has occasionally been used for exploratory analyses in clinical trials (Hassiotis et al., 2009;King et al., 2002).
Our recent review of published trials in neurology and psychiatry showed that multiple primary outcomes were commonly used but often inadequately analysed (Vickerstaff et al., 2015).
When analysing multiple outcomes in a trial, it is important to control the familywise error rate (FWER), which is the probability of obtaining at least one false positive result across the outcomes. A common approach to control the FWER is to adjust the p-values produced by each statistical test (Dmitrienko & D'Agostino, 2013). Another important consideration for a trial with multiple primary outcomes is the definition of power since there are a number of ways that this may be defined. When the objective of the trial is to detect an effect for at least one of the specified outcomes, it is recommended to use the disjunctive power (Bretz, Hothorn, & Westfall, 2010;Dmitrienko, Tamhane, & Bretz, 2009) which is the probability of detecting at least one true intervention effect across the outcomes as statistically significant (Westfall, Tobias, & Wolfinger, 2011).
In this paper, we compare univariate and multivariate methods for the analysis of clinical trials with correlated multiple primary outcomes in the presence of missing data. The paper is structured as follows. First, we describe two motivating case studies. Then we present an overview of relevant methods to analyse trials with correlated multiple primary outcomes. Finally, we present the results of a simulation study to compare the bias, power and FWER achieved by some of these methods.

MOTIVATING EXAMPLES
In this section, we describe two clinical trials that will be used to illustrate the different approaches to analysis. The first trial has three continuous outcomes whereas the second has a mixture of continuous and binary outcomes.

Laser in glaucoma and ocular hypertension trial, LiGHT
LiGHT is a two-arm, individually RCT (Gazzard et al., 2019) that recruited 718 patients with ocular hypertension or glaucoma taken from six centres across the United Kingdom. The primary outcome is EQ-5D, which measures quality of life. In addition, there are two glaucoma-specific quality of life outcomes: Glaucoma Quality of Life (GQL) and Glaucoma Utility Index (GUI). There are strong correlations between the baseline values of these three outcomes (EQ-5D and GQL r = −0.51; EQ-5D and GUI r = 0.51; GQL and GUI r = −0.72). At 24-month follow-up, 652 patients provided some outcome data although only 586 participants provided data on all three outcomes. In total, 652 (9% missing), 586 (18% missing) and 602 (16% missing) patients provided data for EQ-5D, GQL and GUI, respectively.

Ten Top Tips (10TT) trial
The 10TT is a two-arm, individually RCT (Beeken et al., 2012(Beeken et al., , 2017 that recruited 537 obese patients from 14 general practices across England. The aim of the trial was to investigate the effect of the 10TT intervention in primary care on obesity, where the intervention consisted of a leaflet called 'Ten Top Tips' listing target behaviours alongside advice on repetition and context stability (Beeken et al., 2012). The primary outcome is weight change (kg) and two important secondary outcomes are change in waist circumference (cm) and blood glucose level (mmol/L). Weight change and waist circumferences are continuous outcomes whereas blood glucose level is binary after being categorised into 'standard' and 'high' groups.
High blood glucose is defined as levels greater than 7.0 mmol/L (Organisation, 2018) with 18% of the trial participants categorised in this group. Correlation coefficients measured at baseline show that weight is strongly correlated with waist circumference ( = 0.78). There is a weak/moderate correlation between weight and blood glucose level ( = 0.28) and blood glucose level and waist circumference ( = 0.36). At 3 months, 388 participants provided data on at least one outcome variable, with 383 (29% missing), 378 (30% missing) and 330 (39% missing) values provided for weight change, waist circumference and blood glucose level, respectively.

STATISTICAL METHODS FOR THE ANALYSIS OF MULTIPLE OUTCOMES
In this section, we briefly describe relevant methods that have been proposed to analyse multiple correlated outcomes in clinical trials. Let us consider a two-arm trial with trial participants and correlated primary outcomes. The th trial participant is randomly assigned to either the intervention group ( = 1) or the control group ( = 0), for = 1, … , . Let be the value of the th outcome for the th participant and 1 represents the effect of the intervention on the th outcome. The aim of the trial is to test the null hypotheses 0 ∶ 1 = 0 for = 1, … , , where the th null hypothesis states that the intervention has no effect on the th outcome. A test statistic is used to test each null hypothesis 0 and is the corresponding unadjusted p-value. Further suppose that there is an overall null hypothesis ( ) = ⋂

=1
and that the joint test statistic ( 1 , … , ) has an m-variate distribution. Statistical significance is set to . For clarity, the subscript for participants is omitted in most of the following notation and the models include no covariates in addition to that for the intervention. Unless otherwise stated, we consider only disjunctive power and hence an intervention is shown to be effective if a statistically significant result is found for at least one outcome.

Combined outcome
One straightforward approach to analysing multiple outcomes is to combine the outcomes into a single composite outcome. For example, to combine two time-to-event outcomes, we might consider the time until the first event (Dmitrienko et al., 2009). An example of this might be time from randomisation until either nonfatal ischemic stroke, fatal ischemic stroke or early death. The composite outcome approach avoids testing outcomes separately and the need to adjust p-values (Phillips & Haudiquet, 2003). The composite outcome needs to be specified before the trial starts and all of its components should be of equal importance when assessing the effect of the intervention (Montori et al., 2005). A composite outcome may not be appropriate when the effects of an intervention differ in magnitude and/or direction across the outcomes (Pogue, Devereaux, Thabane, & Yusuf, 2012). In particular, the latter may result in a large loss of power.

Analysing outcomes separately
As discussed earlier, a common approach to analysing multiple outcomes is to analyse each outcome separately within a univariate framework. However, any correlations between the outcomes are not used and an extra imputation step may be required in the presence of missing data.

Multivariate analysis
More advanced techniques, including multivariate methods (Goldstein, 2011), have been proposed that enable multiples outcomes to be analysed simultaneously taking into account the correlations between them (Teixeira-Pinto & Normand, 2009). The use of these methods could potentially lead to improved precision and greater power (McCulloch, 2008) and hence smaller sample sizes. In addition, depending on the objective of the trial, we may also estimate an overall effect of the intervention across outcomes, as well as a separate effect for each outcome.

Global statistical tests
Another multivariate approach is to use a global testing procedure to estimate an overall effect of the intervention across outcomes, with the trial deemed a success if the overall effect is statistically significant. Conceptually, the interpretation of results obtained from global test procedures and the analysis of composite outcomes are similar, and both avoid the issues associated with testing outcomes separately. However, unlike composite outcomes, global test procedures account for the correlations between outcomes. Methods include the multivariate analysis of variance (MANOVA), the one-degree of freedom global test developed by Roy (Roy, Lin, & Ryan, 2003) and the test statistics developed by O'Brien (O'Brien, 1984) and extended by Pocock et al. (Pocock, 1997). Global testing procedures require balanced data across all outcomes and will omit observations if any outcome values are missing. Given this limitation, global testing procedures are not widely used in clinical trials and therefore are not discussed further.

Multivariate regression
Multivariate regression is an extension of multiple regression that allows for multiple outcomes of the same type to be analysed. For example, this approach may be used to analyse several continuous or several binary outcomes. Multivariate regression also requires balanced data across the outcomes.

Factorisation modelling
This approach involves factorising the joint distribution of two correlated outcomes into a marginal distribution and a conditional distribution. Univariate (UV) models can then be fitted to both components of this factorisation (Teixeira-Pinto & Harezlak, 2013). It is possible to use different types of outcomes within this framework although the estimated intervention effects are likely to be different from those obtained by analysing the outcomes separately because of different distributional assumptions. With two correlated outcomes, one continuous ( 1 ) and one binary ( 2 ), we can use one of the two possible factorisations of their joint distribution 1 , 2 ( 1 , 2 ) = 1 | 2 ( 1 | 2 ) 2 ( 2 ). Fitzmaurice and Laird (1995) describe a factorisation model which uses a linear model for 1 and a probit model for 2 , and includes a covariate for intervention group. The model is 1 = 01 + 11 + ( 2 − 2 ) + ∈ 1 , ( 2 ) = 02 + 12 , where ∈ 1 ∼ (0, 2 ) is a normally distributed random variable with mean zero and variance 2 , and quantifies the association between 1 and 2 . Catalano and Ryan (1992) propose the 'reverse' of this model which uses the other possible factorisation 1 , 2 ( 1 , 2 ) = 1 ( 1 ) 2 | 1 ( 2 | 1 ). This is described in Teixeira-Pinto and Harezlak (2013). At present, there is no guidance on how to analyse more than two outcomes using the factorisation model.

Latent modelling
Several researchers have suggested methods that use latent variables (LVs) to model multiple correlated outcomes, including McCulloch (2008), Sammel, Ryan, and Legler (1997) and Dunson (2000). McCulloch (2008) suggests specifying a random effect that is shared across outcomes. Assuming we have one continuous normally distributed outcome ( 1 ) and a binary outcome ( 2 ), the model is 1 = 01 + 11 + + ∈ 1 ( 2 = 1) = ( 02 + 12 + ) , where 1 ∼ (0, 2 1 ), ∼ (0, 2 ) and 2 1 and 2 are unknown variances. We assume that the LV completely specifies the pairwise correlation between the outcomes and hence, conditional on this variable, the two outcomes are independent. The parameter accounts for the fact that the linear predictors for 1 and 2 are on different scales and will have different variances.
The estimated effects of the intervention from this model are conditional on the LV and consequently they may not be comparable to the estimates obtained from the other methods discussed. To obtain estimates for binary outcomes that are comparable to those obtained from univariate analysis, we divide the regression coefficient 12 by where 2 2 is fixed to 1 if a probit link function is used. A detailed discussion regarding coefficient adjustments can be found in Teixeira-Pinto and Normand (2008). Note that in model (2), there are three variance-covariance parameters ( 2 1 , 2 and ) that need to be estimated. However, there are only two quantities we can use to estimate these parameters and hence, it is necessary to impose an additional constraint to ensure that the model is not over parameterised and the model parameters are identifiable (Teixeira-Pinto & Normand, 2009). One option is to fix the variance of the LV 2 . A similar restriction is needed when analysing multiple continuous or multiple binary outcomes. McCulloch (2008) also investigated the use of this model for other types of outcomes. Sammel et al. (1997) discuss another LV model for mixed discrete and continuous outcomes which allows the use of any distribution from the exponential family.

The MM model
The MM model has been suggested as another approach to analyse correlated multiple outcomes. In the MM model, multiple outcomes are considered to be nested within individuals and are treated in a similar manner to how repeated measurements are treated within the multilevel modelling framework (Goldstein, 2011, Goldstein, Carpenter, Kenward, & Levin, 2009). For two continuous outcomes 1 and 2 , the following model is used: = 1 ( 01 + 11 + ∈ 1 ) + 2 ( 02 + 12 + ∈ 2 ), 1 = 1 if = 1 and 1 = 0 otherwise, where is an indicator for outcome and ∈ ∼ (0, )is the random error for the level two structure where is the unknown covariance matrix. Level one variation is not specified as the level exists solely to define the multivariate structure. The formulation as a multilevel model allows for estimation of a covariance matrix even if some of the outcome data are missing, as long as missingness at random (Goldstein, 2011). In the above model, two intervention effects are specified, one for each outcome. However, a common effect across both outcomes can also be specified. In addition, the model can handle mixed outcome types (Goldstein et al., 2009).

Binary outcomes
Equal: 50% and 65% event rate in control and intervention arms, respectively, for all outcomes.

Mixed outcomes
Constant: ES = 0.35 for all outcomes.

Missing data mechanism
Missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR)

Percentage of missing data values
Low and high levels of missingness. Percentages varied on depending on the missingness mechanisms and the number of outcomes, as described below: MCAR and MAR, two outcomes Low: 15% and 25% missing values in outcomes 1 and 2 High: 30% and 50% missing values in outcomes 1 and 2 MCAR and MAR, four outcomes Low: 15%, 15%, 25% and 25% missing values in outcomes 1, 2, 3 and 4 High: 20%, 30%, 40% and 50% missing values in outcomes 1, 2, 3 and 4 MNAR, two outcomes Low: 15% missing values in outcome 1 High: 50% missing values in outcome 1 High overlapping: 30% and 50% of observations in outcomes 1 and 2 were missing.

MNAR, four outcomes
Low: 15% missing values in outcomes 1 and 2 High: 50% missing values in outcomes 1 and 2 High overlapping: 20%, 30%, 40% and 50% of observations were missing for each of the outcomes respectively.

Summary of the multivariate methods
The factorisation, LV and MM models can handle continuous outcomes, binary outcomes or a mixture of both. In addition, these models can handle nonoverlapping missingness, where values may be missing for some but not all of some of the outcomes. That is, the number of observations does not need to be balanced across outcomes. Also, the factorisation, LV and MM models can easily be extended to handle several outcomes although the factorisation model can be cumbersome with more than two outcomes.

SIMULATION STUDY
In this section, we use simulation to compare the MM and LV models to univariate models with (MI+UV) and without multiple imputation (UV) with respect to power and FWER. Recommendations are made regarding which of these methods provide the most power while controlling the FWER. Several scenarios were considered by varying the number of outcomes, the outcome type, the correlation between outcomes, the size of the intervention effect, the missing data mechanism and the percentage of missing data values. Details of the different simulation factors are described in Table 1.
We used the following model to simulate outcome values for two continuous outcomes = ( ,1 , ,2 ), where indicates whether participant received intervention ( = 1) or control ( = 0), 1 = ( 11 , 12 ) is a vector of intervention effects for each outcome, ∈ are errors which are realisations of a multivariate normal distribution , where is the correlation between outcomes. This model was extended in an obvious way to simulate four continuous outcomes. A similar approach was used to simulate binary outcomes, with an extra final step to dichotomise the continuous outcomes at zero. The sample size was set at 260 for the continuous and mixed scenarios and 340 for the binary scenarios, with equal numbers of participants being allocated to the two intervention groups. These numbers were obtained from sample size calculations for a single outcome using the equal effect sizes in Table 1, 5% statistical significance and 80% power.
We introduced missing data under a variety of assumptions. We use three forms of missingness: (a) missingness completely at random (MCAR), (b) missing at random (MAR) and (c) missing not at random (MNAR). Missingness was implemented by simulating values from a multivariate Bernoulli distribution (Leisch et al., 1998) and setting the outcome variables to be missing according to the corresponding binary indicator. In the MAR scenarios, the probability of missingness depends on the intervention group with outcomes being 1.5 times more likely to be missing in the control arm compared to the intervention arm. To simulate the data under an MNAR mechanism, the probability of missingness was set to increase with increasing outcome values. First, a data set with no missing values was simulated. Then the observations were sorted in ascending order based on the outcome in which missing values were to be introduced and divided into quartile groups. Missingness was then introduced randomly into each quartile group using Table 2. To estimate the FWER, we specified that the intervention had no effect ( 1 = 0) then calculated the proportion of times that a significant test result was observed for at least one of the outcomes over 10,000 simulations. The Holm adjustment was used to control the FWER (Holm, 1979). To estimate the disjunctive power, a similar approach was used with a specified intervention effect ( 1 ≠ 0). The bias in the estimated effects was calculated as the difference between the average effect estimates over simulations and the true values. Monte Carlo standard errors (MCSE) were also calculated to provide an estimate of simulation accuracy for each scenario.
The following methods of analysis were used: For the univariate approach, continuous outcomes were analysed using a linear regression model and binary outcomes were analysed using a probit regression model. The latter was used as it corresponds to how the data were generated.
Multiple imputation was implemented using chained equations (MICE) since this is one of the most widely used methods to impute missing data (Sterne et al., 2009). Outcomes in the two arms were imputed separately which is equivalent to imputing missing values conditional on intervention arm. Forty imputations were used for all scenarios, which is the recommended number of imputations when 50% of the data are missing (Graham, Olchowski, & Gilreath, 2007). Estimates were pooled across imputed data sets using Rubin's rules (Rubin, 2004). The LV models used adaptive quadrature (Rabe-Hesketh, Skrondal, & Pickles, 2005) with 10 integration points to fit the models by maximum likelihood estimation.
To ensure that the model is not over parameterised, we fixed the latent factor variance to 0.8 (Grilli & Rampichini, 2006) in the scenarios with continuous and mixed outcomes. In the scenarios with binary outcomes, we fixed the latent factor to 1. The MM model was implemented in MLwiN via R using the package 'R2MLwiN'. The MI+UV model was implemented using the 'mice'package in R; and the LV method was implemented using GLLAMM (www.gllamm.org) in Stata Release 14 (StataCorp, 2015).

MCAR and MAR mechanisms
The results for the different methods in different simulation scenarios with either MCAR or MAR missing data mechanisms are shown in Tables 3a, 3b and 3c. When analysing two binary outcomes the MM model occasionally did not converge, most frequently when the correlation between the outcomes was strong ( = .8) and there was no effect of intervention. When analysing four binary outcomes, the MM model often did not converge and consequently we do not report these results. The MCSE for the estimates of FWER and disjunctive power were similar for all methods. The MCSE ranged from 0.0016 to 0.0026 for the FWER estimates and from 0.0026 to 0.0050 for the disjunctive power estimates.

Bias
The effect estimates were unbiased for the UV and MI+UV approaches when analysing continuous outcomes (results shown in the Appendix) since the intervention group, which was the predictor of missingness, was included in the models. However, a small bias was observed when analysing two binary outcomes using the MI+UV method (results shown in the Appendix). This may be because the MICE routine in R requires us to use logistic regression to impute missing values for binary variables instead of probit regression, which was used to simulate the binary outcomes. The effect estimates were unbiased when analysing two continuous outcomes using the MM model (results shown in the Appendix).

FWER
The FWER was around the 5% level for the UV approach for all outcome types, whereas it varied between 2.8% and 5.5% for the MI+UV approach when analysing continuous outcomes and was slightly conservative when analysing binary outcomes. The FWER fluctuated around the 5% level for the MM model, with the highest level of 5.8% observed when there were high levels of missing data. When using the LV model, the FWER ranged from 2.7% to 5.7% when analysing continuous and mixed outcomes.

Disjunctive power
The MM and LV models show increased disjunctive power compared to the UV and MI+UV approaches when analysing two continuous outcomes with missing data. These models show power gains over the UV approach even when there is weak correlation, although these gains are modest when the proportion of missing data is less than 30%. The MI+UV approach had lower disjunctive power than the UV approach with two continuous outcomes in weak to moderate correlation scenarios. When there is zero or weak correlation ( = .2) between the outcomes, the imputed outcome values are highly variable which leads to slightly higher empirical standard errors for the effect estimates compared to those obtained by the UV approach (shown in the Appendix). Consequently, disjunctive power is reduced when using the MI+UV approach, particularly when the correlations between outcomes are weak to moderate. When the outcomes are strongly correlated and missing data are not overlapping across outcomes, the observed outcome values are highly predictive of the missing outcome values which lead to smaller empirical standard errors for the MI+UV approach and disjunctive power similar to that of the MM and LV models. When analysing four continuous outcomes, the performance of the MI+UV approach is slightly improved (results presented in the Appendix), although the disjunctive power is still slightly lower than that of the UV approach when the outcomes are uncorrelated. With two binary outcomes, the disjunctive power of the MM model but was slightly higher than that of the other approaches. With mixed outcomes, the MI+UV approach and the MM model performed similarly with slightly higher disjunctive power than that of the UV approach. The LV model had the lowest power when the outcome correlation was 0.6 or higher.    Similar results are observed when analysing four continuous outcomes or when there are varying effect sizes (results presented in the Appendix). The MM model was the only multivariate approach we considered in these scenarios, and the MNAR scenario, as its performance was superior to that of the LV model in the previous simulations.

MNAR mechanism
Both the MM model and MI+UV approach assume that the missing values are MAR. We generated missing values under an MNAR mechanism to investigate if bias in the effect estimates could be reduced by using the correlation between outcomes. As expected, use of either the MM model or MI+UV approach did not reduce bias when the outcomes were uncorrelated. However, bias was reduced when the outcomes were strongly correlated and there were high levels of missing data. In particular, there was a notable reduction in bias when the outcome correlation exceeded 0.4. However, neither approach was able to remove the bias entirely. Results are shown in Figure 1.

Reanalysis of the LiGHT and 10TT trials
We now reanalyse the two real data sets, LiGHT and 10TT, to illustrate differences and similarities between the MM, MI+UV, LV and UV approaches. The code used to implement the MM model in Stata, R and MlwiN is provided in the Appendix.

Laser in glaucoma and ocular hypertension trial, LiGHT
In this analysis, the 24-month outcomes were used and the corresponding baseline values for each outcome were adjusted for in the models. On the EQ-5D and GUI scales, a higher score means better quality of life, whereas on the GQL scale, a higher score means poorer quality of life. When using the MM model, the GQL scale was reversed to enable the estimation of intervention effects that are in the same direction.
The results for the four models are displayed in Table 4. The UV approach uses a different number of participants for each outcome depending on the amount of missingness whereas the MI+UV, MM and LV approaches use all 652 participants for the analyses. The standard errors of the effect estimates are very similar across the approaches.
In summary, similar results are obtained from all approaches possibly due to the relatively small proportion of missing data. One advantage of the MM model is that a joint effect could also be calculated, if appropriate, for some or all of the outcomes. For example, a joint effect could be estimated for the glaucoma specific scales GUI and GQL while simultaneously estimating an individual effect for EQ-5D. In a trial scenario, the decision to estimate a joint effect would need to be made at the start of the study and documented in the statistical analysis plan.

The 10TT trial
For the analysis of the 10TT trial, the outcomes were standardised and the corresponding baseline variables were adjusted for in the various models. The estimated effect for blood glucose differs slightly between the models (Table 5) which may be due to higher proportion of missing data for this outcome. In summary, the MM and LV models allow both continuous and binary outcomes to be analysed simultaneously. However, we found that in this trial reanalysis, use of the MM and LV models made little difference to the results and conclusions when compared to those obtained using the UV approach.

DISCUSSION
In this paper, we have reviewed the statistical methodology that can be used to analyse multiple correlated outcomes in clinical trials. We have performed a simulation study to investigate differences in bias, FWER and disjunctive power achieved using the MM model, an LV model and a univariate model with (MI+UV) and without (UV) multiple imputation.
The simulation results suggest that the disjunctive power may be increased by using MM models as opposed to analysing each outcome separately with or without multiple imputation (UV and MI+UV). However, we found that the power gains were generally small unless the outcomes were strongly correlated or there were high levels of missing data. Pituch, Whittaker, and Chang (2016) and Snijders and Bosker (2012) reported efficiency gains for MM model compared to UV models in presence of missing data based on case studies.
When the pairwise correlations between the outcomes were weak, the power was reduced when using the MI+UV approach compared to using the UV approach. These findings are consistent with the results presented in Sullivan, White, Salter, Ryan, and Lee (2018), which state that MI may be less efficient than complete case analysis due to Monte Carlo simulation error. When missing values are MNAR, the MI+UV approach and the MM model produce very similar effect estimates and hence a similar level of bias. Both approaches provide some improvement over the UV model when the outcome correlations are .6 or higher. As expected, neither MI+UV nor MM removed the bias entirely. As a consequence, any inferences and conclusions made within the trial setting should be confirmed with sensitivity analyses under the alternative assumptions that the missing data are MNAR.
The MM model offers a computational advantage to the MI+UV approach as the MM model enables the analysis to be performed in just one step. In contrast, MI+UV method requires three steps: specifying the imputation model and performing the imputation; fitting the analysis model to each of the imputed data sets and combining the results across the imputed data sets.
When a single primary outcome is specified in a trial, the MM model can still be used for the analysis of secondary outcomes. Alternatively, when there are missing values in the primary outcome, both the primary and secondary outcomes may be analysed simultaneously using the MM model. In addition, the MM model allows for joint effects to be estimated although this should be documented in advance in a statistical analysis plan.
The results from the LV model are dependent on the constraints imposed on the model. In this paper, we fixed the latent factor variance. For a discussion of alternative constraints, see Skrondal and Rabe-Hesketh (2004). A limitation of our simulation study is that we only considered normally distributed continuous outcomes. As further work, we suggest investigating continuous outcomes that follow alternative distributions, including skewed distributions or those with heavier tails.
The work focused on methods that have been previously suggested to analyse multiple outcomes in clinical trials. However, other methods are also available including copula models (Chen & Hanson, 2017;de Leon & Wu, 2011) and generalised estimation equations (GEE) (Prentice & Zhao, 1991;Teixeira-Pinto & Normand, 2011). The GEE approach is robust to the misspecification of the correlation between the outcomes but has been shown to be less efficient estimates compared to the LV model (Teixeira-Pinto & Normand, 2011).

A C K N O W L E D G E M E N T S
The authors would like to thank Gus Gazzard (Moorfields Eye Hospital, University College London) for the LiGHT trial data and Jane Wardle and Rebecca Beeken for The Ten Top Tips data (Health Behaviour Research Centre, University College London). This work was undertaken at University College London/University College London Hospitals who received a proportion of funding from the UK Department of Health's National Institute for Health Research Biomedical Research Centres funding scheme.

C O N F L I C T O F I N T E R E S T
The authors have declared no conflict of interest.

O P E N R E S E A R C H B A D G E S
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section. This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results. The results reported in this article were reproduced partially due to their computational complexity.