Quantifying the robustness of primary analysis results: A case study on missing outcome data in pairwise and network meta‐analysis

Conducting sensitivity analyses is an integral part of the systematic review process to explore the robustness of results derived from the primary analysis. When the primary analysis results can be sensitive to assumptions concerning a model's parameters (e.g., that the missingness mechanism is missing at random), sensitivity analyses become necessary. However, what can be concluded from sensitivity analyses is not always clear. For instance, in pairwise meta‐analysis (PMA) and network meta‐analysis (NMA), conducting sensitivity analyses usually boils down to examining how ‘similar’ the estimated treatment effects from different re‐analyses are to those of the primary analysis, or to placing undue emphasis on statistical significance. To establish objective decision rules regarding the robustness of the primary analysis results, we propose an intuitive index that uses the whole distribution of the estimated treatment effects under the primary and alternative re‐analyses. This novel index is compared to an objective threshold to infer the presence or lack of robustness. In the case of missing outcome data, we additionally propose a graph that contrasts the primary analysis results to those of alternative scenarios about the missingness mechanism in the compared arms. When robustness is questioned according to the proposed index, the suggested graph can demystify the scenarios responsible for producing results inconsistent with the primary analysis. The proposed decision framework is immediately applicable to a broad set of sensitivity analyses in PMA and NMA. We illustrate our framework in the context of missing outcome data in both PMA and NMA using published systematic reviews.


| INTRODUCTION
Undertaking a systematic review involves making several decisions with respect to data collection and quantitative analysis. Sensitivity analysis is a critical element of the systematic review process to explore the impact of different data and/or modelling decisions on the results. If results remain consistent across the alternative re-analyses, they can be considered robust; otherwise, the primary analysis results should be interpreted with great caution. The most commonly used strategy for sensitivity analysis consists of excluding one trial at a time and examining the impact of the removal on the summary treatment effect. 1,2 Often, we are more interested in exploring how sensitive the primary analysis results are to certain assumptions about the data, such as the missing at random (MAR) assumption regarding the outcomes of the missing participants.
Addressing missing outcome data (MOD) in the evidence synthesis framework faces many challenges, because the aggregate nature of the extracted data limits the analysis strategies compared with the options available at the patient level. 3 The literature advocates the importance of undertaking principled sensitivity analyses, especially when dealing with MOD, to assess the robustness of the primary analysis results. For instance, the report titled 'The Prevention and Treatment of Missing Data in Clinical Trials' by the National Research Council's Committee on National Statistics explicitly recommends that 'Sensitivity analyses should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting'. 4 Despite the guidance, authors of systematic reviews appear not to pursue sensitivity analysis regularly.
Recent empirical evidence has revealed that most systematic reviews addressing MOD in the primary analysis did not perform sensitivity analysis. [5][6][7] Of those performing sensitivity analysis, the majority applied exclusion or imputation of MOD before the analysis. Imputation concerned binary outcomes and included clinically implausible scenarios for the outcome of missing participants in the compared interventions. [5][6][7] For instance, all missing participants in both intervention arms were assumed to have experienced the outcome. Such sensitivity analyses are inadequate and inappropriate concerning (a) the scenarios for the missingness mechanisms (i.e., the reasons for MOD) and (b) the analysis methods to handle MOD. A proper sensitivity analysis must include a series of gradually stringent yet clinically plausible scenarios for the missingness mechanisms across the compared interventions. Then, an appropriate re-analysis of the data must include the simultaneous modelling of missing and observed outcomes for the different scenarios about the missingness mechanisms across the compared interventions while accounting for the uncertainty due to MOD.
To initiate a paradigm shift in the analysis of MOD in systematic reviews, we propose a novel framework for sensitivity analysis that promises to engage researchers in a proper and thorough investigation of the impact of MOD on the pairwise meta-analysis (PMA) and network meta-analysis (NMA) results. To objectively infer the robustness of the primary analysis results, we developed an index that measures the overall 'distance' between the primary analysis results and the results obtained under a series of alternative re-analyses. In the present study, we illustrated this index within PMA and NMA for different scenarios about the missingness mechanism in the compared interventions. This index considers the location and uncertainty of the estimated summary treatment effects under the different scenarios for the missingness mechanism. Hence, this index is potentially superior to alternative measures that merely use the magnitude and direction of the estimated treatment effects, or the statistical significance, to infer the robustness of the primary analysis results. This index is particularly attractive in NMA, where the number of compared interventions renders the application of sensitivity analysis in every pairwise comparison tedious. Our proposed framework additionally includes a comprehensive and user-friendly graphical tool that illustrates the estimated parameter of interest (e.g., summary treatment effect), and the uncertainty thereof, under progressively stringent yet clinically relevant scenarios for the missingness mechanisms using a single analysis. This tool aims to replace current suboptimal sensitivity analysis approaches, which (a) rely on clinically extreme scenarios for the missingness mechanisms in the compared interventions, and then (b) apply these scenarios in separate analyses.

What is already known?
• Sensitivity analysis is a critical element of the systematic review process. It becomes imperative in the analysis of aggregate missing outcome data, where the exact missingness mechanisms are unknown.
• Decisions on the robustness of the primary analysis results mostly emphasise the statistical significance and any changes in the summary treatment effect.

What is new?
• We proposed an intuitive robustness index that quantifies the similarity of the results from the primary analysis and alternative pre-specified re-analyses, and we developed a straightforward decision rule to infer the robustness.
• We also proposed a simple graph that illustrates the implications for the primary analysis results of different scenarios about the missingness mechanisms in the compared arms.

Potential impact for Research Synthesis Methods readers outside the authors' field
• The proposed framework of sensitivity analysis is immediately applicable to a broad set of sensitivity analyses within and outside the health fields, ensuring objectivity in the conclusions about the robustness of the primary analysis results in pairwise and network meta-analysis.
We have structured the article as follows. Section 2 introduces two motivating examples: a Cochrane review on a continuous primary outcome and a published systematic review with NMA on a binary primary outcome. In Section 3, we briefly describe the modelling framework for the appropriate handling of aggregate MOD (i.e., one-stage pattern-mixture model); then, we present the robustness index and the proposed graphical tool. In Section 4, we apply the proposed sensitivity analysis framework to the motivating examples. Discussion of our framework in the light of relevant published literature is given in Section 5. Brief recommendations for best practice in the analysis of MOD in systematic reviews are provided in Section 6.

| MOTIVATING EXAMPLES
To investigate whether different scenarios for the missingness mechanism across the compared interventions may compromise the robustness of the primary analysis results with respect to treatment effectiveness, we considered the following motivating examples. Both examples refer to a harmful outcome; therefore, a negative value of the effect measure (see Section 3.6) favours the first intervention of the comparison, whereas a positive value of the effect measure favours the second intervention of the comparison.

| A pairwise meta-analysis on a continuous outcome
Taylor et al. 8 compared inositol with glucose (placebo) in terms of the resolution of a depressive episode using the Hamilton Depression Rating Scale. We selected this Cochrane review from previous work 5 for having at least a moderate number of MOD that were unbalanced in the compared arms in all included trials. Namely, both the percentage of total MOD and the difference in the percentage of MOD in the compared arms were more than 5% in every trial (Table 1 and Tables S1 and S2 in supporting information).

| A network meta-analysis on a binary outcome
Baker et al. 9 compared seven pharmacologic treatments with each other and placebo in chronic obstructive pulmonary disease (COPD) patients ( Figure S1, supporting information). We focused on the exacerbation of COPD as the primary binary outcome. The authors 'lumped' the interventions in four classes, but we considered the interventions separately as reported in the original trials. For illustrative purposes, this network was selected from a previous work 10 for having at least a moderate number of MOD in most trials and observed comparisons ( Table 2 and Table S3 in supporting information).

| METHODS
3.1 | One-stage pattern-mixture model for binary and continuous outcomes

Consider a collection of N trials with different sets of T interventions for a pre-specified population and outcome (binary or continuous). In arm k of trial i, we extract information on the number of participants completing the trial (hereinafter named completers), c_ik, the number randomised, n_ik, and the measured outcome (e.g., the number of participants responding to the randomised intervention, or the mean outcome and standard deviation as measured among the completers). Using a pattern-mixture model, we model missing and observed outcomes simultaneously, and hence, we maintain the randomised sample. In arm k of trial i, we define the pattern-mixture model as follows:

θ_ik = θ^m_ik × q_ik + θ^o_ik × (1 − q_ik),

where θ_ik is the underlying outcome in arm k of trial i given the randomised participants, θ^o_ik and θ^m_ik are the underlying outcomes given the completers and missing participants, respectively, and q_ik is the probability of MOD. In the case of a binary outcome,

θ_ik = Pr(I_ikj = 1), θ^o_ik = Pr(I_ikj = 1 | M_ikj = 1), θ^m_ik = Pr(I_ikj = 1 | M_ikj = 0), q_ik = Pr(M_ikj = 0),

and in the case of a continuous outcome,

θ_ik = E(Y_ikj), θ^o_ik = E(Y_ikj | M_ikj = 1), θ^m_ik = E(Y_ikj | M_ikj = 0),

where I_ikj and M_ikj are dummy variables that indicate whether participant j experienced the event or completed arm k in trial i, respectively, and Y_ikj indicates the continuous outcome measured in participant j.
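As a toy numerical illustration (a Python sketch, not the authors' R/JAGS implementation), the mixture above combines the outcome among missing participants and among completers; the arm-level values below are hypothetical:

```python
def marginal_outcome(theta_obs: float, theta_mis: float, q: float) -> float:
    """Marginal underlying outcome theta_ik in one trial arm under the
    pattern-mixture model: a mixture of the outcome among missing
    participants (weight q, the probability of MOD) and among
    completers (weight 1 - q)."""
    return theta_mis * q + theta_obs * (1.0 - q)

# Hypothetical arm: 20% event risk among completers, 50% among
# missing participants, 10% missing outcome data.
risk = marginal_outcome(theta_obs=0.20, theta_mis=0.50, q=0.10)  # 0.23
```

By weighting both patterns, the model keeps all randomised participants in the analysis rather than restricting it to completers.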

| Informative missingness parameters
White et al. 11 and Mavridis et al. 12 proposed replacing the missingness parameter, θ^m_ik, with the informative missingness odds ratio (IMOR) parameter on the logarithmic scale, or the informative missingness difference of means (IMDoM) parameter, for a binary or continuous outcome, respectively:

φ_ik = log IMOR_ik = logit(θ^m_ik) − logit(θ^o_ik), ψ_ik = IMDoM_ik = θ^m_ik − θ^o_ik.

These parameters measure the departure from the MAR assumption: a non-zero departure implies an informative (non-ignorable) missingness mechanism, whereas a zero departure implies MAR. Parameters φ_ik and ψ_ik are unknown, and their values can be either suggested or estimated; the latter is possible when Bayesian methods are used. Under the Bayesian framework, it is straightforward to estimate these parameters by assigning a plausible prior distribution to them, which enables learning about the missingness mechanism. A normal prior distribution is a natural choice for φ_ik and ψ_ik. We may structure the prior distribution to be specific to the intervention, the trial, or the trial-arm, as well as to be fixed, exchangeable, or independent across interventions, trials, or trial-arms. 10,13 In the present article, we applied intervention-specific log IMORs and IMDoMs that are exchangeable across the trials, for i = 1, 2, …, N and k = 1, 2, …, a_i, where a_i is the number of arms in trial i.
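For intuition about the binary case (a sketch under the logit-link definition of the IMOR above, not the Bayesian estimation itself), a given log IMOR value translates the event probability among completers into the implied probability among missing participants:

```python
import math

def theta_missing_from_log_imor(theta_obs: float, log_imor: float) -> float:
    """Event probability among missing participants implied by the
    probability among completers and a log IMOR value:
    logit(theta_m) = logit(theta_o) + log IMOR."""
    logit_obs = math.log(theta_obs / (1.0 - theta_obs))
    logit_mis = logit_obs + log_imor
    return 1.0 / (1.0 + math.exp(-logit_mis))

# log IMOR = 0 recovers MAR: missing participants share the completers' risk.
mar_risk = theta_missing_from_log_imor(0.20, 0.0)  # ~0.20
```

A positive log IMOR inflates the odds of the event among missing participants relative to completers; a negative value does the opposite.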

| Clinically plausible scenarios for sensitivity analysis
In line with published recommendations, we considered the MAR assumption (i.e., ξ = 0) to be the primary analysis for being a reasonably plausible assumption. 14,15 We selected five values of ξ for log IMOR, symmetric around zero and of the form −log(X), 0, and log(X), to reflect different degrees of informative missingness, where e^(−log(X)) indicates that the odds of the event is X times higher in completers than in missing participants, whereas e^(log(X)) indicates the opposite. For IMDoM, we considered five values of ξ of the form −X, 0, and X, where −X indicates that the outcome increases by X on average in completers as compared to missing participants, whereas X indicates the opposite. Therefore, we get a 25 × 2 matrix with 5² = 25 unique pairs of values (hereinafter named the 'scenario matrix') that indicate 25 different scenarios for the pairwise comparison (first column for the active intervention, second column for the control intervention). While the scenarios mentioned above are directly applicable in a star-shaped network, as all active interventions are compared with a single control intervention, they are not immediately applicable in a non-star-shaped network, where an intervention may serve as the control in one trial and as the active treatment in another. Failure to ensure transitive scenarios for the missingness mechanisms across the observed comparisons of a non-star-shaped network may compromise the validity of the NMA results. 16 A viable solution is to consider the same scenarios for all active interventions of the network. 16 This solution requires that we select a reference intervention for the non-star-shaped network that receives different IMDoM/IMOR values from the rest of the interventions for scenarios that differ in the compared arms. 16 In the supporting information, we illustrate the concept of transitivity in the missingness mechanisms in a triangle network.
For instance, for a non-star-shaped network with four interventions (A, B, C, and D, where D is the reference intervention), each row of the scenario matrix for log IMOR assigns the same value to the active interventions A, B, and C and a (possibly different) value to the reference intervention D.
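The scenario matrix can be generated mechanically as all ordered pairs of grid values; the five values below are placeholders (the paper's exact grid depends on the chosen X):

```python
from itertools import product

def scenario_matrix(values):
    """All ordered pairs (active, control) of candidate IMDoM or
    log-IMOR values: one row per missingness scenario."""
    return list(product(values, repeat=2))

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]   # placeholder IMDoM-style grid of 5 values
scenarios = scenario_matrix(grid)     # 5^2 = 25 scenarios
# The pair (0, 0) is the MAR primary analysis; the rest are informative.
informative = [s for s in scenarios if s != (0.0, 0.0)]  # 24 scenarios
```

The 25 rows correspond to the scenario matrix described in the text, with the MAR pair excluded when summarising the 24 informative re-analyses.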

| The Kullback-Leibler divergence measure
Before presenting the robustness index, we introduce the Kullback-Leibler divergence (KLD) measure, a commonly used measure to compare two probability distributions. 17 In the present study, the KLD measure is a function of the posterior mean and posterior standard deviation of the summary treatment effect under the MAR assumption (i.e., ξ = 0) and each informative scenario (i.e., ξ ≠ 0). The true missingness mechanism is not known, and we consider the MAR assumption to be a reasonably plausible assumption for the primary analysis, following the recommendations of the relevant literature. 14,15 Note that the prior and posterior distributions of a summary treatment effect are conjugate; in this case, both distributions follow the normal distribution. 18 Therefore, the KLD measure from informative scenario i (with 0 denoting the MAR assumption) to the MAR assumption is calculated as follows:

KLD(N_i ‖ N_0) = log(s_0 / s_i) + (s_i^2 + (μ̂_i − μ̂_0)^2) / (2 s_0^2) − 1/2,

where N_i and N_0 refer to the normal posterior distribution under informative scenario i and the MAR assumption, respectively, μ̂_i and μ̂_0 are the corresponding posterior means, and s_i and s_0 are the corresponding posterior standard deviations. The KLD values range from zero to infinity, with zero indicating a perfect match of the distributions. Intuitively, it follows that the lower the KLD value, the 'closer' the two distributions are; thus, approximating the MAR assumption with an informative MOD scenario does not essentially change the information conveyed by the former. The KLD measure is a divergence and not a distance measure, which means that it does not share the symmetry property of a distance measure (see supporting information).
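With normal posteriors, the KLD has the closed form given above; a minimal sketch:

```python
import math

def kld_normal(mean_i: float, sd_i: float, mean_0: float, sd_0: float) -> float:
    """KLD(N_i || N_0) for two univariate normal distributions:
    log(s0/si) + (si^2 + (mu_i - mu_0)^2) / (2 s0^2) - 1/2."""
    return (math.log(sd_0 / sd_i)
            + (sd_i ** 2 + (mean_i - mean_0) ** 2) / (2.0 * sd_0 ** 2)
            - 0.5)

# Identical posteriors give zero divergence ...
assert kld_normal(0.0, 1.0, 0.0, 1.0) == 0.0
# ... and the KLD is asymmetric (a divergence, not a distance):
assert kld_normal(0.0, 2.0, 0.0, 1.0) != kld_normal(0.0, 1.0, 0.0, 2.0)
```

The asymmetry is why the direction matters: here the divergence is always taken from the informative scenario to the MAR primary analysis.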

| The robustness index
The ultimate goal is to combine the KLD values properly across all informative scenarios to infer the robustness of the primary analysis results. Therefore, we propose the robustness index (RI), which quantifies the overall 'distance' between the primary analysis results and the results from the alternative re-analyses, and is a function of the KLD measure. In the context of MOD, RI measures the overall distance of the KLD measures from zero for all |S| informative scenarios (here, |S| = 24):

RI = √( Σ_{i=1}^{|S|} KLD_i^2 ).   (1)

Equation (1) is similar to the Euclidean distance; however, we are only interested in the overall distance from zero, not in the distance among all |S| KLD values. We calculate RI only for the summary treatment effects due to the conjugate 'nature' of their prior and posterior distributions. Note that we can calculate as many RIs as the number of possible comparisons in the network. For a network of T = 4 interventions, we have six possible comparisons (i.e., T(T − 1)/2), and hence, we can calculate a total of six RIs. For a PMA, we obtain only one RI.
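Equation (1) then amounts to the Euclidean norm of the KLD values; a sketch with hypothetical KLD inputs:

```python
import math

def robustness_index(kld_values) -> float:
    """RI = sqrt(sum of squared KLD values) over all informative
    scenarios: the overall 'distance' of the KLD measures from zero."""
    return math.sqrt(sum(k ** 2 for k in kld_values))

# If every informative scenario matches the primary analysis exactly,
# all KLDs are zero and RI = 0 (perfect robustness).
assert robustness_index([0.0] * 24) == 0.0
```

Because RI pools |S| non-negative terms, a single strongly deviating scenario can be enough to push RI over the robustness threshold.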
A 'typical' sensitivity analysis uses only a fraction of the information concerning the estimated summary treatment effect. This usually involves evaluating the point estimate and the credible interval (CrI) of the summary treatment effect to infer whether its magnitude, direction, and conclusion differ across re-analyses of the same outcome. Contrariwise, the RI, being a function of the KLD measure, retains the richness of information, as it considers the whole posterior distribution of the estimated summary treatment effect under the alternative re-analyses.

| Setting the threshold of robustness
To our knowledge, there are no universally accepted directions on what constitutes similar results when several re-analyses are undertaken and compared to the primary analysis for the summary treatment effects. In this work, we introduce an intuitive rule, quantified by the RI, to classify the overall 'distance' between the primary and alternative re-analysis results as 'low' or 'substantial', translating into presence or lack of robustness, respectively. Specifically, a 'low' overall distance is defined as a distance from zero that is less than one between-trial standard deviation (τ) of low statistical heterogeneity: RI < τ_low. As low statistical heterogeneity, we considered the median of the empirically-based distribution for τ² in the case of a general health-care setting. 19,20 This median is equal to 0.03 on the standardised mean difference (SMD) scale and 0.08 on the log odds ratio (OR) scale. Then, the corresponding threshold of robustness is equal to √0.03 ≈ 0.17 on the SMD scale and √0.08 ≈ 0.28 on the log OR scale (i.e., 1.32 after exponentiation). Therefore, RI < 0.17 or RI < 0.28 indicates 'robustness', and RI ≥ 0.17 or RI ≥ 0.28 implies 'lack of robustness' on the SMD and log OR scale, respectively. Different prior distributions for τ² will result in somewhat different thresholds of robustness: the larger the median of the distribution (as a threshold of 'low' statistical heterogeneity), the larger the threshold of robustness. For instance, in the case of a half-normal prior distribution for τ with variance 0.5 or 1, 21 the median is equal to 0.34 and 0.67, respectively, and thus, the corresponding thresholds are 0.34 and 0.67. Assuming a binary outcome, a threshold of 0.34 or 0.67 on the log OR scale is equivalent to 1.40 or 1.95 on the OR scale. The latter value, in particular, indicates considerable information loss and should not be used to infer the robustness of the primary analysis. In other words, the selection of the prior distribution for τ² to perform Bayesian NMA should not dictate the threshold of robustness. We provide some discussion on the selection of robustness thresholds later on.
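The thresholds quoted above follow directly from the medians of the empirically-based τ² distributions; a short check:

```python
import math

# Median 'low heterogeneity' tau^2 from the empirically-based
# distributions cited in the text (general health-care setting).
TAU2_SMD = 0.03     # standardised mean difference scale
TAU2_LOG_OR = 0.08  # log odds ratio scale

threshold_smd = math.sqrt(TAU2_SMD)        # ~0.17
threshold_log_or = math.sqrt(TAU2_LOG_OR)  # ~0.28
# Exponentiating the (rounded) log-OR threshold gives the OR-scale value
# quoted in the text:
threshold_or = math.exp(round(threshold_log_or, 2))  # exp(0.28) ~ 1.32
```

The same recipe (square root of the τ² median, then exponentiation for ratio measures) yields the alternative thresholds discussed for other priors.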
In the NMA setting, it is more challenging to infer the overall robustness of the primary analysis results, since RI may differ across comparisons: for instance, robustness may be inferred for some comparisons but not for others. To determine the presence or lack of robustness in the whole network, we recommend the following straightforward decision framework:
1. presence of robustness, when RI is less than the selected threshold of robustness for all possible comparisons in the network;
2. lack of robustness, when RI is equal to or exceeds the selected threshold of robustness for at least one comparison in the network.
Using the heatmap, we can illustrate RI for all possible comparisons in the network and use green and red colours to indicate the presence and lack of robustness, respectively.
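The two-rule decision framework is easy to mechanise; the comparison labels and RI values below are illustrative:

```python
def network_robust(ri_by_comparison: dict, threshold: float) -> bool:
    """Presence of robustness only when RI is below the threshold for
    ALL possible comparisons; a single exceedance implies lack of
    robustness for the whole network."""
    return all(ri < threshold for ri in ri_by_comparison.values())

# Illustrative RIs on the log OR scale (threshold 0.28):
assert network_robust({"A vs B": 0.10, "A vs C": 0.20}, 0.28)
assert not network_robust({"A vs B": 0.10, "A vs C": 0.39}, 0.28)
```

In a heatmap, each comparison's cell would be coloured green when its RI is below the threshold and red otherwise.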

| Graphical tool: Enhanced balloon plot
To illustrate the summary treatment effects of the compared interventions for the scenarios presented in Section 3.1.2, we propose the 'enhanced balloon plot'. The following description refers to a PMA, and hence, the terms 'active arm' and 'control arm'. For NMA, the corresponding terms are 'non-reference intervention' and 'reference intervention'. The proposed plot has the following characteristics:
1. For the pairwise comparison, the plot's x-axis and y-axis refer to the same values of IMDoM/IMOR for the active and control arm, respectively. Therefore, each pair of coordinates (i, j) indicates a specific scenario from the scenario matrix. Figure 1 illustrates the main structure of the balloon plot with scenarios for the IMDoM parameter (Figure 1a) and the IMOR parameter (Figure 1b). The four corners reflect the most extreme MOD scenarios. Scenarios above the main diagonal (grey dashed line) indicate a larger IMDoM or IMOR value for the control arm than the active arm, and scenarios below the main diagonal indicate the opposite. Scenarios on the main diagonal refer to the same IMDoM or IMOR value in both arms. The points (0, 0) and (1, 1) correspond to the MAR assumption for IMDoM and IMOR, respectively. Scenarios closer to MAR (the white area) are considered optimistic, and scenarios further from MAR (the light grey area) are sceptical. Scenarios outside the range (the grey area) are considered implausible.
2. For each (i, j), the bubble (point) indicates the estimated effect size. The colour of each bubble reflects the uncertainty around the estimate. The size of each bubble is proportional to the magnitude of the effect: the larger the treatment effect, the larger the bubble.
3. We use a 'crossed bubble' to indicate scenarios associated with conclusive evidence (i.e., zero is not included in the 95% credible interval of SMD or log OR) and a 'filled bubble' for scenarios with inconclusive evidence.
The enhanced balloon plot can be constructed for any parameter of the PMA or NMA model, in the same way as for the summary treatment effects. While the focus is on the summary treatment effects, we additionally present the enhanced balloon plot for the between-trial heterogeneity parameter, τ². In the case of τ², we use a 'crossed bubble' to indicate scenarios associated with considerable statistical heterogeneity (i.e., posterior median of τ² ≥ median of the selected empirically-based prior distribution for τ²) and a 'filled bubble' for scenarios with low statistical heterogeneity.
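The crossed/filled distinction for the treatment effects reduces to checking whether zero falls inside the 95% CrI; a sketch:

```python
def conclusive(cri_lower: float, cri_upper: float) -> bool:
    """'Crossed bubble' when the 95% credible interval of the SMD or
    log OR excludes zero; 'filled bubble' (inconclusive) otherwise."""
    return not (cri_lower <= 0.0 <= cri_upper)

# The MAR estimate in Section 4.1 (SMD -0.10, 95% CrI -0.60 to 0.39)
# would be drawn as a filled (inconclusive) bubble:
assert conclusive(-0.60, 0.39) is False
```

An analogous check against the prior median of τ² would drive the crossed/filled rule for the heterogeneity version of the plot.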
In Section 4, we illustrate how the proposed RI and the enhanced balloon plot can be used jointly in the proposed sensitivity framework for a PMA and a network of interventions. Since RI signals whether a comparison is associated (or not) with robust primary results, the enhanced balloon plot can demystify those scenarios with substantially deviant results from MAR that may contribute to the lack of robustness in that comparison.

| Model implementation
For each scenario, we performed one-stage Bayesian random-effects PMA for the Cochrane review, 22 and one-stage Bayesian random-effects NMA with consistency equations for the published systematic review, while incorporating the pattern-mixture model. 10 We considered the log OR and the SMD as the effect measures, which relate intuitively to the log IMOR and IMDoM parameters, respectively. For the NMA, we considered the scenario matrix in Section 3.1.2 for non-star-shaped networks with placebo as the reference intervention. For the location parameters, we assigned a normal prior distribution with mean 0 and variance 10,000. We assigned empirically-based prior distributions on τ²: for the binary outcome, a log-normal prior distribution with mean −2.06 and variance 1.51² (median: 0.13, 95% range: 0.007-2.10), 20 and for the continuous outcome, a log-t prior distribution with mean −2.99, variance 2.16², and 5 degrees of freedom (median: 0.049, 95% range: 0.001-4.70). 19 These prior distributions refer to the 'symptoms reflecting the continuation of condition' and the 'mental health indicators' outcome-types, respectively, for the 'pharmacological versus placebo' comparison-type. 19,20 We ran three chains of different initial values with 100,000 iterations, 10,000 burn-in, and thinning equal to 5 and 10 for the PMA and NMA, respectively. We assessed convergence using the Gelman-Rubin convergence diagnostic and visual inspection of trace plots and autocorrelation plots. 23 We used JAGS 24 via the R-package R2jags 25 (statistical software R, version 3.6.1 26) to implement the models, and the R-package ggplot2 27 to draw the enhanced balloon plot and the heatmap. All functions related to this manuscript are publicly available at https://github.com/LoukiaSpin/Quantifying-Robustness-in-Meta-analysis.git.

FIGURE 1 The main structure of the enhanced balloon plot. Both axes refer to the same range of scenarios about the informative missingness difference of means (IMDoM) parameter for a harmful continuous outcome (plot A) and the informative missingness odds ratio (IMOR) parameter for a harmful binary outcome (plot B). Each pair of coordinates (i, j) indicates a specific scenario from the scenario matrix. The four corners reflect the most extreme scenarios. The points (0, 0) and (1, 1) correspond to the MAR assumption for IMDoM and IMOR, respectively. The white area covers optimistic scenarios for being closer to the MAR assumption, the light grey area covers sceptical scenarios for being further from the MAR assumption, and the grey area implies implausible scenarios that are outside the range. IMDoM, informative missingness difference of means; IMOR, informative missingness odds ratio; MAR, missing at random; MOD, missing outcome data

| Pairwise meta-analysis: Continuous outcome
The summary treatment effects (SMDs) from the primary and sensitivity analyses are shown in the enhanced balloon plot (Figure 2a). Under the MAR assumption, the posterior mean of SMD was −0.10 (95% CrI: −0.60 to 0.39) in favour of inositol, but the CrI included zero (inconclusive evidence). Under the best-case scenario for placebo (bottom right), SMD increased to zero, suggesting that both interventions were equally effective. Contrariwise, in all other scenarios, inositol prevailed over placebo (SMD range: −0.19 to −0.02).
Scenarios that assumed a larger IMDoM in the placebo than inositol (scenarios above the main diagonal that favour inositol) led to a larger posterior mean of SMD (SMD range: −0.19 to −0.11) compared to the primary analysis. On the contrary, scenarios that assumed a larger IMDoM in inositol than placebo (scenarios below the main diagonal that favour placebo) led to the same or a smaller posterior mean of SMD (SMD range: −0.08 to 0.00) as in the primary analysis. As expected, since the outcome was harmful, a larger IMDoM in the experimental than control arm corresponded to a relatively worse outcome in the former as compared to the latter. Therefore, SMD was pulled towards zero in these scenarios. Contrariwise, a smaller IMDoM in the experimental than control arm corresponded to a better outcome in the former than in the latter, and hence, SMD was pulled towards larger negative values.
Scenarios that assigned the same IMDoM to both interventions (main diagonal) yielded SMDs closer to the primary analysis, as compared to scenarios with IMDoMs of opposite sign in the compared interventions or an informative scenario in one arm but MAR in the other arm. However, the evidence was inconclusive in all scenarios (i.e., zero was included in the 95% CrI). Overall, the uncertainty around SMD was very similar across the different scenarios (posterior standard deviation range: 0.25-0.26).
Different scenarios about IMDoM in the compared interventions had a negligible impact on the posterior median of τ² (range: 0.029-0.035) (Figure S2, supporting information). Most scenarios yielded a posterior standard deviation of τ² similar to the primary analysis (MAR: 0.35), apart from the 'best-case for inositol' and the 'better outcome among MOD in both arms' scenarios, which yielded higher posterior standard deviations of 0.53 and 0.51, respectively. The posterior median of τ² was lower than the median of the empirically-based prior distribution and hence reflected low statistical heterogeneity across all scenarios. However, the posterior standard deviation of τ² indicated some uncertainty in estimating the parameter, as the 95% CrI ranged from very low (around 0.0001) to considerable values of τ² (around 0.60) across all scenarios. The small number of trials may have contributed to the wide 95% CrI of τ².

Figure 2b illustrates the distribution of the KLD measures across extreme, sceptical, and optimistic scenarios for IMDoM. Despite the moderate and substantial risk of attrition in the included trials, the KLD measure ranged from 0.00 to 0.07, indicating an overall low 'distance' from every informative scenario to the primary analysis. The RI confirmed the robustness of the primary analysis results when compared to the different scenarios of IMDoM for inositol versus placebo (RI = √0.02 ≈ 0.14 < 0.17). Robustness may be a consequence of adjusting the within-trial results by modelling MOD under different scenarios for the missingness mechanisms.

| Network meta-analysis: Binary outcome
We have obtained the estimated treatment effects of all 28 possible pairwise comparisons (i.e., T(T − 1)/2 with T = 8) for every scenario of the scenario matrix. In this case, it is not advisable to create 28 balloon plots and inspect them separately, as it would be challenging to decide on the robustness of the primary analysis results in every comparison. Thus, we propose starting with the assessment of the heatmap of the RI for all possible comparisons and then creating balloon plots only for those comparisons whose results were contentious compared with the primary analysis results.
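The count of 28 comparisons follows from the number of unordered pairs of T = 8 interventions:

```python
from math import comb

T = 8                       # interventions in the COPD network
n_comparisons = comb(T, 2)  # T(T - 1)/2 = 28 pairwise comparisons
```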

| Heatmap on the robustness index for all possible comparisons in the network
According to the heatmap (Figure 3), the results of the primary analysis were not robust to different IMOR scenarios in any comparison with placebo (range of RI: 0.39-0.57). For illustration, we focused on the four comparisons with placebo that yielded the largest RI: fluticasone, fluticasone plus salmeterol, formoterol, and tiotropium versus placebo. For these comparisons, the KLD measure approached or exceeded the threshold of robustness (0.28) under the worst-case (extreme) scenarios (Figure 4). Overall, more distant scenarios yielded larger values of the KLD measure compared with less distant or non-distant scenarios (Figure 4). For each comparison, among the more distant scenarios, the extreme ones resulted in the largest values of the KLD measure, followed by the sceptical and optimistic ones.
The comparisons without placebo demonstrated the robustness of the primary analysis results (range of RI: 0.03–0.20, Figure 3). A plausible explanation may be that, after adjusting for MOD, any residual bias in the log OR of comparisons with placebo (the reference intervention) may be cancelled out to some extent in the log OR of the comparisons without the reference intervention, due to the consistency equation. 28 Using the decision framework of robustness in the whole network (Section 3.4), we concluded that the credibility of the primary analysis should be called into question for the analysed network. Figure 5 presents a series of enhanced balloon plots for the four aforementioned comparisons with placebo that failed to demonstrate the robustness of the primary analysis results. In line with the results from the PMA (Section 4.1), assigning the same IMOR to the compared interventions (main diagonal) yielded results closest to the primary analysis, followed by scenarios with the same direction but a different IMOR value, and then by scenarios with opposite directions. In line with Figure 4, the ORs obtained under the two extreme scenarios (the corners at the top left and bottom right of the balloon plot) deviated the most from the primary analysis in all comparisons.
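The proposed two-stage screening (heatmap first, balloon plots only where needed) can be sketched as below; the RI values are illustrative numbers chosen within the ranges reported above, not the actual estimates.

```python
# Robustness threshold for log OR used in this example
THRESHOLD = 0.28

# Illustrative RI values (within the reported ranges, not actual estimates)
ri = {
    "fluticasone vs placebo": 0.57,
    "fluticasone+salmeterol vs placebo": 0.39,
    "formoterol vs placebo": 0.45,
    "tiotropium vs placebo": 0.51,
    "tiotropium vs formoterol": 0.11,   # a comparison without placebo
}

# Draw enhanced balloon plots only for the contentious comparisons
contentious = sorted(comp for comp, value in ri.items() if value >= THRESHOLD)
print(contentious)
```

With this screening rule, only the four comparisons with placebo would proceed to the balloon-plot stage.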

| Enhanced balloon plots of comparisons with lack of robustness
For fluticasone versus placebo (Figure 5a), all scenarios indicated fluticasone as more efficacious than placebo, although the different informative scenarios substantially affected the magnitude of the OR. Assuming a larger IMOR in the placebo arm than in the fluticasone arm led to ORs favouring fluticasone more strongly (range: 0.63–0.75) than the reverse assumption did (range: 0.78–0.94), yet the evidence was inconclusive in all scenarios. Overall, there was some variability in the uncertainty around the log OR across the scenarios: according to the colour scale, assigning IMOR < 1 in both arms yielded slightly more precise results than assigning IMOR > 1 in both arms. The same conclusions were drawn for tiotropium versus placebo; however, the evidence was conclusive in favour of tiotropium across all scenarios (Figure 5b).
In Figure 5c, scenarios with a larger IMOR in the placebo arm than in the formoterol arm yielded ORs favouring the latter (range: 0.75–0.88). In contrast, scenarios with a larger IMOR in formoterol than in placebo yielded ORs close to or slightly larger than one (range: 0.93–1.12). The evidence was inconclusive in all scenarios. Overall, the uncertainty around the log OR was similar across the scenarios (posterior standard deviation range: 0.26–0.29).
For the combination intervention versus placebo (Figure 5d), conclusions were similar to those for fluticasone and tiotropium versus placebo. The four most extreme scenarios that assigned a larger IMOR to the placebo arm than to the combination intervention yielded conclusive evidence in favour of the latter. For a harmful outcome, a larger IMOR in the placebo arm than in the active interventions yielded a smaller OR favouring the latter; furthermore, it changed the conclusions for the combination intervention versus placebo. A possible explanation may be that these scenarios led to a relative increase in the odds of exacerbations in the placebo arm compared with the active interventions. In contrast, a larger IMOR in the active interventions than in placebo yielded a larger OR favouring placebo and changed the direction of the OR for formoterol versus placebo. Furthermore, IMOR < 1 in both arms corresponds to logit(θm) < logit(θo), and hence to a smaller variance of logit(θo) compared with logit(θm). In combination with q < 0.5 observed in all trials of the network, θo received more weight than θm in the estimation of the arm-specific risk θ, and of its variance, via the pattern-mixture model. Consequently, these scenarios yielded a more precise log OR than scenarios with IMOR > 1 in both arms. The precision in the estimation of the log OR in the remaining scenarios lay somewhere in between.
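The interplay between the IMOR, the missingness proportion q, and the arm-level risk described above can be made concrete with a small numerical sketch of the pattern-mixture model (our own illustration; the function and values are hypothetical):

```python
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def arm_risk(theta_o, imor, q):
    """Arm-level risk under the pattern-mixture model:
    theta = q * theta_m + (1 - q) * theta_o, where the risk among the
    missing participants (theta_m) is linked to the risk among the
    observed (theta_o) via logit(theta_m) = logit(theta_o) + log(IMOR)."""
    theta_m = expit(logit(theta_o) + math.log(imor))
    return q * theta_m + (1 - q) * theta_o

# IMOR < 1 implies theta_m < theta_o; with q < 0.5 the observed risk
# carries most of the weight, pulling theta towards theta_o.
print(arm_risk(theta_o=0.30, imor=0.5, q=0.2) < 0.30)  # True
# IMOR = 1 recovers the MAR analysis: theta equals theta_o.
print(round(arm_risk(theta_o=0.30, imor=1.0, q=0.2), 6))  # 0.3
```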

FIGURE 5 A panel of enhanced balloon plots for four selected comparisons with placebo. Bubbles show the posterior mean of the summary odds ratio (OR), obtained by exponentiating the posterior mean of the summary log OR under different scenarios of the informative missingness odds ratio (IMOR) parameter. The size of each bubble is proportional to the magnitude of the log OR after rescaling to [0, 1]. ORs below one favour the active intervention; ORs above one favour placebo. Framed numbers refer to the primary analysis. Darker tones of blue indicate smaller posterior standard deviations of the log OR, and darker tones of red indicate larger ones. Crossed bubbles mark scenarios with conclusive evidence (i.e., zero is excluded from the 95% credible interval of the log OR); filled bubbles mark scenarios with inconclusive evidence.

| DISCUSSION
We provide a decision framework to infer the robustness of the primary analysis results in PMA and NMA by conducting appropriate sensitivity analyses. This work addresses two key limitations in the current application of sensitivity analysis in systematic reviews. First, we promote a comprehensive sensitivity analysis process founded on plausible scenarios about the investigated characteristic (here, MOD) and a proper statistical analysis of all scenarios in one stage. Second, we offer a straightforward index of robustness and an intuitive decision framework to conclude on the robustness of the primary analysis results objectively. The RI is particularly useful in NMA for efficiently evaluating the robustness of the primary analysis results across the whole network.
The decision to illustrate the proposed framework in the context of MOD was driven by the inherent uncertainty about the missingness mechanisms, which renders the assessment of the primary analysis results in a sensitivity analysis imperative. However, the RI is relevant to a broader set of sensitivity analyses, for instance, when we investigate the robustness of the summary treatment effects to different prior distributions for τ², 29 or to different assumptions about the probability of publication bias in NMA. 30 Researchers are free to adjust the threshold of robustness to the aims of the systematic review. For instance, if the aim is to develop health policy guidelines (e.g., through health technology assessments), a more stringent robustness threshold may be preferred. If the systematic review is fundamentally exploratory (e.g., to appraise the quality of the available relevant evidence and elicit knowledge gaps), researchers may adopt our thresholds of robustness. Moreover, we have proposed a threshold of robustness that is specific to the effect measure. Since different effect measures are associated with different extents of statistical heterogeneity, 31 we do not recommend using a 'common' threshold of robustness for all effect measures. The importance of sensitivity analysis and the relevance of our proposed framework are not restricted to the health field, as systematic reviews are also popular in the social sciences (e.g., Campbell reviews) and are subject to the same considerations for transparent and rigorous analysis.
Furthermore, the size and number of trials, the frequency of the outcome, and the structure of the network can affect the posterior distribution of the summary treatment effects and, by extension, the KLD measure and the RI. For instance, in a sparse network (i.e., a network with few connections between interventions, informed by few trials) or a meta-analysis with a few small trials on a rare outcome, the posterior distribution of the summary treatment effects is sensitive to the prior distribution for τ² due to the limited available evidence. 32 In this case, we advise researchers to prefer more stringent thresholds of robustness to avoid spurious conclusions about the robustness of the primary analysis results. In denser networks, researchers can use our proposed thresholds.
We have not incorporated any information on the clinical relevance of the accepted deviation that defined our proposed thresholds of robustness, which is a limitation of our study. However, by using an empirically-based distribution for τ², we have attempted to reduce the subjectivity in determining the threshold of robustness. Ideally, a clinically accepted difference in the treatment effects between the primary and sensitivity analyses should be explicitly defined in the systematic review protocol, and it should be driven by the aims of the systematic review and expert opinion on the interventions and condition under investigation. To mitigate the subjectivity of expert opinion, we advise researchers to consult several experts to elicit the clinically relevant accepted deviation in treatment effects and to calculate the weighted average of the suggested thresholds, using the years of experience as weights. 33 Researchers are also encouraged to use our proposed threshold to investigate the sensitivity of conclusions to a more or less conservative threshold (depending on whether the elicited threshold is higher or lower than the proposed one), to preclude spuriously concluding robustness.
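The suggested experience-weighted average of elicited thresholds is a simple weighted mean; a minimal sketch with hypothetical elicited values:

```python
# Hypothetical thresholds elicited from three experts, weighted by
# years of experience (all values are illustrative).
thresholds = [0.20, 0.28, 0.35]
years = [25, 10, 5]

weighted = sum(t * w for t, w in zip(thresholds, years)) / sum(years)
print(round(weighted, 3))  # 0.239
```

The weighting pulls the consensus threshold towards the opinion of the most experienced expert.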
Our proposed sensitivity analysis framework is essential for rating the certainty of the evidence concerning the risk of attrition bias via the GRADE approach. The current GRADE approach lacks objectivity in demonstrating robustness: 'to the extent that pooled estimates remain similar when making progressively more stringent assumptions (and in particular, results remain statistically significant), one would conclude that the results are robust to the missing data […]'. 34 To infer the similarity between two or more summary treatment effects, we need to quantify similarity and then employ an objective decision framework to support our conclusions. With the proposed RI, we offer a framework to infer robustness objectively. Rather than relying on location and statistical significance alone, as the GRADE approach currently encourages, we consider the whole posterior distribution of the parameter in the calculation of the index.
The present study lacks a clinician's advice on plausible missingness scenarios tailored to the motivating examples. In addition, we have considered five scenarios for the IMOR and IMDoM values with a relatively narrow range, but more options could be explored. Researchers are encouraged to consider more scenarios, or scenarios spanning a broader range, if plausible. However, the convergence time will inevitably be longer, especially in large networks of interventions. We do not recommend defining fewer than five scenarios for the missingness parameter, as the sensitivity analysis may otherwise be inadequate.
Though not a limitation per se, we have used Bayesian methods because they offer flexibility in the analysis of aggregate MOD and are popular in NMA. Our proposed RI can be applied straightforwardly in the frequentist framework, since the estimated summary treatment effect and its uncertainty are the necessary components for calculating the KLD measure and obtaining the RI. Lastly, the proposed RI is not immediately applicable to the assessment of possible inconsistency, as the threshold of robustness has been defined exclusively for summary treatment effects. We would need to define what constitutes considerable inconsistency (measured as the difference between the direct and indirect effects) to infer whether different missingness scenarios compromise the consistency assumption.

| CONCLUSIONS
Consistent with previous recommendations for the proper handling of aggregate MOD, we propose and strongly advise the following sensitivity analysis plan. As a first step, researchers should seek a clinician's opinion on: (a) the MOD scenarios that are plausible for the interventions and conditions under investigation, and (b) the threshold of robustness that is appropriate for the aims of the systematic review. Ideally, this consultation should occur during the development of the systematic review protocol. The second step entails appropriate synthesis of trials with MOD to obtain the posterior distribution of the summary treatment effects for all possible scenarios in one stage, and then calculation of the RI. The final steps of the proposed process are to compare the RI with the selected threshold to conclude on the robustness of the primary analysis results, and to present the enhanced balloon plots for the summary treatment effects of comparisons where lack of robustness was found (if any). The enhanced balloon plots can then help detect the missingness scenario that produced results deviating from the MAR assumption in terms of the magnitude, uncertainty, and conclusiveness of the estimated parameter (summary treatment effect or between-trial variance).