Degree irregularity and rank probability bias in network meta-analysis

Network meta-analysis (NMA) is a statistical technique for the comparison of treatment options. The nodes of the network are the competing treatments and edges represent comparisons of treatments in trials. Outcomes of Bayesian NMA include estimates of treatment effects, and the probabilities that each treatment is ranked best, second best and so on. How exactly network geometry affects the accuracy and precision of these outcomes is not fully understood. Here we carry out a simulation study and find that disparity in the number of trials involving different treatments leads to a systematic bias in estimated rank probabilities. This bias is associated with an increased variation in the precision of treatment effect estimates. Using ideas from the theory of complex networks, we define a measure of `degree irregularity' to quantify asymmetry in the number of studies involving each treatment. Our simulations indicate that more regular networks have more precise treatment effect estimates and smaller bias of rank probabilities. We also find that degree regularity is a better indicator of NMA quality than both the total number of studies in a network and the disparity in the number of trials per comparison. These results have implications for planning future trials. We demonstrate that choosing trials which reduce the network's irregularity can improve the precision and accuracy of NMA outcomes.


INTRODUCTION
Meta-analysis is an important statistical technique used to combine the results of multiple randomised controlled trials.Often, individual trials have small sample sizes and involve subjects taken from a reduced population.Because of this, it is desirable to systematically integrate results from different trials that address the same clinical question.Over the last four decades meta-analysis has therefore become invaluable for the comparison of treatment options 1 .
Conventional meta-analysis focuses on pairwise comparisons of treatments.More recently however, network metaanalysis (NMA) has emerged as a technique for making inferences about multiple competing treatments.NMA allows one to combine data from multiple trials even when different trials test different sets of treatment options.The term 'network meta-analysis' derives from a graphical representation of the treatments and trials.The nodes of the network graph are the different treatment options and the connecting edges represent comparisons made between the treatments in the trials.NMA combines both direct and indirect evidence for the assessment of treatments.This makes it possible to compare treatments that have not been tested together in any trial 2,3,4,1 .
Bayesian NMA in particular has undergone substantial development over recent years.At the same time, it is also recognised that further research is required to fully understand its limitations and to improve the method 5,6 .In this context, simulation studies are frequently used to evaluate the performance of NMA and the factors affecting its accuracy 7 .This approach involves setting up a model (e.g. a fixed-effect or random-effect model 8,2 ) with parameters whose numerical values can be fixed at the beginning.The model is used to produce synthetic (computer-generated) trial data 9 .Estimates of the model parameters are then obtained by feeding this synthetic data into the NMA method.The outcome of the NMA can then be compared against the known model parameters.
A key strength of this approach is the ability to systematically vary the parameters of the model.For example, different relative treatment effects can be explored, or the structure of the network of treatments and trials can be changed.This allows one to systematically investigate the performance of NMA in a range of different scenarios, and to determine the nature of any inaccuracies or biases.To do this, however, Bayesian NMAs must be carried out for many realisations of the synthetic trial data.The overall computational effort can be considerable because the NMA method relies on extensive Markov Chain Monte Carlo sampling 10,11 .
The primary outcomes of an NMA are estimates of relative treatment effects, and the corresponding credible intervals.Bayesian NMA allows one to rank treatments based on these relative effects, providing a convenient summary for clinical decision making.However, simply ranking treatments as best, second best and so on can be misleading as it does not take into account the level of overlap between credible intervals 12 .
As a consequence, a number of other metrics have been developed to compare treatments.One such metric focuses on so-called 'rank probabilities'.These quantify the degree of certainty with which each treatment is believed to be the most effective, second most effective, etc., based on the available trial data.Results are often reported in terms of so-called SUCRA values ('surface under the cumulative ranking curve').These values condense rank probabilities into a numerical summary 13 , and reflect both the magnitude and the uncertainty of treatment effect estimates 6,14,15 .
Ranking methods have attracted considerable interest in recent years 14,16,17,18,19,20 .It is generally recognised that the accuracy of ranking statistics and the treatment effect estimates are likely to be affected by the geometry of the network of treatments and trials 21,22,17,23,5 .PRISMA guidelines ('Preferred Reporting Items for Systematic Reviews and Meta-Analyses') therefore recommend that authors provide graphical and qualitative descriptions of network geometry 6 .
A previous simulation study found that the probability of being ranked first is overestimated for the treatment that is tested in the fewest studies in a given network, and underestimated for the treatment included in the most studies 24 .It has been suggested that this is due to differences in the precision of treatment effect estimates 15 .Overall it is generally accepted that reporting only the probability of being best can lead to erroneous conclusions 17,25,14 .Indeed, the current advice from the PRISMA guidelines is to report the probability that each treatment has each rank 6 .
In practice however, the most common ranking statistic in NMAs continues to be the probability that each treatment is ranked best 26 .Previous research on the utility of rank probabilities has also focused almost exclusively on the probability of being ranked best.As a result there is very limited evidence on the validity of reporting the full set of rank probabilities or the SUCRA values.Furthermore, due to a lack of appropriate data-generating models and the high computing power required to carry out Bayesian NMAs, simulation studies have been limited to fixed-effects models or networks of two-arm trials only 24,27 .Some progress been made in relating characteristics of network geometry to the outcome of NMA 22,21,28 .However, it is largely unexplored how exactly these metrics relate to the performance of ranking statistics and treatment effect estimates 5 .
The purpose of our work is to study how the structure of the network affects the probability that each treatment is ranked first, second and so on.In particular we go beyond the probability of being ranked best.We also investigate the mechanisms by which the network affects rank probabilities.Building on recent advances in data-generating methods 29 , our simulation studies include random-effects models and networks of multi-arm trials.In order to characterise network geometry we introduce a measure of asymmetry in the number of studies per treatment which we call 'degree irregularity'.The network is said to be regular if all treatments are tested in the same number of studies, and it becomes increasingly more irregular the more this number varies across treatments.Through simulations of multiple network geometries we investigate how this metric affects the precision and accuracy of the treatment effect estimates, and the quality of rank probability estimates and SUCRA values.These results provide a simple method for the identification of additional trials which best complement an existing network of evidence.
The remainder of this paper is set out as follows: In Section 2 we present the relevant background information and methods.We begin by outlining the random-effects model for a network of multi-arm trials and the Bayesian approach to NMA.Following this we define the key outcomes of NMA and the relevant ranking statistics.The design of the simulation and networks are described along with details of the data-generating models.We also introduce treatment-specific and network-level quantities that allow us to compare the quality of NMA outcomes within and between networks.Section 3 contains our main results.First, we present within-network and between-network comparisons for networks with equally effective treatments and two-arm trials.We then test how well these results generalise to scenarios in which the true treatment effects vary across treatments, and we study networks involving multi-arm trials.We also compare the results from three different data-generating models.In Section 4 we summarise and discuss our main findings.We provide an example that demonstrates how our results can be used to inform the choice of future trials.
As an example, consider a network of smoking cessation data reported by Hasselblad 30 .Four treatments are compared: 1 = no contact (control), 2 = self-help, 3 = individual counselling and 4 = group counselling.The network is shown in Figure 1 and consists of 24 trials.Trial = 2 in Hasselblad 30 compares 2 = 3 treatments such that 2,1 = 1 (no contact), 2,2 = 3 (individual counselling) and 2,3 = 4 (group counselling).Trial = 6, on the other hand, compares 6 = 2 treatments; 6,1 = 2 (self-help) and 6,2 = 3 (individual counselling).In Figure 1(a), these two trials are highlighted by dashed and dotted lines respectively.Showing all 24 individual trials in this way would result in a rather cumbersome graph.It is therefore common not to indicate trials individually in the network.Instead, two treatments are connected by an edge whenever there is at least one trial involving both treatments.In the example, this means that edges are present between all pairs of treatments.For each pair, the thickness of the link is proportional to the number of trials comparing these two treatments.This turns the graph into a 'weighted network' 31 .In our illustrations, the diameter of each node is proportional to the number of participants that have received the treatment represented by the node.For the network in Hasselblad 30 , this results in Figure 1(b).
The treatments in trial are referred to as the arms of the trial.For a given trial , the treatment in arm is administered to , patients.We assume a binary outcome, i.e. the application of the treatment to a particular patient either produces an 'event', or it does not.The number of resulting events, , , is then recorded for each trial and arm.This means that trial is defined by the treatments it compares, = ,1 , … , , , and by the number of patients in each arm, , .The trial reports dichotomous data of the form ( ,1 , ..., , , ).
The model assumes that the application of the treatment in arm of trial generates events with probability , independently for each of the , patients at the end of this trial arm 32 .As a consequence of this setup, each , is a binomial random variable, for = 1, … , and = 1, … .We use a random-effects model, i.e. , may be different from ′ , ′ in different trials ( ≠ ′ ), even if , = ′ , ′ .That is to say, the effectiveness of any fixed treatment ∈ { 1 , … , } may be different in different trials.
Our analysis focuses on relative rather than absolute treatment effects.To this end, we refer to treatment , =1 as the 'baseline' treatment of trial .We write = logit ,1 for the absolute treatment effect of this trial-specific baseline.For ≠ 1 we then define the relative treatment effect ( ), The trial-specific baseline treatment effect, , is the log odds of the outcome in arm = 1 of trial , while the relative treatment effect is the log odds ratio of treatment ≠ 1 compared to the trial-specific baseline.

Random-effects model
A contrast-based approach to the random-effects model assumes the exchangeability of relative, rather than absolute treatment effects 34,35,36 .This indicates that the relative effect of two treatments and is drawn from the same distribution for any trial involving these two treatments, no matter what value the index takes for this trial.The function of the index is therefore purely to distinguish trials, it does not contain any other a-priori information about the effectiveness of treatments in that trial.We assume that the relative treatment effects for a given trial are drawn from a multivariate normal distribution This means that the relative effect ( ) of the -th treatment in trial (compared to the baseline treatment of trial ) is drawn from a Gaussian distribution with mean ,1 , , .The latter quantity is the mean effect of treatment , relative to the baseline treatment ,1 of trial .That is to say, it is the average relative treatment effect one would see in a large sample of trials comparing these two treatments.We assume that these unknown mean relative treatment effects fulfill the consistency relations The ( − 1) × ( − 1) covariance matrix in Equation (3) describes the between-trial variance of the relative treatment effects, and their correlations.Following References 37,38,4 , we will assume that its diagonal elements are all identical.We write2 for their common value.This is the variance of each ( ).We will further assume that the covariance between any two treatment effects is 2 ∕2 (these are the off-diagonal elements of ).This ensures that the relative effect ( ) − ( ′ ) between any two treatments ≠ ′ in trial has variance 2 .
The aim of network meta-analysis is to estimate the mean treatment effects for all pairs ≠ , and the heterogeneity parameter, .Given the consistency assumption ( 4), not all are independent.As a consequence, we can use treatment =

Bayesian network meta-analysis
) for the vector of mean treatment effects relative to the global baseline.The vector contains the numbers of patients in all arms in the network and contains the trial outcomes.Bayesian NMA aims to construct posterior distributions for the model parameters, ( , ), conditional on the data, ( , ).This is achieved using appropriate likelihood functions and prior distributions 40,4 .We use non-informative prior distributions for the model parameters.Specifically, we assume independent univariate Gaussian distributions  (0, 10 4 ) for each of the parameters and .The prior for is assumed to be a uniform distribution over the interval from 0 to 5 2,41 .As described by Smith et al (1995), the joint likelihood function of all the model parameters, ({ ( )}, { }, , ), can be constructed from the product of socalled conditional 'parent-child' probability distributions 40 .In the case of the random-effects model for dichotomous data we have described, Eqs. ( 1) to (3) are used to construct a likelihood which is a product of binomial and multi-variate normal distributions.
For this setup, the posterior distributions of the model parameters can usually not be obtained analytically.We therefore rely on MCMC methods, specifically the Metropolis-in-Gibbs algorithm 42,10,43,11 .MCMC approaches work by constructing a Markov chain whose stationary distribution is the probability distribution of interest.In Bayesian NMA this is the posterior distribution of the model parameters, ( , ), given the data.
Following Kibret et al ( 2014) 24 , we used a burn-in of 5×10 3 and a thinning factor of 10 in our MCMC simulations.Samples were drawn from the posterior distributions for 2 × 10 4 iterations after burn-in.

Reporting NMA outcomes
The primary outcomes from an NMA are the final estimates of the model parameters and their uncertainty (the latter is usually indicated by a 95% credible interval).In addition, Bayesian NMA allows for the calculation of rank probabilities { ( )}.The quantity ( ) is the probability that treatment is ranked -th.At each MCMC iteration the treatments are ranked from best (rank = 1) to worst (rank = ) based on the values of 1 sampled at that iteration.After discarding the burn-in, and carrying out thinning as described, the rank probabilities are estimated from the proportion of times each treatment received each rank.
Treatment effect estimates and ranking probabilities become more difficult to interpret as the number of treatments in the network increases 17,14 .In order to simplify this information, Salanti et al (2011) introduced a numerical summary, the so-called 'surface under the cumulative ranking' curve (SUCRA) 13 .The value of SUCRA for treatment is defined as where ( ) is the probability that treatment has rank or better 13,15 , Using the definition of the expected rank, it is straightforward to see that 15 In this notation SUCRA takes values between zero and one.In practice, SUCRA values are often expressed as a percentage: if a treatment is ranked first with probability one then it will have SUCRA = 1 (or 100%), and if it ranks last with certainty, then it will have a SUCRA = 0 (or 0%) 13 .

Network design
Simulations in our work were restricted to networks with = 4 treatments.Figure 2 shows the five network geometries we have used: (a) star, (b) loop, (c) complete loop, (d) tadpole, and (e) ladder.These geometries were chosen as they are commonly observed in real-life network meta-analyses; combinations of these have been previously studied in 27,24,39,22 .
Within the constraints of these geometries, the number of studies per comparison was varied (i.e., the number of trials involving a particular pair of treatments).To describe the specific geometry of a network we use the vector of the number of studies per comparison, ), where = is the number of studies that compare treatments and .The entries of define the strengths (or weights) of the edges in the network of treatments.
We note, however, that the full setup of the treatment-trial network is not fully specified by alone.This is because the same number of comparisons per pair of treatments can be achieved by different combinations of two-arm and multi-arm trials.
From we can obtain the number of studies involving treatment , In the theory of networks this quantity is referred to as the 'weighted degree' of node 31 .We will occasionally use slightly more casual language, and refer to as the 'number of studies per treatment'.We also define the average number of studies that a treatment is involved in (the 'mean degree'), In network theory, a graph is said to be 'regular' if all nodes have the same degree 31 .With this in mind, we introduce a measure of 'degree irregularity' of the network, This quantifies the variation in the number of studies per treatment.In particular, ℎ 2 = 0 when all nodes are involved in the same number of trials ( = ̂ for all ).When we make comparisons between networks we use the normalised network irregularity, ℎ 2 ∕ ̂ 2 .We note that there is a direct mapping between ℎ 2 ∕ ̂ 2 and the so-called 'probability of inter-specific encounter index' (PIE).This index is a measure of ecological diversity, and was introduced to network meta-analysis by Salanti et al (2008) 21,22 .Further details can be found in Section S1 in the Supplementary Material 44 .
Some of the key quantities we use in our analysis are summarised in Table 1.

Simulation method
In our simulations we generate dichotomous trial data for a specified network geometry and for known model parameters ( , ).An NMA is performed for multiple independent realisations of simulated data, and the resulting estimates of the model parameters are recorded for each realisation.More specifically, we used the following numerical protocol: (1) Define the fixed parameters of the network such as the total number of studies, , the vector of number of studies per comparison, , the number of participants in each arm, , , and the true model parameter values ( , ).
(e) Determine the treatment effects with respect to the baseline 1 and use the consistency relation in Equation ( 4) to output the estimated model parameters, ̃ ( ) , for all , ∈ { 1 , … , }.Also output the estimated heterogeneity parameter, ̃ ( ) , and the bias of rank probabilities, In this equation ̃ ( ) ( ) is the probability that treatment has rank in the NMA of realisation .

Data generation for simulation studies
The relative treatment effects ( ) ( = 2, … , ) in any one trial do not uniquely define the absolute treatment effects , .This is because the treatment effect of the trial-specific baseline, ,1 , is not determined by the { ( )}.Equation ( 2) can be re-arranged to give so that ,1 together with the { ( )} ( = 2, … , ) specifies all absolute treatment effects in trial .
To fully define step (2)(b) in the above algorithm it is therefore sufficient to specify the construction of ,1 .In the context of the random-effects model and to allow for the inclusion of multi-arm trials, we use three data-generating models (DGM) based on those presented by Seide et al (2019) 29 .
The other two methods are variations of the DGM "Fixed" in Seide et al (2019) 29 which we will refer to as 'Uniform' and 'Normal' respectively.The former samples ,1 from a uniform distribution between zero and one, whilst the latter samples it from a normal distribution  (0.5, 0.2), truncated at zero at the lower end, and at one at the upper end.To ensure our results were not due to the data-generating model, all simulations were performed using each method and the results compared.

Quantities indicating and characterising NMA quality
In this section we introduce quantities that measure the quality of the parameter estimates resulting from the NMA.We begin with treatment specific values.These are used to compare how well an NMA estimates the effectiveness of each treatment within a given network.
For each pair of treatments ≠ we define the mean bias of the relative treatment effect, For each fixed treatment we can also define the treatmentspecific mean bias Similarly, the treatment-specific standard deviation of the treatment effect is defined as Bias of SUCRA values and bias of probability ranks, ⟨Δ ( )⟩, are treatment specific quantities by construction.The former can be written in terms of the latter, Next we define network-level indicators allowing comparisons of the quality of NMA outcomes between networks.We introduce the total magnitude of the bias of rank probability, and the total magnitude of the bias of SUCRA, To be able to compare numerical values for these two quantities with each other, we express these indicators as proportions of the maximum values they can take, see Section S2 in the Supplementary Material 44 for further details.Finally, we introduce the total standard deviation and total bias of treatment effects, Refer to Table 1 for a summary of some of these quantities.

RESULTS
The set of simulated networks was chosen to cover a range of values for the degree irregularity ℎ 2 ∕ ̂ 2 .It includes all network geometries in Figure 2, with varying values of ).We used = 0.1 throughout, as well as an equal number of participants per arm, , = 25, and Ω = 10 3 independent realisations of synthetic trial outcomes for any fixed set of model parameters ( , ).Error bars in our figures are typically smaller than the size of the markers.
In Sections 3.1 and 3.2 we first focus on networks with equally effective treatments, = ( 1 2 , 1 3 , 1 4 ) = (0, 0, 0).Any systematic effect observed in the outcome of NMA is therefore a result of the structure of the network only.Networks with treatments of varying effectiveness are discussed in Section 3.3.

Comparisons within networks
In Figure 3 we plot bias of rank probability against number of studies per treatment, , for a star network with = (1, 5, 15, 0, 0, 0).Similar plots for other network geometries can be found in the Supplementary Material 44 (Figures S1 to  S16).This data consistently shows that the probability to be ranked best or worst, (1) and (4) respectively, is overestimated for the treatment included in the fewest studies (lowest degree ).The probabilities (2) and (3) are underestimated.The reverse is found for the treatment included in the most studies.
The bias of rank probability for the treatments with the most and fewest studies appears to be common in all networks.We find that the bias of the remaining two treatments can be affected by the position of their respective nodes in the network.Figure 4 shows the bias of (1) for a ladder network with = (1, 0, 0, 5, 0, 15).In this example treatment 2 is included in fewer studies ( 2 = 6) than treatment 4 ( 4 = 15) but has more direct comparisons (it is directly compared to 1 and 3 whereas treatment 4 is only directly compared to 3 ).The bias on (1) for treatment = 2 is found to be more negative than that of treatment = 4 .
We conclude that disparity in the number of studies per treatment generates a trend in bias of rank probabilities.It is natural to ask if a similar trend is found for bias of treatment effect estimates.This appears not to be the case (see Figures S1 to S16 in the Supplementary Material 44 ).
Instead, the trend in Δ ( ) is associated with a systematic pattern in the standard deviation of treatment effect estimates, see Figure 5.We find that SD( ) tends to decrease with On the horizontal axis, we use the normalised number of studies, ∕ , to capture how well connected a treatment is in the network.Each network contributes four data points, one for each treatment.The figure includes the data from all irregular networks we simulated.
the number of studies treatment is involved in.The standard deviation, SD( ) , is particularly high for treatments with ∕ ≲ 0.1 and appears to flatten out for those included in a larger proportion of studies.We note, however, that Figure 5 includes data from multiple networks.On inspection of individual networks, we find a slight but consistent decrease in SD( ) as the number of studies of treatment increases (see Figures S1 to S16 in the Supplementary Material 44 ).This data suggests that bias in the rank probabilities may originate from a variation in the uncertainties of the different treatment effect estimates.A possible mechanism for this is discussed in Section 4.

Comparisons between networks
So far we have mostly compared the outcome of NMA for different treatments within a given network.In this section we make comparisons between different networks.One main observation is a positive association between the degree irregularity of a network, ℎ 2 ∕ ̂ 2 , and the total bias on rank probabilities, |Δ |.The data shown as red circles in Figure 6 demonstrate this for networks with equally effective treatments.
While we find no relationship between irregularity and total bias of relative treatment effects, |Δ |, (see Figure S21 in the Supplementary Material 44 ) Figure 7 shows that the total standard deviation of the estimates of relative treatment effects, SD, increases with ℎ 2 ∕ ̂ 2 .Therefore networks with a more homogeneous distribution of studies lead not only to lower bias of rank probabilities, but also to more precise estimates of relative treatment effects.This is also a possible explanation for the vertical spread in Figure 5. Different data points for a given value of ∕ can be from networks with varying degrees of irregularity and hence they result in different outcomes for SD( ) We find that the total bias in rank probability estimates, Δ , and the total standard deviation of treatment effect estimates, SD, are not systematically affected by the total number of studies in the network (Figures S22 and S23 in the Supplementary Material 44 ).This has implications for the planning of future studies to be added to an existing network.Naively, one may assume that adding any study to an existing network will improve the quality of results because the amount of evidence is increased.However our results suggest that, in terms of bias on rank probabilities and the precision of treatment effect estimates, this is only true if the addition of the study reduces the degree irregularity, ℎ 2 ∕ ̂ 2 , of the network.
The data shown as red circles in Figure 8 demonstrate that, for networks with equally effective treatments, network irregularity has no effect on the total bias of SUCRA values across the network.Comparing the data in Figures 6 and 8 we find that the bias of SUCRA is approximately ten times smaller than that of the rank probabilities.This is consistent with the data in Figure 3 which shows that the biases of (2) and (3) are almost the exact negative of the biases of (1) and (4).These biases cancel in calculation of SUCRA in Equation ( 5).The same reasoning also explains why, when making withinnetwork comparisons, the number of studies per treatment has no effect on the bias of SUCRA (see Figure S17 in the Supplementary Material 44 ).

Treatments of varying effectiveness
The data presented so far is for networks with equally effective treatments, = (0, 0, 0).In order to test the robustness

FIGURE 6
The effect of degree irregularity on a network's total rank probability bias for networks with equally effective treatments and non-equally effective treatments.

FIGURE 7
The effect of degree irregularity on a network's total standard deviation, SD, for networks with equally effective treatments and non-equally effective treatments.
of our findings, we now focus on a case in which the four treatments have different effectiveness.Specifically, we choose = (0.5, 1.0, 1.4), and study the same network geometries as before.Treatment 1 is now the most effective, followed by 2 , then 3 and treatment 4 is the least effective.Therefore the 'true' rank probabilities are 1 (1) = 2 (2) = 3 (3) = 4 (4) = 1.0 and all other ( ) are zero.As shown in Figure 7, the relationship between SD and degree irregularity is the same as in the case of equally effective treatments ( = (0, 0, 0)); the numerical values of SD are also found to be largely similar, there is no systematic increase or reduction in the standard deviation.

FIGURE 8
The effect of degree irregularity on a network's total SUCRA bias for networks with equally effective treatments and non-equally effective treatments.
The qualitative effect of degree irregularity, ℎ 2 ∕ ̂ 2 , on the total magnitude of the bias of rank probabilities, |Δ |, is similar compared to the case of equally effective treatments (Figure 6).For ℎ 2 ∕ ̂ 2 ≳ 0.2 we find that the bias is larger for treatments with varying effectiveness than for equally effective treatments.
In our analysis of the case = (0.5, 1.0, 1.4) we find that the standard deviations, SD( ), range from approximately 0.1 to 1.0.This means that there is significant overlap in the distributions of the estimated treatment effects.As a consequence of this, the treatments appear to be similarly effective on average.For treatments with varying effectiveness, the true rank probabilities take values of either 0 or 1, whereas for networks with equally effective treatments, all ( ) are equal to 0.25.It is therefore natural that the bias of rank probabilities is greater for networks of treatments with different effectiveness, at least when the magnitude of SD( ) is of the same order or larger than the disparity in true treatment effects.
Figure 8 shows that, in contrast to the results for networks with equally effective treatments, |ΔSUCRA| increases with ℎ 2 ∕ ̂ 2 for = (0.5, 1.0, 1.4).On inspection of the biases of rank probabilities within a given network (Figure S20 in the Supplement 44 ) we find that the relationship between rank probability bias and the number of studies per treatment is affected by the quality of the treatments that have been compared.Unlike in Figure 3 (where = (0, 0, 0)), the biases on (2) and (3) are not equal to − (1) and − (4) so there is no net cancellation of biases in the calculation of SUCRA .The total bias on rank probabilities increases for more irregular networks (higher values of ℎ 2 ∕ ̂ 2 ), and as a consequence the bias on SUCRA also increases with the irregularity of the graph.

FIGURE 9
The effect of degree irregularity on a network's total rank probability bias.Data from networks with multi-arm trials is shown as blue squares.
The data in Figures 6 to 8 indicate that reducing the network's irregularity improves the precision of treatment effect estimates and reduces bias on ranking statistics in the case of treatments with varying degrees of effectiveness.Our conclusions regarding the use of network heterogeneity for the planning of future studies are therefore also valid in this more realistic scenario.

Multi-arm trials
The results presented so far are for networks made up exclusively of two-arm trials.However, approximately 85% of network meta-analyses in the literature contain multi-arm trials 45 .We therefore test if our findings generalise to networks including multi-arm trials.We focus on complete-loop networks (Figure 2(c)) as this allows us to introduce three-arm and fourarm trials without changing the overall shape of the network (a full loop remains a full loop if further trials are added to it).The networks simulated in this section are designed specifically to cover a wide range of degree irregularities.We note that including more multi-arm trials in a network will, in general, reduce its irregularity.
For a given value of the degree irregularity, ℎ 2 ∕ ̂ 2 , we generated synthetic trial data on complete-loop networks with different combinations of two-arm, three-arm and four-arm trials.We focus on the case of equally effective treatments, and report the outcome at network-level.Treatment-specific outcomes are provided in the Supplementary Material 44 (Figures S25 to S47).In all cases, the relationship between bias of rank probability and the number of studies per treatment follows the same pattern as in Figure 3.

FIGURE 11
The effect of degree irregularity on a network's total bias on SUCRA values.Data from networks with multiarm trials is shown as blue squares.
We show |Δ |, SD and |ΔSUCRA| as a function of network irregularity in Figures 9 to 11 respectively.The data from networks involving multi-arm trials is indicated by blue squares; we include the data for networks of two-arm trials (red circles) to allow comparison.As for the case of two-arm trials, the total magnitude of the bias of rank probabilities and the total standard deviation of treatment effects increase with ℎ 2 ∕ ̂ 2 , while the bias on SUCRA is largely unaffected by network irregularity.When networks are sufficiently irregular, the presence of multi-arm trials appears to reduce SD with respect to networks consisting only of two-arm trials (Figure 10).These results show that our findings concerning both withinnetwork and between-network comparisons can be generalised to networks containing multi-arm trials.

Data-generating models
All data so far was produced using the data-generating model 'Normal' (see Sec. 2.7).We also carried out a similar analysis using data from the 'Euclidean' and 'Uniform' methods.The only difference we observe is in the magnitude of the standard deviation of treatment effects.While the relationship between network irregularity, ℎ 2 ∕ ̂ 2 , and total standard deviation, SD, was not affected by the choice of DGM, the values of SD were lowest for the 'Euclidean' method and highest for the 'Uniform' method.This is not surprising as the 'Euclidean' method restricts the range of absolute treatment effects that can be sampled and thus reduces variation in the event rates.The 'Uniform' method is the least restrictive in this sense.All other results were consistent between the three DGMs (see Figures S48 to S50 in the Supplementary Material 44 ).This demonstrates that the effects we observe are due to the network geometry, and are not specific to any data-generating model.

Bias of the heterogeneity parameter,
The data in Figure 12 shows that bias of the heterogeneity parameter, , decreases with the total number of studies in the network.This is the case irrespective of whether the treatments have uniform or varying effects ( = or ≠ ), and for networks with two-arm and multi-arm trials.The bias of is not affected by the network's irregularity ℎ 2 ∕ ̂ 2 (see Figure S51 in the Supplementary Material 44 ).Therefore adding any trial to the network improves the accuracy of the estimate of .
The true value of in Figure 12 is 0.1; we note that is considerably overestimated in all cases we tested.To understand this, it is useful to recall that characterises the variation of the relative effects between any two treatments across trials in the random-effects model.Additional randomness originates from the sampling of event numbers in each trial arm.This is the case both in real-world trial data and in simulation studies (in the latter the sampling is from the binomial distributions for the respective trial arms).Some of this sampling noise may be attributed to between-trial variability by the NMA method, leading to an overestimation of .

Variation of treatment effect uncertainty is associated with biased rank probabilities
We have carried out simulation studies of network metaanalysis in random-effects models.These simulations reveal that disparity in the number of studies different treatments are involved in can lead to variation between the standard deviations of effect estimates.This in turn appears to generate a systematic bias in estimated rank probabilities.In line with previous simulations of NMA for fixed-effects models 24 , the probability of a treatment being ranked best is overestimated for treatments included in the fewest number of studies, and underestimated for treatments which are part of a large number of studies.In addition, our study of networks with four treatments found the same trend for the probability of being Treatment effect ← − better worse − → FIGURE 13 Illustrative example of posterior distributions of treatment effect estimates for four treatments in a network meta-analysis.The posterior distributions have the same mean value but varying standard deviation.Treatment 1 has the most narrow distribution, followed by 2 , 3 and 4 which has the widest distribution.ranked last.The probability of being ranked second and third best is subject to a bias in the opposite direction.These trends correspond to an increased standard deviation of treatment effect estimates for treatments compared in a smaller number of studies.
A general connection between standard deviation of effect estimates and bias of rank probabilities has previously been recognised in Rücker et al (2015) 15 .Our work establishes further details of the mechanics leading to biased rank probabilities.We illustrate this in Figure 13, where we show a fictitious example of posterior distributions for the effectiveness of four different treatments.The four distributions have equal mean values, but varying standard deviations.The distribution of treatment 4 , which has the largest standard deviation, has higher density than the other treatments at very large and very small values of the treatment effect.This means that although the most probable value of the effect of treatment 4 is the same as for the other treatments, 4 is more likely than the other treatments to have an effect that is the largest or the smallest.Therefore treatment 4 has the highest probability of being ranked best and the highest probability of being ranked worst.Conversely, treatment 1 has the lowest standard deviation.Therefore, it is less likely to have extreme values of treatment effect and thus has a higher probability of being ranked second or third.Rücker et al (2015) 15 used a similar explanation to demonstrate that the probability that one treatment is better than another can be misleading when the posterior distributions of their effects have considerable overlap.
This stylised example demonstrates that biased rank probabilities can result if the uncertainty on some treatment effects is larger than on others.This effect is also to be expected when the distributions of treatment effects have different means, provided the differences in these means are small compared to their standard deviations.
Our analysis shows that the posterior distributions of treatment effect estimates are the most narrow for treatments included in the most studies and widest for those that have been studied the least.As a consequence, biases of rank probabilities may arise if different treatments are involved in disparate numbers of studies, i.e., for large irregularity of the network.
The simulations presented in this paper are an explicit demonstration of biases that can occur in the comparison of multiple treatments.We have also suggested how they might originate from the structure of the network of treatments and trials.Understanding the origins of bias, we think, is vital for interpreting rank probabilities in network meta-analyses, and contributes to our understanding of the NMA method and its limitations.

Planning future studies to reduce the irregularity of the network
Planning future clinical trials based on existing evidence from network meta-analysis can reduce the resources and number of participants required to obtain results of a given precision 46,47,48 .While the design of future trials based on pairwise meta-analysis has received significant attention 49,50,51 , methods using the outcome of network meta-analysis are less developed.Current approaches in this area 52,47 are computationally intensive and become increasingly laborious as the network becomes more complex.Our results show that the degree irregularity of a network, ℎ 2 ∕ ̂ 2 , can provide guidance on the choice of future trials without the need for extensive simulations.The degree of a treatment in the graph is the number of trials it is involved in, and the irregularity of a network describes how this degree varies across treatments.This is easily obtained from the network.
Irregularity is a better indicator of the quality of a network meta-analysis than the total number of studies in the network.As we have shown, networks with a more homogeneous distribution of studies between treatments have more precise treatment effect estimates and smaller bias of rank probabilities.
Degree irregularity is therefore a useful metric for working out which comparisons could be made in future studies to improve the quality of an existing NMA.For example, consider a network of four treatments with = (1, 0, 0, 19, 0, 1) and ℎ 2 ∕ ̂ 2 = 0.82 as shown on the left in Figure 14.Now imagine resources are available to add ten new two-arm studies to By simulating the original network and the three 'future' networks whilst keeping all other network characteristics constant, we compare how adding the extra ten studies affects the quality of the results.Table 2 summarises the total standard deviation and total rank probability bias of these four networks.For network (a) the quality of the NMA is approximately the same as for the original network whereas (b) and (c) show a considerable reduction in SD and |Δ |.The improvement in both quantities for network (c) is markedly greater than in network (b) even though in both cases the ten new studies were added to a comparison with no existing direct evidence.
This example demonstrates that equality in the number of studies per treatment is more important than equality in the number of studies per comparison.Choosing future studies that reduce degree irregularity may therefore help to improve the precision of treatment effect estimates and the accuracy of rank probabilities.

FIGURE 1
FIGURE 1 Graphical representation of the network of treatments and trials for smoking cessation 30 .The four treatments are: 1 = no contact (control), 2 = self-help, 3 = individual counselling and 4 = group counselling.In panel (a), trials = 2 (a 3-arm study) and = 6 (a 2-arm study) are highlighted by dashed and dotted lines respectively.The thickness of each egde in panel (b) is proportional to the number of studies that make that comparison, and the diameter each node is proportional to the number of participants who received that treatment.

FIGURE 2
FIGURE 2 Network diagrams of the five network geometries considered in this study: (a) star (b) loop (c) complete loop (d) tadpole (e) ladder.
|Δ |Total rank probability bias in the network |ΔSUCRA| Total SUCRA bias in the network (b) Using the { ( )} and one of the three data-generating models (see Section 2.7), construct the probabilities , , = 1, … , , for all trials in the network.(c) For each trial arm, generate random event data, , , from the binomial distribution in Equation (1).

FIGURE 10
FIGURE 10The effect of degree irregularity on a network's total standard deviation of treatment effect estimates.Data from networks with multi-arm trials is shown as blue squares.

FIGURE 12
FIGURE12 The effect of the total number of studies in a network on the accuracy of the estimate for the heterogeneity parameter .Panel (a) is for networks made up exclusively of two-arm trials and compares networks with equally effective and non-equally effective treatments.Panel (b) includes networks with = (0, 0, 0) only and networks with multi-arm trials are shown as blue squares.

FIGURE 14
FIGURE 14Example of three different geometries that can be created by adding a fixed number of studies to an existing network.

TABLE 1
Summary of key quantities used in our analysis.

TABLE 2
Degree heterogeneity and quality of NMA outcome for the networks in Figure14.