Network meta‐analysis results against a fictional treatment of average performance: Treatment effects and ranking metric

Network meta‐analysis (NMA) produces complex outputs as many comparisons between interventions are of interest. The estimated relative treatment effects are usually displayed in a forest plot or in a league table and several ranking metrics are calculated and presented.

The estimated relative treatment effects may be given in a forest plot against a common reference treatment or in a league table, where the names of the treatments are presented on the diagonal and each cell contains the corresponding relative treatment effect. 3 Such a table allows the simultaneous presentation of two outcomes, or of the results from pairwise and network meta-analysis, below and above the diagonal. Additionally, by-products of relative treatment effects are often presented as ranking metrics of the included treatments. Results from NMA are often used to inform health-care decision-making 4,5 and ranking metrics constitute an attempt to present such results in a coherent and understandable way.
Several ranking metrics have been proposed to present NMA results, each one answering a different question. Ranking probabilities of each treatment being at each possible rank are calculated using simulation or resampling techniques, either in a Bayesian or in a frequentist framework. Other ranking metrics include the surface under the cumulative ranking curve (SUCRA), which averages across all ranking probabilities for each treatment, and its frequentist analogue, the P-score, which is calculated analytically. 6,7 SUCRA and P-score can be interpreted as the mean extent of certainty that a treatment is better than all the other treatments. As the authors of reference 6 point out, however, "it is impossible to tell what constitutes a modest or large difference in SUCRA between two treatments, either statistically or clinically".
In this paper, we present an alternative parametrization of the NMA model and use it to develop a probabilistic ranking metric that naturally incorporates uncertainty and is a viable alternative to existing ranking metrics. In section 2, we re-parametrize the NMA model to derive treatment effects against a fictional treatment of average performance, using the deviation from means coding that has been used to parametrize categorical covariates in regression models. 8 In section 3, we use the derived treatment effects to compute the probability of each treatment being better than the "average" treatment. This ranking metric aids the interpretation of NMA results by classifying treatments as superior, equivalent or inferior to an imaginary 'average' treatment.

| Deviation from means coding in regression models
We start with a short description of the deviation from means coding in regression models, as described by Hosmer and Lemeshow. 8 This is an alternative to the most common "reference cell coding" that avoids the use of a reference level. According to the reference cell coding, a categorical independent variable with C categories is expressed through C − 1 dummy/indicator variables.
Consider, for example, that we aim to estimate the effect of a covariate with four groups on the probability of an event. We fit a logistic regression model where x = (x_1, x_2, x_3)′ are the dummy variables for the covariate and g(p(x)) is the logit link function, with p(x) denoting the probability of the event. According to the reference cell coding, the indicator variables are parametrized as shown in Table 1 and result in estimating logarithms of the odds ratios (logORs) between the categories represented by the values 0 and 1 in these indicator variables.
According to the alternative deviation from means coding, the indicator variables express effects as deviations of each category mean (here the logit of the outcome in that category) from the overall (grand) mean (here the average logit outcome over all categories), as shown in Table 1. The model results in estimating coefficients interpreted as the relative effects of each group vs the average effect across all groups. Note that the exponentials of the coefficients are not odds ratios, because the denominator is the average odds, which itself includes the odds appearing in the numerator. For further information and examples on the deviation from means coding, see reference 8.
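To make the two codings concrete, the following sketch fits both parametrizations to the same four category means and recovers the two interpretations of the coefficients. It is written in Python rather than the R used elsewhere in the paper, and the category means are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical logit-scale outcomes for a covariate with four categories.
means = np.array([0.2, 0.5, 0.9, 1.6])

# Reference cell coding: category 1 is the reference row (0, 0, 0).
X_ref = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
# Deviation from means coding: the reference row becomes (-1, -1, -1).
X_dev = np.array([[-1, -1, -1], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)

def fit(X):
    # Exact fit of intercept + coded columns to the four category means.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(A, means)

coef_ref = fit(X_ref)  # intercept = reference mean; effects vs category 1
coef_dev = fit(X_dev)  # intercept = grand mean; effects vs the grand mean
print(coef_ref, coef_dev)
```

Under the reference cell coding the coefficients are differences from category 1; under the deviation coding the intercept is the grand mean and each coefficient is a deviation from it, with the excluded category's deviation implied as minus the sum of the others.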

| Notation for the NMA model
In this section, we introduce some general notation for the NMA model. Let the entire evidence base consist of i = 1, …, n studies forming a network of treatments, denoted as k = 1, …, K. The number of treatments in study i is denoted as K_i. Index j denotes a treatment contrast. A core assumption in NMA is that of transitivity, which implies that in a network of K treatments, and subsequently K(K − 1)/2 possible relative treatment effects, only K − 1 need to be estimated and the rest are derived as linear combinations of those. 9,10 The target parameter is therefore a vector μ of K − 1 relative treatment effects μ_2, μ_3, …, μ_K, called the vector of basic parameters. 11,12 With arm-level data, we can model arm-level parameters, for example the event probability for a binary outcome, in study i and treatment arm k, denoted as y_ik. 13 A link function g(y_ik) maps the parameters of interest onto a scale ranging from minus to plus infinity, and u_i are the trial-specific baselines. For an overview of commonly used link functions in meta-analysis, see reference 13. All arm-level parameters y_ik across studies are collected in a vector y^a of length Σ_{i=1}^n K_i; the superscript a indicates that "arm-level" data are modeled. With contrast-level data we model trial-specific summaries, for example the logOR, log risk ratio, mean difference or standardized mean difference. 14 Let y_ij be the observed effect size for treatment contrast j in study i. The vector of the estimated contrasts across all studies is denoted as y^c and is of length Σ_{i=1}^n (K_i − 1). The superscript c indicates the fact that "contrast-level" data are modeled.
We will first describe the arm-level (section 2.3) and then the contrast-level (section 2.4) NMA models, using reference cell coding and the equivalent alternative deviation from the means parametrization, which allows estimation of all treatments vs a fictional treatment of average performance. Sections 2.3 and 2.4 can be read independently; that is, the reader can skip one of the two sections. Alternatively, a reader already familiar with the NMA models that use reference cell coding can skip sections 2.3.1 and 2.4.1. Table 2 can be used as a reference to the four forms of the NMA model (arm-level and contrast-level, with reference cell and deviation from the means coding), in case parts of the remainder of section 2 are skipped.
We will exemplify the models using a hypothetical network of three treatments A, B and C, examined in four studies: one comparing A and B, one comparing A and C, one comparing B and C, and one three-arm study comparing A, B and C. The target vector of basic parameters is usually taken to include the relative effects of all treatments vs an arbitrary reference, here treatment A, and hence is μ = (μ_AB, μ_AC)′. The transitivity assumption implies consistency between relative treatment effects; in particular, it holds that μ_BC = μ_AC − μ_AB. The model for study 1, comparing treatments A and B, is shown in Table 2; δ_{1,AB} denotes the random effect of study 1 for the comparison AB and τ² denotes the heterogeneity variance. It is customary to assume that heterogeneity is common across comparisons. The model is straightforwardly generalized for the other three studies (Table 2). In its general form, the NMA model using arm-based analysis can be written as

g(y^a) = Z u + X^a μ + W δ,    (1)

where u is the vector of baselines u_i of length n, which can be assumed to be either fixed and unrelated to each other, or exchangeable, drawn from a normal distribution. 15 We assume fixed and unrelated baseline effects for the remainder of this paper. Vector δ includes the study random effects δ_{i,j} and follows a multivariate normal distribution with mean zero and covariance matrix Σ. Matrix Σ is a block-diagonal between-study variance-covariance matrix of dimensions Σ_{i=1}^n (K_i − 1) × Σ_{i=1}^n (K_i − 1). The matrices Z, X^a and W are design matrices linking the vector of baselines, basic parameters and random effects, respectively, with g(y^a). The construction of these design matrices depends on the modeled arm-level parameters y_ik and is exemplified in the following example. For the example of Table 2, Equation 1 takes the corresponding matrix form; matrix X^a indicates which elements of μ are estimated by each g(y_ik). It contains one row per study arm and one column per basic parameter. The first row corresponds to treatment arm A of the first study, taking the value 0 both for μ_AB and μ_AC.
The second row indicates that μ_AB is estimated in treatment arm B of the first study. Similarly, the construction of the subsequent rows of X^a, as well as that of Z and W, is implied by the arm-level data included in each study and the corresponding elements of μ to be estimated (Table 2).
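The row-by-row construction of X^a can be sketched mechanically. The following Python fragment (the paper's own code is in R) builds the matrix for the worked three-treatment network; the list of studies and the helper `arm_row` are illustrative scaffolding, not part of the paper's notation.

```python
import numpy as np

treatments = ["A", "B", "C"]                    # A is the reference treatment
studies = [["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
basic = treatments[1:]                          # basic parameters mu_AB, mu_AC

def arm_row(t):
    # Indicator of which basic parameter mu_{A,t} the arm estimates;
    # an arm of the reference treatment A gets a row of zeros.
    return [1.0 if t == b else 0.0 for b in basic]

X_a = np.array([arm_row(t) for study in studies for t in study])
print(X_a)  # 9 rows (one per study arm), 2 columns (mu_AB, mu_AC)
```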

| Deviation from means coding
The above model in Equation 1 can be modified using the deviation from means coding. 8 The model is parametrized in such a way as to estimate the effects of each treatment vs the "average" treatment. The target parameter of this model is a vector b that includes the K − 1 parameters b_k, k = 2, …, K, which are the effects of treatment k vs the average effect over all treatments. One of the treatments (here treatment 1) is arbitrarily chosen to be excluded for identifiability. Results do not depend on the choice of this "reference" treatment.
For the deviation from means coding, the model becomes

g(y^a) = Z u + X^{a*} b + W δ,

with X^{a*} denoting the modified design matrix. The matrices Z and W remain unchanged. The new design matrix X^{a*} takes the value −1 in every column for the rows of the arbitrarily chosen treatment that is not included in vector b; all other entries are as in X^a.
Consider the example of Table 1 and the first two rows of the X^a matrix, (0 0) and (1 0), corresponding to the first study. According to the deviation from means coding as illustrated in Table 1, we choose a treatment (here treatment A) for which X^{a*} takes the value −1 for both dummy variables (both columns of the design matrix), while the row corresponding to treatment B takes 1 and 0 for the two columns, as in X^a. Thus, the respective part of the new design matrix has rows (−1 −1) and (1 0). The model for study 1 with the alternative parametrization is shown in Table 2, where the parameters b_B and b_C denote the effects of B vs the average treatment and of C vs the average treatment, respectively. The effect of A vs the average treatment is −b_B − b_C, and the relative effect of B vs A for study 1 is derived as μ_AB = b_B − (−b_B − b_C) = 2b_B + b_C. The models for all studies are given in Table 2, and the full model takes the same general form with design matrix X^{a*}. Note that the reparametrization described using the deviation from the means coding should not be confused with different parametrizations of the NMA model that produce relative treatment effects of all treatments vs each other. We present in Additional file 1 an example of the different parametrizations for specifying the means, using reference cell coding and deviation from means coding with arm-level data.
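The modification of the arm-level design matrix is purely mechanical, as the following Python sketch shows for the X^a of the worked three-treatment network (the row indices of the excluded treatment are hard-coded here for illustration):

```python
import numpy as np

# Arm-level design matrix X^a of the worked network
# (rows: study arms A,B | A,C | B,C | A,B,C; columns: mu_AB, mu_AC).
X_a = np.array([[0, 0], [1, 0],
                [0, 0], [0, 1],
                [1, 0], [0, 1],
                [0, 0], [1, 0], [0, 1]], float)
ref_rows = [0, 2, 6]   # rows belonging to the excluded treatment A

# Deviation from means coding: rows of the excluded treatment become all -1.
X_a_star = X_a.copy()
X_a_star[ref_rows, :] = -1.0
print(X_a_star[:2])  # study 1 rows: (-1, -1) and (1, 0), as in the text
```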

| Reference cell coding
In the contrast-level NMA, data from the K_i − 1 contrasts of each study are modeled. The model for study i and treatment contrast j is written as

y_ij = x_ij′ μ + δ_ij + ε_ij,

with ε_ij being the random error for study i and treatment contrast j, distributed as ε_ij ~ N(0, s²_ij), where s²_ij is the sample variance of y_ij and x_ij′ is the corresponding row of the design matrix. The random effect δ_ij is defined as in the NMA model with arm-level data. For example, for the first study the model is y_{1,AB} = μ_AB + δ_{1,AB} + ε_{1,AB} and, similarly, the models for the other studies are given in Table 2.
The contrast-based NMA model in its general form is then written as

y^c = X^c μ + δ + ε,    (3)

with the vector of random effects δ having the distribution given in the arm-level NMA model and the vector of random errors distributed as ε ~ N(0, S), where S is the block-diagonal within-study variance-covariance matrix of the same dimensions as Σ. The design matrix X^c has dimensions Σ_{i=1}^n (K_i − 1) × (K − 1). The entries in each row describe the relationship between the vector of basic parameters μ and the vector of observed contrast-level data y^c.
For example, in the illustrative network of three treatments and four studies, X^c has rows (1 0), (0 1), (−1 1), (1 0) and (0 1). The first row of the X^c matrix indicates that the first two-arm study estimates μ_AB. Note that the arm-level model using reference cell coding for study 1 implies that g(y_{1B}) − g(y_{1A}) = μ_AB + δ_{1,AB} and, consequently, the first row of X^c results as the subtraction of the second minus the first row of X^a.
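The within-study row differencing that turns an arm-level design matrix into a contrast-level one can be sketched as a small Python function (the paper's own code is in R; this version assumes each study's first listed arm is its baseline arm):

```python
import numpy as np

# Arm-level design matrix X^a of the worked network and arms per study.
X_a = np.array([[0, 0], [1, 0],
                [0, 0], [0, 1],
                [1, 0], [0, 1],
                [0, 0], [1, 0], [0, 1]], float)
arms_per_study = [2, 2, 2, 3]

def contrast_design(X_arm, n_arms):
    # Subtract each study's first-arm row from its remaining rows,
    # yielding K_i - 1 contrast rows per study.
    rows, start = [], 0
    for k in n_arms:
        block = X_arm[start:start + k]
        rows.extend(block[1:] - block[0])
        start += k
    return np.array(rows)

X_c = contrast_design(X_a, arms_per_study)
print(X_c)  # rows: (1 0), (0 1), (-1 1), (1 0), (0 1)
```

The same function applied to the deviation-coded matrix X^{a*} yields the corresponding contrast-level matrix X^{c*}.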

| Deviation from means coding
The reparametrized model differs from that presented in Equation 3 in two ways: the target parameter to be estimated, which is again the vector b of relative effects against an "average" treatment, and the design matrix X^{c*}. The matrix X^{c*} can be easily obtained from X^{a*} by subtracting its rows within each study contrast. In its general form, the model is

y^c = X^{c*} b + δ + ε.

Consider in our example the part of X^{a*} corresponding to study 1, with rows (−1 −1) and (1 0); the row of X^{c*} corresponding to that first study is then (2 1), which is the subtraction of the two rows. This is also evident considering that μ_AB = 2b_B + b_C according to the arm-based model using the deviation from means coding.
The models for studies 1 to 4 are given in Table 2. The estimated vector b̂ in the contrast-based NMA model using deviation from means coding includes the K − 1 parameters b̂_k for k = 2, …, K. The estimate of the effect of treatment k = 1, which was chosen to be excluded for identifiability, vs the average effect is given as b̂_1 = −Σ_{k=2}^K b̂_k. Note that results do not depend on the choice of the reference treatment.
Network estimates μ̂^N can be derived as linear combinations of b̂ and are equivalent to the network estimates derived using reference cell coding. Matrix Y^*, of dimensions K(K − 1)/2 × (K − 1), is constructed similarly to X^{c*} and connects b̂ with the network estimates as μ̂^N = Y^* b̂. For the worked example, Y^* has rows (2 1), (1 2) and (−1 1), corresponding to μ_AB, μ_AC and μ_BC. We can use several methods for estimating Σ, such as likelihood-based methods and an extension of the DerSimonian and Laird method. 11,16 The contrast-level NMA model can also be written as a two-stage model, as first described in references 11, 17 and 18, where results of separate pairwise meta-analyses are used instead of y^c in the model described in Equation 3. Constructing the respective design matrix follows the logic of constructing X^c, and its modification to parametrize the model using the deviation from means coding is straightforward.
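As a numerical illustration of the estimation, the following Python sketch fits the deviation-coded contrast-level model for the worked network by weighted least squares. The observed contrasts and variances are invented for illustration, and for simplicity the sketch ignores heterogeneity and the within-study covariance of the three-arm study; the paper's actual analyses are random-effects models fitted in R.

```python
import numpy as np

# Contrast-level design matrix X^{c*} for the worked 3-treatment network
# (rows: contrasts AB, AC, BC, and AB, AC from the three-arm study).
X = np.array([[2, 1], [1, 2], [-1, 1], [2, 1], [1, 2]], float)
y = np.array([0.85, 1.25, 0.30, 0.95, 1.15])   # hypothetical observed contrasts
v = np.array([0.04, 0.05, 0.04, 0.06, 0.06])   # hypothetical within-study variances

# Common-effect weighted least squares with weights 1/v (a simplification).
W = np.diag(1.0 / v)
b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (b_B, b_C) vs the average
b_A = -b.sum()                                 # effect of the excluded treatment

print(b, b_A)  # the three effects vs the average sum to zero by construction
```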

| PRETA: PROBABILITY OF A TREATMENT BEING PREFERABLE THAN THE AVERAGE TREATMENT
Applying the deviation from means coding in NMA models results in the derivation of the effects of each treatment against a fictional treatment of 'average' performance. In this section, we use the K estimated parameters b̂_k to compute the probability of each treatment being better than the average treatment. To do so, we follow similar steps as those followed by Rücker and Schwarzer, who derived the frequentist analogue of SUCRA, the P-score. 7 Intermediate to the calculation of P-scores is the derivation of the probability that treatment k is better than treatment l, calculated as

P_kl = Φ( μ̂^N_kl / sqrt(var(μ̂^N_kl)) ),

assuming that higher values represent a better outcome, where Φ denotes the standard normal cumulative distribution function. Accordingly, the probability that treatment k is better than the fictional treatment of average performance (PreTA) can be derived as

PreTA_k = Φ( b̂_k / sqrt(var(b̂_k)) ).

The range of values for PreTA_k is (0.5, 1) if b̂_k > 0, and (0, 0.5) if b̂_k < 0. As is the case with P-scores, the mean of PreTA_k across all treatments is 0.5; this means that, across all treatments, the mean extent of certainty that a treatment is better than the fictional treatment of average performance is 0.5. Alternatively, the z-score b̂_k / sqrt(var(b̂_k)) can be used to classify treatments according to their "distance" from the fictional treatment.
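Once b̂_k and its variance are available, the calculation is a one-liner; a Python sketch with hypothetical values, using only the standard library:

```python
from math import erf, sqrt

def preta(b_hat, var_b):
    # PreTA_k = Phi(b_hat_k / sqrt(var(b_hat_k))): the probability that
    # treatment k beats the fictional average treatment, assuming
    # b_hat_k is normally distributed.
    z = b_hat / sqrt(var_b)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Two treatments with the same point estimate but different precision:
# the more precise one yields a PreTA further from 0.5.
print(preta(0.3, 0.01), preta(0.3, 0.25), preta(-0.3, 0.01))
```

The example illustrates the role of precision discussed later in the paper: shrinking var(b̂_k) pushes PreTA_k away from 0.5, in whichever direction the point estimate indicates.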
Of note, the above calculations assume normality of the estimated parameters b̂_k. However, as the b̂_k are not effect sizes expressed, for example, as logORs or mean differences, using them for hypothesis testing is not meaningful. Despite that, plotting the b̂_k along with the associated 95% confidence intervals can be useful in capturing the uncertainty around the ranking produced by the relative treatment effects.

| Comparison of PreTAs with existing ranking metrics: theoretical considerations and empirical analysis
The probability of producing the best value (pBV), often called the probability of being the best, is a popular ranking metric, usually calculated as the frequency with which a particular treatment ranks first compared with the other treatment options. pBV is interpreted as the probability of producing the best outcome value in a network of interventions (eg, large effects for a beneficial outcome, or small effects for a harmful outcome). While its derivation might be sensible in some cases, we should not overlook the fact that it only takes into account one tail of the treatment effect distributions; for example, it does not account for the probability of producing a small effect on a beneficial outcome. SUCRAs and P-scores are useful summaries of the entire ranking distributions; suggested interpretations include "the average proportion of competing treatments, which produce outcome values worse than treatment k" and "the mean extent of certainty that treatment k produces better values than all other treatments". 7,19 We performed an empirical comparison of the treatment hierarchies obtained with PreTA, pBV and SUCRA, calculated using a parametric bootstrap in a frequentist framework. The agreement between ranking metrics was measured using Kendall's tau. We used a previously described database of NMAs published until 2015, including networks of four or more interventions. 4 We included networks with available outcome data in arm-level format, for which the primary outcome was analysed either as binary or as continuous, and we used the effect measure of the original review. Details about the inclusion criteria of the NMAs in the database can be found in reference 4. The empirical analysis was performed with the nmadb package in R. 20 Results of the empirical analysis are presented in section 5. In the following section, we illustrate our method in two networks of interventions, for which at least some disagreements between pBV, SUCRAs and PreTAs occur.
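The agreement measure itself is standard; a self-contained Python sketch of Kendall's tau for tie-free score vectors follows (the paper's own analysis, via the nmadb package in R, handles ties and real ranking data):

```python
def kendall_tau(x, y):
    # Kendall's rank correlation between two score vectors over the same
    # treatments: (concordant pairs - discordant pairs) / total pairs.
    # This simple version assumes no ties in either vector.
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

# Identical orderings give 1, reversed orderings give -1.
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))      # -1.0
```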

| Network of antidepressants
We illustrate the derivation of the method using as an example a recently published NMA comparing the effectiveness of antidepressants for major depression. 21 The primary efficacy outcome was response, measured as 50% or greater reduction in symptom scales between baseline and 8 weeks of follow-up, and results were presented as ORs. The authors aimed at comparing active antidepressants and considered the inclusion of both head-to-head and placebo-controlled trials. The network comprised 522 double-blind, parallel RCTs comparing 21 antidepressants or placebo. In line with previous empirical evidence, 22,23 the authors found evidence that the probability of receiving placebo decreases the overall response rate in a trial and dilutes differences between active compounds. 24 On this ground, the authors of this NMA 21 synthesized the head-to-head studies separately to estimate the relative efficacy of the active interventions. Here, we will focus on the latter network, which included 179 head-to-head studies comparing 18 antidepressants (Figure 1A).
The authors presented relative treatment effects between all pairs of the 18 antidepressants in a league table (figure 4 in reference 21). When effect sizes are used to rank treatments, selecting a reference treatment against which to draw a forest plot of NMA effects is of particular importance. Although the choice of reference does not affect the estimates obtained, the uncertainty around the NMA effects depends on the precision with which the selected reference treatment is estimated. Figure 2 shows the relative treatment effects against fluoxetine and vortioxetine, the treatments that have been studied most and least, respectively. While the results are equivalent, choosing to present one forest plot over the other might implicitly lead to different interpretations of the similarity between the drugs, based on visually inspecting the overlap of the confidence intervals. Figure 2 also shows the derived odds of each treatment vs the odds of a fictional treatment of average response, with their confidence intervals. The line of no effect is included in the graph for illustration purposes, although the exp(b̂_k) are not suited for hypothesis testing. Presenting exp(b̂_k) with their confidence intervals offers a solution to the ambiguity of selecting a reference treatment, in terms of the uncertainty around the estimates and the consequent conclusions about the similarity of treatments. This example shows that presenting the effects vs a fictional treatment of average performance in a forest plot, in addition to a league table presenting all relative effects, might be a viable option in networks with many treatments and in the absence of a "natural" reference treatment. Figure 3 shows the PreTAs for the 18 antidepressants; treatments with PreTA around 0.5 are those closest to the fictional treatment. Vortioxetine has the largest point estimate against the fictional treatment, but its estimation comes with great uncertainty.
Escitalopram vs the fictional treatment is more precisely estimated, in favor of escitalopram, and is associated with the greatest PreTA (97%). Duloxetine and milnacipran are the treatments closest to the fictional treatment. The point estimate of nefazodone vs the average treatment is slightly smaller than that of duloxetine; due to the associated uncertainty, however, there is a 34% probability that nefazodone is superior to the fictional treatment, compared to 52% for duloxetine. Fluoxetine, clomipramine, fluvoxamine, trazodone and reboxetine are among the worst treatments in the network, either because of their point estimates against the fictional treatment or because of the precision of their estimation. It should be noted that the hierarchy illustrated in Figure 3 refers only to one outcome and does not take into account more complex hierarchy questions. Table 3 summarizes the ranking metrics for the network of antidepressants; pBV, SUCRA and PreTA are presented. 6,7 Escitalopram, which is the first treatment according to PreTA, ranks second according to SUCRA and third according to pBV. The disagreement between PreTA and pBV is explained by the fact that pBV favours vortioxetine and bupropion over escitalopram because of the mass under the right tail of their treatment effect distributions.

F I G U R E 1 A, Network plot of head-to-head randomized controlled trials comparing 18 antidepressants. B, Network plot of head-to-head randomized controlled trials comparing 4 interventions for heavy menstrual bleeding. First and second generation interventions refer to endometrial destruction. Nodes and edges are unweighted.
The small disagreement between PreTA and SUCRA reflects their different interpretations: vortioxetine, ranked first according to SUCRA, beats on average a larger proportion of treatments than escitalopram does (0.90 vs 0.83), but escitalopram has a larger probability of being better than the fictional average treatment than vortioxetine does (0.93 vs 0.87). Similarly, fluoxetine ranks last according to PreTA, whereas it is followed by trazodone and reboxetine according to SUCRA. This disagreement arises from the fact that the smaller var(b̂_k) for fluoxetine leads to greater certainty that it is worse than the fictional treatment.

| Network of interventions for heavy menstrual bleeding
We use as a second example a network of interventions for the treatment of heavy menstrual bleeding. The following four interventions were compared: levonorgestrel-releasing intrauterine system (Mirena), first generation endometrial destruction, second generation endometrial destruction and hysterectomy. 25 The primary outcome was patients' dissatisfaction at 12 months, and the network included 20 studies (Figure 1B). Figure 4 shows the treatment effects of the four treatments compared to a fictional average treatment, and Figure S1 illustrates the relative position of each treatment according to its probability of being superior (green) or inferior (red) to the average treatment. There is a clear advantage of hysterectomy compared to the other three treatments, with no treatment lying close to the "average treatment area" (PreTA of 0.5).
F I G U R E 4 Odds of each treatment vs the odds of a fictional treatment of average response, exp(b̂_k), probability of each treatment being better than the average (PreTA), probability of producing the best value (pBV) and SUCRA in the network of head-to-head studies comparing four interventions for heavy menstrual bleeding. Numbers in parentheses under PreTA, pBV and SUCRA represent ranks. CI, confidence interval; PreTA, preferable than average; pBV, probability of producing the best value; SUCRA, surface under the cumulative ranking curve.

In this example, hysterectomy outperforms the other three treatments and ranks first according to all ranking metrics (PreTA: 0.99, pBV: 0.97, SUCRA: 0.99, Figure 4). Similarly, all ranking metrics agree that first generation endometrial destruction is the least preferable option (PreTA: 0.01, pBV: 0.00, SUCRA: 0.17, Figure 4). The disagreement between ranking metrics occurs for the second and third positions, between Mirena and second generation endometrial destruction. The two interventions are similar according to the point estimates, but second generation is more precisely estimated. This leads to greater certainty that second generation is worse than the average treatment compared to Mirena, resulting in a smaller PreTA (0.12). However, second generation beats on average more treatments than Mirena does, since the relative effect of second generation is larger than that of Mirena; this results in a larger SUCRA for second generation (0.47) than for Mirena (0.37).

| RESULTS OF THE EMPIRICAL ANALYSIS
We ended up with 232 networks included in the empirical analysis. There was strong agreement between the hierarchies obtained by PreTAs and SUCRAs, shown by a median Kendall's tau (in the following called "correlation") of 0.94 (interquartile range [IQR] 0.86 to 1.00). Almost half of the networks (101, 44%) had a correlation of 1, while only two networks (1%) had a correlation of less than 0.6. The network with the smallest correlation (0.4) is shown in Figure S2 26; it is a network of five treatments, where four of them have similar treatment effects compared to the fifth one. Thus, uncertainty in the produced treatment hierarchy is high and results in disagreement between the PreTA and SUCRA rankings. The agreement between PreTAs and pBV was lower, with a median correlation of 0.74 (IQR 0.61 to 0.89) and 49 networks (21%) having a correlation of less than 0.6 (Figure S3).
As with all ranking metrics, any disagreements between PreTAs and pBV or SUCRAs are attributed to the different ways these metrics incorporate uncertainty in the estimation. Among treatments with similar point estimates, pBV favors treatments associated with greater uncertainty, as the tail of the distribution of treatments with uncertain effects is heavier than that of treatments with similar point estimates but high precision. The probability P_kl tends to 0.5 as var(μ̂^N_kl) increases; consequently, the greater the uncertainty associated with a treatment, the more its P-score tends to 0.5. A recent empirical analysis investigates the role of uncertainty in the agreement between ranking metrics, and a research paper describing theoretically the interpretation and the role of uncertainty in the various ranking metrics is in preparation. 19,27

| DISCUSSION

In this paper, we derived the relative treatment effects of all treatments vs a fictional treatment of average performance. To that aim, we applied the alternative deviation from means coding to the construction of design matrices in NMA models. The application of the resulting coefficients is twofold. First, they can be used to conveniently present NMA results in large networks without an obvious reference treatment. Such a presentation would by no means substitute the presentation of a league table, or any other way of presenting all NMA relative treatment effects, in the main manuscript or in the appendix of an NMA application; it may only serve as a complementary presentational tool for a quick grasp of the evidence. Second, we developed a new ranking metric, PreTA, interpreted as the probability of each treatment being preferable than a fictional treatment of average performance. PreTAs can be produced in all NMAs as long as the eligibility of treatments is well justified.
The notion of the average treatment refers to the average absolute efficacy among the treatments included in the systematic review. Thus, as with all ranking metrics, the interpretation of PreTAs is subject to the set of treatments compared.
The usefulness of the interpretation of the b̂_k coefficients depends on whether the notion of an 'average' treatment makes sense. This challenge in interpreting the coefficients, and subsequently PreTA, may however be less pronounced in NMA than in other applications of regression models, because for most categorical explanatory variables an average category is meaningless. An "average" category of, for example, sex or ethnicity is impossible to have and difficult to interpret, and this is likely the reason that the deviation from means coding is very rarely used in practice. In NMA, however, treatment effects are distributed on a continuous scale, and therefore the average treatment effect is a possible value that in theory a treatment could take. A further limitation of our method is that researchers may be inclined to use hypothesis testing when interpreting the b̂_k coefficients, which is not suitable. Moreover, the coloring of Figure 3 and Figure S1 may lead to overinterpretation of the treatment hierarchy based on the dichotomy of being better or worse than the fictional average treatment. It should be noted that being better or worse than the average treatment does not necessarily mean that a treatment is good or bad; treatments may be more or less similar to each other, and the entire distributions of the treatment effects are the only way to obtain all the information about all possible comparisons.
In the presence of a reference treatment, for example placebo, a simple and intuitive non-probabilistic ranking metric can be obtained by ranking all relative effects against placebo. Authors of NMAs often present estimated treatment effects against placebo or standard care in a forest plot, providing implicitly or explicitly a treatment hierarchy. While such a hierarchy might be appropriate in many settings, it assumes that treatment effects against placebo are of primary interest for the analysis. This might not be the case in healthcare areas where one or more established therapies exist 28 or where researchers are concerned about the quality of the evidence from placebo-controlled studies 29-31 and choose to, exclusively or in addition, analyse a network without placebo. Moreover, it should be taken into account that the amount of data associated with the reference treatment might affect judgements about the similarity of treatments when such judgements are made by visually inspecting a forest plot of NMA effects. Point estimates against the fictional average treatment provide a solution to this ambiguity.
Alternative methods to avoid reference group coding have been suggested in the literature. The application of quasi-variances, 32 independently proposed as "floating absolute risks" in epidemiology, 33 does avoid setting a reference group. However, the scope of their use pertains to approximating a set of variances for the model contrasts such that the variance of any linear combination of contrasts can be derived without access to the covariance matrix. 34 Thus, quasi-variance approaches target a different problem from the model described in this paper, and the relevance of the estimated quantities to NMA is not clear.
Producing a treatment hierarchy in NMA is popular, with 43% of published NMAs presenting at least one ranking metric, 4 but also debated. Recent developments tackle common criticisms against ranking metrics, pertaining to arguments that they are unstable, 35,36 uncertain, 37 do not differentiate between clinically important and unimportant differences, 2,38 do not account for multiple outcomes 39 and are not accompanied by a measure of uncertainty. 40 In particular, recent developments include extensions of P-scores for two or more outcomes, 41 incorporation of clinically important values in their calculation, 41 application of multiple-criteria decision analysis 42 and partial ordering of interventions according to multiple outcomes. 43 PreTAs can be easily extended to incorporate clinically important values, as shown in reference 41; such probabilities would then be interpreted as the probability of a treatment being better than the average by at least a certain value.
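Under the same normality assumption used for PreTA, the threshold extension amounts to shifting the numerator of the z-score by a clinically important value c. The Python sketch below is a hypothetical illustration of this idea, not the implementation of reference 41:

```python
from math import erf, sqrt

def preta_at_least(b_hat, var_b, c=0.0):
    # Probability that a treatment beats the fictional average treatment
    # by at least c on the analysis scale (c = 0 recovers plain PreTA),
    # assuming normality of b_hat.
    z = (b_hat - c) / sqrt(var_b)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(preta_at_least(0.5, 0.25))       # plain PreTA
print(preta_at_least(0.5, 0.25, 0.5))  # P(better than average by >= 0.5)
```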
PreTA is a viable alternative to existing ranking metrics that can be interpreted as a probability and takes into account the entire ranking distribution. As is also the case with PreTA, all existing ranking metrics use the distribution of the NMA treatment effects to produce a hierarchy of the treatments. This hierarchy can be based either on probabilities, such as "what is the probability that each treatment produces the best outcome value" or "what is the probability of treatment A beating treatment B", or on summaries of these probabilities. Rankograms visualise the entire ranking distributions for each treatment, and SUCRAs, P-scores and mean ranks summarise these probabilities in a single number for each treatment. The interpretation of these summaries is, however, not always straightforward. The development of PreTAs enriches the decision-making arsenal with a presentational and ranking tool that can be interpreted in a clinically meaningful way.

CONFLICTS OF INTEREST
TAF reports personal fees from Mitsubishi-Tanabe, MSD and Shionogi and a grant from Mitsubishi-Tanabe, outside the submitted work; TAF has a patent 2018-177 688 pending.

AUTHORS' CONTRIBUTIONS
AN conceived the idea, contributed to the modelling, produced the results and wrote the R code and the first draft of the manuscript. VC contributed to the analysis. TP contributed to the modelling and to the R code. DM, TAF and GS contributed to the modelling, reviewed the R code and contributed to the writing. All authors read and approved the final manuscript.

DATA AVAILABILITY STATEMENT
Outcome data and the code for applying our methods are available at https://github.com/esmispm-unibe-ch/alternativenma.