Identifying inconsistency in network meta‐analysis: Is the net heat plot a reliable method?

One of the biggest challenges for network meta‐analysis is inconsistency, which occurs when the direct and indirect evidence conflict. Inconsistency causes problems for the estimation and interpretation of treatment effects and treatment contrasts. Krahn and colleagues proposed the net heat approach as a graphical tool for identifying and locating inconsistency within a network of randomized controlled trials. For networks with a treatment loop, the net heat plot displays statistics calculated by temporarily removing each design one at a time, in turn, and assessing the contribution of each remaining design to the inconsistency. The net heat plot takes the form of a matrix which is displayed graphically with coloring indicating the degree of inconsistency in the network. Applied to a network of individual participant data assessing overall survival in 7531 patients with lung cancer, we were surprised to find no evidence of important inconsistency from the net heat approach; this contradicted other approaches for assessing inconsistency such as the Bucher approach, Cochran's Q statistic, node‐splitting, and the inconsistency parameter approach, which all suggested evidence of inconsistency within the network at the 5% level. Further theoretical work shows that the calculations underlying the net heat plot constitute an arbitrary weighting of the direct and indirect evidence which may be misleading. We illustrate this further using a simulation study and a network meta‐analysis of 10 treatments for diabetes. We conclude that the net heat plot does not reliably signal inconsistency or identify designs that cause inconsistency.


INTRODUCTION
Network meta-analysis (NMA) is an extension of pairwise meta-analysis methods that combines direct and indirect evidence from a network of trials to calculate a treatment effect for every treatment comparison within a single statistical model. A key assumption of NMA is the consistency of direct and indirect evidence. Consistency equations were first set out by Higgins and Whitehead, 1 who showed that the relative effects of different treatments could be jointly estimated by "borrowing strength" from direct comparisons to inform indirect comparisons. Inconsistency in NMA occurs when the direct and indirect evidence are not in agreement with each other. This can result in biased treatment effect estimates. Inconsistency within a network may arise when bias in direct comparisons (for example, optimism bias, publication bias, or sponsorship bias) acts differently in different comparisons or when treatment effect modifiers are distributed differently in different comparisons. 2,3 The power of tests for inconsistency is generally low because indirect evidence is typically a relatively weak component of most treatment estimates in NMA. Failure to reject the null hypothesis of no inconsistency does not mean that the entire network is consistent. 4 Nevertheless, the increasing use of NMA in health decision modeling means that it is important that attempts are made to identify, understand, and where appropriate, adjust for inconsistency.
As is typical in the NMA literature, throughout this paper, "design" will refer to the treatments being compared within a trial. 5 For example, two trials both comparing treatment A to treatment B will be considered to be of the same design, whereas a third trial comparing treatment A to treatment B and treatment C will be considered to be of a different design. For a full review of NMA methods, see Salanti 6 and Efthimiou et al. 7 There are several approaches for assessing inconsistency in a network; in particular, we take a closer look at Cochran's Q statistic, 8 the loop inconsistency approach, 9 the inconsistency parameter approach, 10 node-splitting, 11 and the net heat approach. 12 Between them, these five methods offer a range of increasingly complex methods for identifying inconsistency in a network. Cochran's Q statistic 8 and the loop inconsistency approach of Bucher 9 are relatively simple methods that aim to identify inconsistency through one test statistic and a p-value. Both the inconsistency parameter approach of Lu and Ades 10 and node-splitting 11 allow for inconsistency in a Bayesian hierarchical model, which allows the amount of inconsistency to be quantified and a credible interval calculated. Krahn et al 12 also use a modeling approach; however, the results are displayed graphically as a net heat plot, with the aim of allowing inconsistency to be identified, and are not linked to a statistical test.
Cochran's Q statistic 8 is a common method for assessing heterogeneity in a meta-analysis. The generalized Cochran's Q statistic for multivariate meta-analyses 13 can be used in the context of NMA to quantify heterogeneity across the whole network, both within trial designs and between trial designs (the latter is known as inconsistency).
Bucher 9 developed a method for assessing loop inconsistency in loops of three treatments within a network consisting of two-arm trials only. The approach involves calculating the difference between the direct and indirect evidence for a treatment comparison and testing it against the null hypothesis of consistency by referring the test statistic to the normal distribution. However, in a large network where each treatment loop is considered one at a time, multiple testing must be taken into account, and this approach can be both cumbersome and time consuming. 14,15 One of the most popular models to account for inconsistency in a network is the Bayesian hierarchical model of Lu and Ades. 10 This model is a generalization of the Bucher approach and relaxes the consistency assumption by including an inconsistency parameter in each loop in which inconsistency could occur. These additional inconsistency parameters can be fitted as fixed or random effects. Models with and without inconsistency parameters are then compared to assess whether a network is consistent and the analyst must make an arbitrary choice about this. However, in the presence of multi-arm trials, this approach depends on the order of treatments.
Cochran's Q statistic, 8 the loop inconsistency approach, 9 and the inconsistency parameter approach 10 all provide a global assessment of inconsistency in a network; however, local methods for assessing inconsistency are also needed in order to identify which treatment comparisons are driving the inconsistency. 11 Dias et al 11 first proposed comparison-specific assessment of inconsistency using node-splitting. Node-splitting involves separating out the evidence for a particular treatment comparison into the direct and indirect evidence and assessing the discrepancy between them, one treatment comparison at a time. 11 Node-splitting can be considered equivalent to the inconsistency parameter approach of Lu and Ades if all the treatment nodes are split at the same time so that separate treatment effects are estimated for each treatment comparison without assuming consistency over any set of trials. 11 To aid the identification of inconsistency within a network, Krahn et al 12 developed a method, known as the net heat plot, which could be used as a visual aid for locating and identifying any inconsistency within a network of randomized controlled trials (RCTs). The net heat plot uses Cochran's Q statistic in a fixed effect framework and decomposes it into within-trial heterogeneity and inconsistency. The net heat plot is constructed by temporarily removing each design one at a time and assessing the contribution of each design to the inconsistency of the whole network. The difference between the inconsistency in the network before the temporary removal of each design and the inconsistency that remains following the temporary removal of each design, known as Q diff , is displayed graphically in the form of a matrix. The net heat plot is then colored so that the coloring of each square indicates designs which increase or decrease inconsistency within the network.
Cochran's Q statistic, the loop inconsistency approach, the inconsistency parameter approach, and node-splitting all use formal statistical tests to draw conclusions about possible inconsistency in a network. In contrast, Q diff (the difference between two Q statistics, which themselves follow chi-squared distributions) has a nonstandard distribution and is therefore much harder to interpret. The coloring of the net heat plot is driven by Q diff , and it is unclear what value of Q diff constitutes statistically significant or clinically meaningful inconsistency.
In this paper, we take a closer look at the net heat plot and highlight some previously unremarked limitations of this approach. In Section 2, we introduce two networks of trials in lung cancer and diabetes and assess the possibility of inconsistency using a visual approach. In Section 3, we consider five methods for assessing inconsistency in NMA: Cochran's Q statistic, 8 the loop inconsistency approach, 9 the inconsistency parameter approach, 10 node-splitting, 11 and the net heat plot. 12 In Section 4, we derive algebraic expressions for the elements of the net heat plot in terms of direct treatment estimates and interpret them with the aid of numerical simulations in Section 5. In Section 6, we apply the five methods of assessing inconsistency to the lung cancer and diabetes networks before offering a conceptual critique in Section 7. In Section 8, we finish with a discussion.

DATASETS
In this section, we introduce two datasets to which we will apply methods for assessing inconsistency in NMA. We first introduce a simple three-treatment network for lung cancer (to illustrate the underlying arguments) and secondly a more complex network of 10 treatments for diabetes.

Lung cancer network
For our first network, we consider the simplest network structure possible: one treatment loop consisting of three treatments without multiarm trials. The data for this network come from three meta-analyses of RCTs in lung cancer performed by the Non-Small-Cell Lung Cancer Collaborative Group. These data were obtained from Gustave-Roussy (GR), Paris. The three meta-analyses considered three different treatments: radiotherapy (RT), radiotherapy plus sequential chemotherapy (Seq CT), and radiotherapy plus concomitant chemotherapy (Con CT) using three different designs: RT v Seq CT, RT v Con CT, and Seq CT v Con CT ( Figure 1).
The meta-analysis (MA) of RT and Seq CT was published in 1995 and included 3033 patients from 22 RCTs. 16 The current dataset was updated by GR to include some newer trials and exclude some trials using older forms of chemotherapy. This comparison now includes a total of 21 RCTs and 3387 patients. The MA of RT and Con CT was published in the work of Auperin et al 17 and included 1764 patients from 9 RCTs. This MA was also updated by GR to include a total of 16 trials and 2969 patients. The MA of Seq CT and Con CT was published in 2010 and included 6 RCTs and 1205 patients. 18 One multiarm trial (45 patients) comparing all three treatments was excluded from the network for the analyses in this paper in order to obtain the simplest network structure possible for a network meta-analysis. In total, overall survival data was available for 7531 patients from 42 RCTs. A list of all RCTs is provided in Appendix A (supplementary material).
The lung cancer network forms one treatment loop, so there can only be one inconsistency source. It provides a simple yet revealing starting point for assessing the net heat plot. To visually assess the agreement between the direct and indirect evidence within the lung cancer network, before any formal statistical models were fitted, the treatment effects for all pairwise comparisons were estimated in a number of ways. Network estimates combining both direct and indirect treatment effects were obtained by fitting a one-step IPD NMA Royston-Parmar model for time-to-event data 19,20 using a Bayesian approach and by fitting a two-step NMA using the R package netmeta. 21 An estimate of the direct evidence was obtained by fitting the one-step IPD Royston-Parmar MA model to trials directly comparing the treatments of interest only. Indirect treatment effects were also calculated using the one-step IPD Royston-Parmar MA model, where all trials directly comparing the two treatments of interest were excluded from the model. Throughout this paper, all models are fitted with fixed effects assuming no heterogeneity in any of the direct comparisons to simplify calculations in later sections of the paper.
FIGURE 1 Lung cancer network diagram. The node size is weighted according to the number of patients randomized to each treatment, and the line thickness is weighted according to the number of studies involved in each direct comparison. Key to treatments: Con CT, radiotherapy plus concomitant chemotherapy; Pts, patients; RCTs, randomized clinical trials; RT, radiotherapy; Seq CT, radiotherapy plus sequential chemotherapy
FIGURE 2 Forest plot of various analyses of the lung cancer data. All models were fitted with fixed effects. Key to treatments: Con CT, concomitant chemotherapy; CrI, credible interval (except netmeta models where confidence intervals are presented); IP, inconsistency parameter; NMA, network meta-analysis; RT, radiotherapy; Seq CT, sequential chemotherapy
In the Bayesian estimation of the Royston-Parmar model, parameters representing the spline function for the baseline log cumulative hazard function and treatment effects were fitted with noninformative normal prior distributions. Figure 2 presents the forest plot of treatment effects for each pairwise comparison, using the methods described above and including the results of the inconsistency parameter approach, described below in Section 3.3. The forest plot clearly shows a difference between the direct and indirect evidence for each pairwise comparison.

Diabetes network
The diabetes network contains multiple treatment loops and provides a more challenging example for assessing inconsistency. To visually assess the agreement between the direct and indirect evidence within the diabetes network, we fitted a two-step NMA using the R package netmeta 21 and obtained estimates of the direct and indirect evidence from node-splitting.

METHODS FOR ASSESSING INCONSISTENCY IN NMA
In this section, we describe five methods for assessing inconsistency in NMA.

Cochran's Q statistic
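Cochran's Q statistic can be used to assess heterogeneity within a network. The overall Q statistic from the fixed effect NMA model can be decomposed into within-design heterogeneity ($Q^{\mathrm{het}}$) and between-design heterogeneity, which is termed design inconsistency ($Q^{\mathrm{inc}}$). Let $\hat\theta_{ic}$ be the treatment effect estimate from trial i for the comparison of treatments in design c with corresponding standard error $\hat\sigma_{ic}$, where there are $i = 1, \dots, n_c$ trials of design c. Let $\hat\theta_c$ be the treatment effect from the direct evidence for design c only with corresponding standard error $\hat\sigma_c$, and let $\hat\theta^{N}_c$ be the network estimate of the treatment effect for design c; then,
$$Q = \sum_c \sum_{i=1}^{n_c} \frac{(\hat\theta_{ic} - \hat\theta^{N}_c)^2}{\hat\sigma_{ic}^2}, \qquad Q^{\mathrm{het}} = \sum_c \sum_{i=1}^{n_c} \frac{(\hat\theta_{ic} - \hat\theta_c)^2}{\hat\sigma_{ic}^2}, \qquad Q^{\mathrm{inc}} = \sum_c \frac{(\hat\theta_c - \hat\theta^{N}_c)^2}{\hat\sigma_c^2},$$
so that $Q = Q^{\mathrm{het}} + Q^{\mathrm{inc}}$. For multiarm studies, $\hat\theta_{ic}$ is a vector with variance $S_{ic}$, and these formulae are extended by replacing each squared standardized difference with the corresponding quadratic form, for example $Q^{\mathrm{het}} = \sum_c \sum_{i=1}^{n_c} (\hat\theta_{ic} - \hat\theta_c)^{T} S_{ic}^{-1} (\hat\theta_{ic} - \hat\theta_c)$.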
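The decomposition of Q into within-design and between-design parts can be checked numerically. The following is a minimal Python sketch, assuming a single loop of three two-arm designs (AB, AC, BC) and fixed effects; the function names and the trial-level numbers are hypothetical and are not taken from the lung cancer or diabetes data. It relies on the fact that, in a single loop of two-arm trials, the network estimate for a design is the inverse-variance weighted average of its direct and indirect estimates.

```python
def pooled(estimates, variances):
    """Fixed-effect inverse-variance pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, estimates)) / total
    return estimate, 1.0 / total


def q_decomposition(trials):
    """Return (Q, Q_het, Q_inc) for a three-treatment loop.

    trials maps each design ('AB', 'AC', 'BC') to a list of
    (estimate, variance) pairs, one pair per two-arm trial.
    'AB' is B versus A, 'AC' is C versus A, 'BC' is C versus B.
    """
    direct = {c: pooled([y for y, _ in t], [v for _, v in t])
              for c, t in trials.items()}
    (ab, v_ab), (ac, v_ac), (bc, v_bc) = direct["AB"], direct["AC"], direct["BC"]
    # Indirect estimate of each design from the other two sides of the loop.
    indirect = {
        "AB": (ac - bc, v_ac + v_bc),
        "AC": (ab + bc, v_ab + v_bc),
        "BC": (ac - ab, v_ac + v_ab),
    }
    # Network estimate: weighted average of direct and indirect evidence.
    network = {
        c: pooled([direct[c][0], indirect[c][0]],
                  [direct[c][1], indirect[c][1]])[0]
        for c in direct
    }
    q_total = sum((y - network[c]) ** 2 / v
                  for c, t in trials.items() for y, v in t)
    q_het = sum((y - direct[c][0]) ** 2 / v
                for c, t in trials.items() for y, v in t)
    q_inc = sum((direct[c][0] - network[c]) ** 2 / direct[c][1]
                for c in direct)
    return q_total, q_het, q_inc


# Hypothetical trial-level log hazard ratios and variances.
example_trials = {
    "AB": [(-0.15, 0.010), (-0.05, 0.020), (-0.20, 0.015)],
    "AC": [(-0.35, 0.012), (-0.25, 0.018)],
    "BC": [(0.05, 0.030), (-0.02, 0.025)],
}

q, q_het, q_inc = q_decomposition(example_trials)
print(f"Q = {q:.3f}, Q_het = {q_het:.3f}, Q_inc = {q_inc:.3f}")
```

In this single-loop setting the between-design part reduces to the squared difference between direct and indirect evidence divided by the sum of the three direct variances, which is why Cochran's Q and the loop inconsistency approach agree when there are no multiarm trials.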

Loop inconsistency
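From now on, throughout this paper, we use the shorthand dir to represent direct evidence, ind to represent indirect evidence, and net to represent network evidence (ie, the combination of the direct and indirect evidence). In a loop of three treatments A, B, and C, we compare the direct evidence of treatment C versus treatment A, $\hat\theta^{\mathrm{dir}}_{AC}$, to the indirect evidence, $\hat\theta^{\mathrm{ind}}_{AC}$, where $\hat\theta^{\mathrm{ind}}_{AC} = \hat\theta^{\mathrm{dir}}_{AB} + \hat\theta^{\mathrm{dir}}_{BC}$ and $\mathrm{Var}(\hat\theta^{\mathrm{ind}}_{AC}) = \mathrm{Var}(\hat\theta^{\mathrm{dir}}_{AB}) + \mathrm{Var}(\hat\theta^{\mathrm{dir}}_{BC})$. Following the method of Bucher, 9 estimates of the inconsistency parameter, $\hat\omega_{AC}$, and its variance can be formed, within a loop, by subtracting the direct and indirect estimates:
$$\hat\omega_{AC} = \hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{ind}}_{AC}, \qquad \mathrm{Var}(\hat\omega_{AC}) = \mathrm{Var}(\hat\theta^{\mathrm{dir}}_{AC}) + \mathrm{Var}(\hat\theta^{\mathrm{ind}}_{AC}). \quad (1)$$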
An approximate test of the null hypothesis of consistency is conducted by referring the test statistic to the normal distribution.
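The loop inconsistency calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the inconsistency parameter is taken as direct minus indirect evidence for the AC comparison and that the test statistic is the estimate divided by its standard error; the function name and the input values are hypothetical.

```python
import math


def bucher_test(dir_ac, var_ac, dir_ab, var_ab, dir_bc, var_bc):
    """Loop inconsistency for the AC comparison in an A-B-C loop.

    Returns the inconsistency estimate, its variance, the z statistic,
    and a two-sided p-value from the standard normal distribution.
    """
    ind_ac = dir_ab + dir_bc            # indirect estimate via B
    var_ind = var_ab + var_bc           # variances add along the path
    omega = dir_ac - ind_ac             # direct minus indirect
    var_omega = var_ac + var_ind
    z = omega / math.sqrt(var_omega)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return omega, var_omega, z, p_value


# Hypothetical direct estimates and variances on the log hazard ratio scale.
omega, var_omega, z, p_value = bucher_test(0.5, 0.01, 0.2, 0.01, 0.1, 0.02)
print(f"omega = {omega:.3f}, z = {z:.3f}, p = {p_value:.3f}")
```

In a large network this function would have to be applied to each loop in turn, which is where the multiple testing concern noted in the Introduction arises.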

Inconsistency parameter approach
The inconsistency parameter approach of Lu and Ades 10 involves adding an extra parameter (the inconsistency parameter) to each treatment loop within a network to assess inconsistency and estimate both the direct and indirect evidence simultaneously. This allows estimates of the direct and indirect information to be obtained for each comparison within the treatment loop. In a network containing one three-treatment loop between treatments A, B, and C, let $\omega_{ABC}$ represent the inconsistency parameter for this loop. For example, under the Royston-Parmar model for time-to-event outcomes, the log cumulative hazard for patient i in trial j is modeled in terms of $s_j(\ln(t))$, the restricted cubic spline modeling the baseline log cumulative hazard for trial j; $\mathrm{trt1}_{ij}$ and $\mathrm{trt2}_{ij}$, the treatment indicator variables; $\beta_1$ and $\beta_2$, the treatment effect estimates for $\mathrm{trt1}_{ij}$ and $\mathrm{trt2}_{ij}$ compared to the reference treatment, respectively; and the inconsistency parameter $\omega_{ABC}$.

Node-splitting
Node-splitting compares a model where the consistency assumption is relaxed for one treatment comparison to the model assuming consistency across the entire network to highlight inconsistent treatment comparisons within the network. Each treatment comparison is considered separately and one at a time for evidence of possible inconsistency. Node-splitting can be implemented using the "network sidesplit all" command 23 in Stata, 24 which reports the treatment effects from the direct and indirect evidence together with their difference and a test of whether the true difference is equal to zero for each treatment comparison. 23

Net heat plot
In 2013, Krahn et al 12 introduced the net heat plot as a method for identifying and locating inconsistency within a network of RCTs. In a network of RCTs with at least one treatment loop, the net heat plot is constructed by temporarily removing (also referred to as detaching) each design one at a time and assessing the contribution of each design to the inconsistency of the whole network. Krahn et al 12 propose the use of a design-by-treatment interaction approach, whereby the consistency assumption for one of the treatment loops is relaxed so that the remaining inconsistency across the network can be calculated. In practice, this is computationally simple because it is equivalent to a "leave one out" approach in which $Q^{\mathrm{inc}}$ is simply recalculated from scratch after the (temporary) removal of each design in turn (which is equivalent to removing each loop in turn, assuming each design features in only one loop). Designs that do not contribute to a treatment loop, or that would split the network into two distinct parts if removed, are excluded from the net heat plot.
In an NMA model, the design matrix contains the structure of the network at the study level and links the observed treatment effects with the treatment contrast parameters. To detach design d, we add to the design matrix additional columns. The number of columns to add is equal to the number of treatments in design d minus 1. Thus, when design d includes two treatments, one column is added, consisting of a "1" in the row corresponding to the design, which is being detached and "0" elsewhere (this is analogous to perfectly fitting an observation in a regression by including a dummy variable for just that observation). The treatment effects for each comparison in the network are then recalculated using this new design matrix, and the inconsistency in the network when design d is detached is thus calculated.
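The effect of adding a detachment column can be illustrated with a small weighted least squares fit. The following Python sketch makes several simplifying assumptions that are not in the text: it uses one aggregated row per design rather than one row per study, a three-treatment loop with basic parameters for B versus A and C versus A, and hypothetical direct estimates and variances. The helper names are also hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def wls_fitted(X, y, variances):
    """Fitted values from inverse-variance weighted least squares."""
    n, p = len(y), len(X[0])
    w = [1.0 / v for v in variances]
    xtwx = [[sum(w[i] * X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(w[i] * X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtwx, xtwy)
    return [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]


def detach(X, row):
    """Append a dummy column with 1 in the detached design's row, 0 elsewhere."""
    return [r + [1.0 if i == row else 0.0] for i, r in enumerate(X)]


# Rows: designs AB, AC, BC. Columns: effect of B vs A, effect of C vs A.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
y = [-0.10, -0.30, 0.00]            # hypothetical direct estimates
variances = [0.004, 0.005, 0.010]   # hypothetical variances

fitted_consistent = wls_fitted(X, y, variances)
fitted_bc_detached = wls_fitted(detach(X, 2), y, variances)
print("consistency model:", [round(f, 4) for f in fitted_consistent])
print("BC detached:      ", [round(f, 4) for f in fitted_bc_detached])
```

With a single loop, detaching any one design leaves as many parameters as design-level observations, so every design is fitted exactly and no inconsistency remains. In networks with several loops, only the inconsistency routed through the detached design is removed.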
The between-design inconsistency statistic, $Q^{\mathrm{inc}}$, is the part of the total heterogeneity in the network that is not explained by heterogeneity within designs. Let $Q^{\mathrm{inc}}_c$ represent the inconsistency in the network for design c before any designs are detached, and let $Q^{\mathrm{inc}}_{c(d)}$ represent the inconsistency for design c after design d is detached. Their difference,
$$Q^{\mathrm{diff}}_{c,d} = Q^{\mathrm{inc}}_c - Q^{\mathrm{inc}}_{c(d)},$$
is the change in inconsistency for design c when design d is detached. The values of $Q^{\mathrm{diff}}_{c,d}$ form the basis of the net heat plot. The net heat plot is constructed as a matrix in which each off-diagonal square is $Q^{\mathrm{diff}}_{c,d}$, representing the contribution of the row design (c) to the total inconsistency across the network when the column design (d) is detached (ie, the consistency assumption is relaxed for the column design). The leading diagonal, running from the top left to the bottom right corner, displays the contribution of each design c, $Q^{\mathrm{inc}}_c$, to the between-design statistic, $Q^{\mathrm{inc}}$.
Moreover, in each net heat plot, the area of the grey square within each matrix cell is proportional to the absolute value of the corresponding element of the hat matrix (of the NMA regression model with no designs detached). These are interpretable as the (statistical information) contribution of the direct estimate of the column design to the network estimate of the row design. As proposed by Krahn et al, the net heat plot is colored so that values of $Q^{\mathrm{diff}}_{c,d} > 0$ take on yellow and red colors and values of $Q^{\mathrm{diff}}_{c,d} < 0$ take on white and blue colors. The coloring varies in intensity with the maximum intensity (ie, the brightest colors) representing absolute values of $Q^{\mathrm{diff}}_{c,d}$ greater than or equal to eight. Red colors indicate that the contribution of the evidence from the column design toward the row design is inconsistent with the other evidence in the network. Blue colors indicate that the contribution of the evidence from the column design toward the row design is consistent with the other evidence in the network. 25 This enables the reader to identify which designs are most likely to be responsible for the inconsistency in the network.
Net heat plots can be produced with the package netmeta 21 in R. 26

A CLOSER LOOK AT THE NET HEAT PLOT
As NMA is a form of regression, we would expect any diagnostic useful in the NMA case to be meaningful in simpler cases. We now look in more detail at the calculation underlying the net heat plot starting in Section 4.1 by considering a three-treatment network before generalizing the result and exploring the interpretation in Section 4.2.

Three-treatment network
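We consider a three-treatment network, consisting of treatments A, B, and C, in which direct evidence is available for all pairwise comparisons. In this setting, we consider two-arm trials only. For design c, the inconsistency Q statistics are defined as
$$Q^{\mathrm{inc}}_c = \frac{(\hat\theta^{\mathrm{dir}}_c - \hat\theta^{\mathrm{net}}_c)^2}{s_c^2}, \quad (3)$$
where $s_c^2 = \mathrm{Var}(\hat\theta^{\mathrm{dir}}_c)$. $Q^{\mathrm{inc}}_c$ represents the difference between the direct and network evidence for design c across the whole network. Continuing with c = AC, we have
$$Q^{\mathrm{inc}}_{AC(d)} = \frac{(\hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{net}}_{AC(d)})^2}{s_{AC}^2}, \quad (4)$$
where $Q^{\mathrm{inc}}_{c(d)}$ represents the difference between the direct and network evidence for design c when design d is detached, and $Q^{\mathrm{diff}}_{c,d}$ represents the change in inconsistency for design c when design d is excluded from the network so that
$$Q^{\mathrm{diff}}_{AC,d} = Q^{\mathrm{inc}}_{AC} - Q^{\mathrm{inc}}_{AC(d)}. \quad (5)$$
In a single loop of two-arm trials, the network estimate is the inverse-variance weighted average of the direct and indirect estimates, so that
$$\hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{net}}_{AC} = \frac{s_{AC}^2}{s_{AB}^2 + s_{AC}^2 + s_{BC}^2}\,(\hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{ind}}_{AC}).$$
When d ≠ c, the pathway of indirect evidence must include design d. Therefore, the network estimate of design c when design d is detached is $\hat\theta^{\mathrm{net}}_{AC(d)} = \hat\theta^{\mathrm{dir}}_{AC}$. In this setting, $Q^{\mathrm{inc}}_{AC(d)} = 0$. Therefore, (5) can be rewritten as
$$Q^{\mathrm{diff}}_{AC,d} = Q^{\mathrm{inc}}_{AC} = \frac{s_{AC}^2\,(\hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{ind}}_{AC})^2}{(s_{AB}^2 + s_{AC}^2 + s_{BC}^2)^2}. \quad (6)$$
When d = c, the network estimate for design c, when the direct evidence for design c is excluded, is equal to the indirect evidence for design c, $\hat\theta^{\mathrm{net}}_{AC(AC)} = \hat\theta^{\mathrm{ind}}_{AC}$, so that
$$Q^{\mathrm{diff}}_{AC,AC} = (\hat\theta^{\mathrm{dir}}_{AC} - \hat\theta^{\mathrm{ind}}_{AC})^2 \left[\frac{s_{AC}^2}{(s_{AB}^2 + s_{AC}^2 + s_{BC}^2)^2} - \frac{1}{s_{AC}^2}\right]. \quad (7)$$
In both cases, (6) and (7) are scaled and squared versions of the inconsistency parameter (1). Thus, the net heat statistics are correlated with the formal inconsistency test statistic in this setting. However, these scaled versions of the inconsistency parameter have scaled chi-squared distributions, making them awkward to interpret; why scale when the unscaled version has a known distribution?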
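The scaling can be made concrete with a short Python sketch. It assumes the closed forms that follow from the definitions in this section for a single loop of two-arm trials: the contribution of design c is $s_c^2\,\hat\omega^2 / (s_{AB}^2 + s_{AC}^2 + s_{BC}^2)^2$, the inconsistency for c is zero after detaching any other design, and it equals $\hat\omega^2 / s_c^2$ after detaching c itself. The function name and the input values are hypothetical.

```python
def net_heat_three_loop(direct, variances):
    """Net heat statistics for one loop of two-arm designs AB, AC, BC.

    direct and variances are dicts keyed by 'AB', 'AC', 'BC'.
    Returns the inconsistency estimate omega, the per-design Q_inc,
    and Q_diff keyed by (row design c, detached design d).
    """
    omega = direct["AC"] - (direct["AB"] + direct["BC"])
    total = sum(variances.values())
    # Direct-minus-indirect has the same magnitude for every design in
    # the loop, so only the variance scaling differs between designs.
    q_inc = {c: variances[c] * omega ** 2 / total ** 2 for c in variances}
    q_diff = {}
    for c in variances:
        for d in variances:
            # Detaching c itself leaves only indirect evidence for c;
            # detaching another design breaks the loop, so the fit is exact.
            q_inc_detached = omega ** 2 / variances[c] if d == c else 0.0
            q_diff[(c, d)] = q_inc[c] - q_inc_detached
    return omega, q_inc, q_diff


# Hypothetical direct estimates and variances.
direct = {"AB": -0.10, "AC": -0.30, "BC": 0.00}
variances = {"AB": 0.004, "AC": 0.005, "BC": 0.010}

omega, q_inc, q_diff = net_heat_three_loop(direct, variances)
bucher_chi2 = omega ** 2 / sum(variances.values())
print(f"squared loop test statistic: {bucher_chi2:.3f}")
for c in q_inc:
    print(c, f"Q_inc = {q_inc[c]:.3f}")
```

The per-design values sum to the squared loop inconsistency test statistic, so each cell of the plot shows only a variance-weighted share of a quantity whose distribution is known when left unscaled. The design with the largest variance receives the largest share.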

Generalizing the net heat plot to a network with k+2 treatments where direct evidence is limited to specific comparisons
In this section, we use a more general network to illustrate the mathematics behind the net heat plot. We assume a network of two-arm trials consisting of treatments A and B and additional treatments $X_1, X_2, \dots, X_k$. In this network, there is only direct evidence comparing A versus B, A versus $X_1, X_2, \dots, X_k$, and B versus $X_1, X_2, \dots, X_k$. There are no trials directly comparing $X_i$ and $X_j$. We make the same assumptions as before: each trial has the same number of patients and each comparison has the same number of trials. Here, for simplicity, we assume the variance of the treatment effect, $s^2$, is common to all designs. We assume an equal weight of $1/s^2$ for each of the direct comparisons in the network so that each indirect comparison has weight $1/(2s^2)$. We let c be the design of interest (eg, A versus B), with direct estimate $\hat\theta^{\mathrm{dir}}_c$. There are k possible indirect pathways, each involving a single additional node. Each additional node adds one loop to the network. Therefore, there are a total of k + 2 treatments relevant to design c. Denote the indirect estimates by $\hat\theta^{\mathrm{ind}(i)}_c$, $i = 1, \dots, k$. The network estimate of c is equal to the weighted average of all the direct and indirect evidence combined, that is,
$$\hat\theta^{\mathrm{net}}_c = \frac{\frac{1}{s^2}\hat\theta^{\mathrm{dir}}_c + \frac{1}{2s^2}\sum_{i=1}^{k}\hat\theta^{\mathrm{ind}(i)}_c}{\frac{1}{s^2} + \frac{k}{2s^2}} = \frac{2\hat\theta^{\mathrm{dir}}_c + \sum_{i=1}^{k}\hat\theta^{\mathrm{ind}(i)}_c}{k + 2}.$$
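We write the difference between the network evidence on c when d is excluded and the network evidence on c in terms of $\hat\theta^{\mathrm{ind}(i)}_c$. Let $\hat\theta^{\mathrm{ind}(d)}_c$ denote the indirect estimate whose pathway includes design d. Detaching d removes this pathway, so $\hat\theta^{\mathrm{net}}_{c(d)} = \big(2\hat\theta^{\mathrm{dir}}_c + \sum_{i \neq d}\hat\theta^{\mathrm{ind}(i)}_c\big)/(k + 1)$ and
$$\hat\theta^{\mathrm{net}}_{c(d)} - \hat\theta^{\mathrm{net}}_c = \frac{2(\hat\theta^{\mathrm{dir}}_c - \hat\theta^{\mathrm{ind}(d)}_c) + \sum_{i \neq d}(\hat\theta^{\mathrm{ind}(i)}_c - \hat\theta^{\mathrm{ind}(d)}_c)}{(k + 1)(k + 2)},$$
and putting it all together
$$Q^{\mathrm{diff}}_{c,d} = \frac{1}{s^2}\,(\hat\theta^{\mathrm{net}}_{c(d)} - \hat\theta^{\mathrm{net}}_c)\,(2\hat\theta^{\mathrm{dir}}_c - \hat\theta^{\mathrm{net}}_c - \hat\theta^{\mathrm{net}}_{c(d)}). \quad (8)$$
Else, if the direct comparison is detached, the network estimate is the average of the indirect estimates, $\bar\theta^{\mathrm{ind}}_c = \frac{1}{k}\sum_{i=1}^{k}\hat\theta^{\mathrm{ind}(i)}_c$, and
$$Q^{\mathrm{diff}}_{c,c} = -\frac{4(k + 1)}{(k + 2)^2}\,\frac{(\hat\theta^{\mathrm{dir}}_c - \bar\theta^{\mathrm{ind}}_c)^2}{s^2}.$$
For k = 1, the three-treatment case, we obtain (6) and (7). Suppose k is large so that k + 1 ≈ k; then, we can approximate (8) by
$$Q^{\mathrm{diff}}_{c,d} \approx \frac{1}{s^2}\,P_1 P_2. \quad (9)$$
Essentially, (9) is a scaled product of two terms,
$$P_1 = \bar\theta^{\mathrm{ind}(-d)}_c - \hat\theta^{\mathrm{ind}(d)}_c, \qquad P_2 = \frac{1}{k}\big(2\hat\theta^{\mathrm{dir}}_c - \bar\theta^{\mathrm{ind}}_c - \bar\theta^{\mathrm{ind}(-d)}_c\big),$$
where $\bar\theta^{\mathrm{ind}(-d)}_c$ is the average of the indirect estimates excluding $\hat\theta^{\mathrm{ind}(d)}_c$; then, if k is large, we can simplify further by treating $\bar\theta^{\mathrm{ind}(-d)}_c \approx \bar\theta^{\mathrm{ind}}_c$, which gives $P_2 \approx \frac{2}{k}(\hat\theta^{\mathrm{dir}}_c - \bar\theta^{\mathrm{ind}}_c)$. Full details can be found in Appendix B (supplementary material). Term $P_1$ is the difference between the average indirect estimate for design c excluding design d and the indirect evidence for design c "from design d." While the square of this is a plausible measure of the difference between the evidence coming from the loop including design d and the rest of the network (excluding the direct evidence), it is not specific to design d but to the loop including design d.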
Term P 2 is a scaled difference between the direct evidence for design c and the indirect evidence for design c. Term P 2 can be large if the direct and indirect evidence differ and small if the direct and indirect evidence are similar. Therefore, in some cases, it could be a poor choice of multiplier for term P 1 .
We conclude that the terms used in the net heat plot neither generally identify designs causing inconsistency nor are necessarily relatively large if inconsistency is present (as P 2 may be small).
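The dependence of the net heat statistic on the number of loops can be illustrated directly under the equal-variance assumptions of this section. The following Python sketch assumes that each direct comparison has variance $s^2$ and each single-node indirect estimate has variance $2s^2$, so that the network estimate of design c is (2 × direct + sum of indirect estimates)/(k + 2). The function name and all numbers are hypothetical and do not reproduce the simulation study reported below.

```python
def q_diff_k_loops(direct, indirect, s2, d=0):
    """Q_diff for design c when the design in loop d is detached.

    direct   : direct estimate for design c
    indirect : list of k single-node indirect estimates for design c
    s2       : common variance of every direct comparison
    d        : index of the loop whose design is detached
    """
    k = len(indirect)
    net = (2.0 * direct + sum(indirect)) / (k + 2)
    remaining = indirect[:d] + indirect[d + 1:]
    net_detached = (2.0 * direct + sum(remaining)) / (k + 1)
    q_inc = (direct - net) ** 2 / s2
    q_inc_detached = (direct - net_detached) ** 2 / s2
    return q_inc - q_inc_detached


# One inconsistent loop (indirect estimate 2.0) plus k - 1 loops that
# agree with the direct estimate of 0.0; s2 = 0.01 is an arbitrary choice.
for k in range(1, 11):
    indirect = [2.0] + [0.0] * (k - 1)
    value = q_diff_k_loops(0.0, indirect, s2=0.01)
    flag = "above 8" if value > 8 else "below 8"
    print(f"k = {k:2d}  Q_diff = {value:7.3f}  ({flag})")
```

The disagreement in the first loop is held fixed at 2.0 throughout, yet the statistic falls as 4/((k + 2)² s²) and eventually drops below the plot's maximum-intensity value of 8, purely because consistent loops have been added elsewhere in the network.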

SIMULATION STUDY: WHAT HAPPENS AS WE INCREASE THE NUMBER OF TREATMENT LOOPS IN A NETWORK?
In Section 4.2, we used equal variances to simplify calculations. However, this is unlikely to be realistic in most NMA cases. We now address this by using simulation to investigate what happens when we have the situation described in Section 4.2 where P 1 is large, P 2 is small, and we have unequal variances: our aim is to demonstrate that P 2 is a poor choice of multiplier for P 1 . In more detail, the aim of this simulation study is to show, in a network in which we know there is inconsistency, that as the network increases in size, the ability of the net heat approach to identify this inconsistency is diminished.
We consider a network consisting of one treatment loop in which all the treatment effects are the same. We then inflate the treatment effect in one design to introduce inconsistency into the network. Treatment loops are added one at a time to the network and the values of $Q^{\mathrm{diff}}_{c,d}$, $Q^{\mathrm{inc}}_c$, and $Q^{\mathrm{inc}}_{c(d)}$ are monitored. As above, $Q^{\mathrm{inc}}_c$ quantifies the total amount of inconsistency for design c before detachment of design d. $Q^{\mathrm{inc}}_{c(d)}$ quantifies the total amount of inconsistency for design c after detachment of design d. $Q^{\mathrm{diff}}_{c,d}$ quantifies the reduction in inconsistency for design c following the detachment of design d.
Specifically, we start with a network consisting of one treatment loop (A,B,C). For each design, we simulate six trials. We generate the true treatment effects for each trial from designs AB and BC from a normal distribution with mean 0 and standard deviation 0.2. We generate the true treatment effect for the design AC for each trial from a normal distribution with mean 2 and standard deviation 0.2. This has the effect of introducing inconsistency between the direct and indirect evidence for the AC comparison. For each simulated trial treatment estimate, a corresponding standard error estimate is simulated from the normal distribution for the treatment effect with mean 0 and standard deviation 1. This ensures the standard error estimates are positive. As we move through the sequence of networks, each time we resimulate the true treatment effects from these distributions. We repeat this process, adding one treatment at a time. At each stage, we have a network of two-arm trials consisting of treatments A and B and additional treatments $X_1, X_2, \dots, X_k$. There is only direct evidence comparing A versus B, A versus $X_1, X_2, \dots, X_k$, and B versus $X_1, X_2, \dots, X_k$. There are no trials directly comparing $X_i$ and $X_j$. We stopped when we reached 10 treatment loops.
Q diff c,d , Q inc c , and Q inc c(d) are calculated with c = AB and d = AC. R code can be found in Appendix D (supplementary material).
In this situation, we know that before detachment of designs, inconsistency will be present between the direct and indirect estimates for the design AB because the indirect estimate for AB includes the inflated estimate of AC. Detaching design AC will then remove the inconsistency in the network, which will be quantified by $Q^{\mathrm{diff}}_{c,d}$. Figure 4 plots $Q^{\mathrm{diff}}_{c,d}$ against the number of treatment loops in the network. Estimates of $Q^{\mathrm{diff}}_{c,d}$, $Q^{\mathrm{inc}}_c$, and $Q^{\mathrm{inc}}_{c(d)}$ are presented in Table S1 (supplementary material). In terms of the notation used in Section 4.2, we expect to see that as we increase the number of treatment loops in the network, $P_1$ remains the same, but $P_2$ is reduced because adding more indirect evidence to the calculation of the indirect estimate for design c, $\hat\theta^{\mathrm{ind}}_c$, "waters down" the evidence coming from design d, thus masking the inconsistency in the network. This shows that $P_2$ is a poor choice of multiplier for $P_1$. Figure 4 and Table S1 (supplementary material) confirm this, showing that inconsistency due to design d in the net heat plot diminishes as the number of treatment loops increases but the amount of inconsistency in loop ABC remains the same. Therefore, as we increase the size of the network, the effect of inconsistency in one design is reduced so that in a network with a large number of loops, inconsistency will be hidden, ie, as we increase the amount of indirect evidence on design c, the inconsistency in design d is masked. The net heat plot highlights concerns about inconsistency in a network when $Q^{\mathrm{diff}}_{c,d} > 8$. In this example, concerns about inconsistency are masked once there are seven or more treatment loops. Inconsistency is a property of loops and as such the loop-specific approaches considered in Sections 3.1, 3.2, and 3.3 are not affected by increasing the number of treatment loops in a network.
However, node-splitting models which compare the direct and indirect evidence for a comparison may be affected by increasing the number of consistent treatment loops. Therefore, we applied the node-splitting approach to the same 10 simulated datasets. As expected, increasing the number of consistent treatment loops in the network ($ABX_1$, $ABX_2$, …) increased the sources of indirect evidence and reduced the …
The key differences between the net heat plot and the node-splitting approach are that (1) the net heat plot multiplies $P_1$ and $P_2$ while claiming to identify when $P_1$ is large (irrespective of $P_2$) and (2) the node-splitting approach gives a statistically valid estimate of $P_2$ and test of the null hypothesis that it is zero.

APPLICATION OF METHODS FOR ASSESSING INCONSISTENCY
In this section, we apply the five methods for assessing inconsistency described in Section 3 to the lung cancer and diabetes networks.

Lung cancer network
We now apply the methods described in Section 3 to the lung cancer network introduced in Section 2.1. Cochran's Q statistic showed evidence of statistically significant heterogeneity in the whole network (Q = 56.59, 40 df, p=0.043) and inconsistency between designs ($Q^{\mathrm{inc}}$ = 4.52, 1 df, p=0.034). Heterogeneity within designs was close to the threshold of 0.05 but did not reach statistical significance ($Q^{\mathrm{het}}$ = 52.07, 39 df, p=0.079). In the lung cancer network, where there are no multi-arm trials, the loop inconsistency approach and Cochran's Q statistic are algebraically equivalent and therefore provide the same level of evidence for inconsistency. Letting A = RT, B = Seq CT, and C = Con CT, we have $\hat\theta^{\mathrm{dir}}_{AB}$ = −0.132, Var …
To assess inconsistency and estimate both the direct and indirect evidence simultaneously, we conducted an NMA using the Royston-Parmar time-to-event model, including a fixed effect inconsistency parameter following the method of Lu and Ades. 10 The inconsistency parameter was fitted with a noninformative normal prior distribution. The inconsistency parameter was estimated as −0.176 (95% Credible Interval: −0.337, −0.016), giving an approximate p-value of 0.032 and suggesting evidence of network inconsistency. Node-splitting also resulted in p=0.033 for the difference between the direct and indirect evidence for each treatment comparison (Table 1).
The net heat plot is presented in Figure 5. The yellow colors indicate $Q^{\mathrm{diff}}_{c,d} > 0$. However, there are no areas of vibrant red, so it may be reasonable to conclude that there is no meaningful inconsistency in the lung cancer network, in contrast to the methods above. The difference in the shades of yellow suggests that inconsistency is most important in the Seq CT v Con CT treatment comparison. However, the Seq CT v Con CT comparison has the least amount of direct evidence, and therefore, the decomposition of Q has attributed the inconsistency mainly to this comparison.
To explore (6)  The Q statistics can be calculated from (3), (4), and (5) as follows: 026, which gives the same result as (6), indicating negligible inconsistency, in contrast, to a formal statistical test which rejects the null hypothesis with p=0.03.

Diabetes network
We now apply the methods described in Section 3 to the diabetes network introduced in Section 2.2. Cochran's Q statistic showed evidence of statistically significant inconsistency between designs (Q inc = 22.53, 7 df, p=0.002) and heterogeneity within designs (Q het = 74.46, 11 df, p<0.001). The net heat plot (Figure 6) raises concerns about inconsistency (Q diff c,d > 8) within the metformin (metf), sulfonylurea (sulf), and rosiglitazone (rosi) treatment loop, particularly in the comparisons involving sulfonylurea. Unlike the net heat plot, the loop inconsistency and node-splitting approaches are able to formally test this. The results of node-splitting in the diabetes network are presented in Table S2 (supplementary material). For the sulfonylurea versus rosiglitazone and sulfonylurea versus metformin comparisons, p<0.001, suggesting evidence of important inconsistency within the diabetes network.
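The p-values attached to the Q statistics follow from referring each statistic to a chi-squared distribution with the stated degrees of freedom. The following sketch reproduces them from the reported values using only the Python standard library; the function `chi2_sf` is a hypothetical helper based on the closed-form tail expressions for integer degrees of freedom, not code from the original analysis.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2x/pi) * exp(-x/2) * sum_{r=1}^{(df-1)/2} x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(math.sqrt(x / 2.0))
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * total

# Q statistics as reported in the text: (value, degrees of freedom)
reported = {
    "diabetes Q_inc": (22.53, 7),
    "diabetes Q_het": (74.46, 11),
    "lung Q_inc": (4.52, 1),
    "lung Q_het": (52.07, 39),
    "lung Q": (56.59, 40),
}
for name, (q, df) in reported.items():
    print(f"{name}: Q = {q}, df = {df}, p = {chi2_sf(q, df):.4f}")
```

For the lung cancer network, the reported values also satisfy the decomposition of Q into its within-design and between-design parts: 52.07 + 4.52 = 56.59 on 39 + 1 = 40 degrees of freedom.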
We have not applied the inconsistency parameter approach to the diabetes network. In a large network such as the diabetes network, it is computationally simpler to use the node-splitting approach instead.
In this example, the net heat plot is in agreement with the loop inconsistency and node-splitting approaches, with all three identifying important inconsistency within the metformin, sulfonylurea, and rosiglitazone treatment loop. All three approaches also identified the treatment loop metformin, pioglitazone (piog), and placebo (plac) as an area of concern. The net heat plot colors this treatment loop yellow (Q diff c,d ≈ 4), suggesting that although inconsistency may be present, it is not important. The loop inconsistency approach is able to formally test this and reaches a similar conclusion (z=1.80, p=0.073). The node-splitting approach also suggests evidence of important inconsistency in the network (Table S2, supplementary material).
In this example, the net heat plot, the loop inconsistency approach, and node-splitting all identified the same treatment loops as potential sources of inconsistency in the network. However, only the loop inconsistency and node-splitting approaches are able to formally test inconsistency in loops. Therefore, in this example, node-splitting is advantageous over the net heat plot because it not only assesses all the treatment loops in the network but is also able to formally test for evidence of important inconsistency.
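To make concrete what "formally test" means for the loop inconsistency approach, the sketch below implements a Bucher-style comparison of a direct estimate with the indirect estimate formed through a third treatment. The function name and all numerical inputs are hypothetical and chosen only for illustration; they are not estimates from the diabetes or lung cancer networks.

```python
import math

def loop_inconsistency(d_ab, v_ab, d_ac, v_ac, d_bc, v_bc):
    """Direct A-B estimate versus the indirect A-B estimate through C.

    d_xy is the estimated effect of y relative to x and v_xy its variance.
    The three estimates are assumed to come from independent two-arm trials.
    """
    d_ind = d_ac - d_bc                      # indirect A-B estimate via C
    v_ind = v_ac + v_bc                      # variances add for independent sources
    omega = d_ab - d_ind                     # estimated inconsistency in the loop
    z = omega / math.sqrt(v_ab + v_ind)      # standardized inconsistency
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return omega, z, p

# Hypothetical log hazard ratios and variances (illustration only)
omega, z, p = loop_inconsistency(d_ab=-0.30, v_ab=0.010,
                                 d_ac=-0.10, v_ac=0.010,
                                 d_bc=0.05, v_bc=0.005)
print(f"omega = {omega:.3f}, z = {z:.3f}, p = {p:.3f}")
```

The same normal reference distribution links the z-statistic quoted above to its p-value: a z of 1.80 corresponds to a two-sided p-value of about 0.07.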

CONCEPTUAL CRITIQUE OF THE NET HEAT PLOT
The net heat plot aims to identify a specific design (or designs) that drive inconsistency in a network. However, locating inconsistency to a specific design (or even a pair of designs) is a difficult and sometimes impossible task since inconsistency arises from comparisons between at least three designs. In a three-treatment network, inconsistency can only be identified and not actually located. Thus, any attempt to locate inconsistency within designs is potentially misleading, in particular because it may tend to attribute inconsistency to areas with less evidence. For example, in Figure 5, the difference in the shades of yellow suggests that inconsistency is most important in the Seq CT v Con CT treatment comparison. However, the Seq CT v Con CT comparison has the least amount of direct evidence, and therefore, the decomposition of Q has attributed the inconsistency mainly to this comparison. We expect something similar would also happen in more complex networks.
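The claim that inconsistency in a three-treatment network can be identified but not located can be illustrated numerically. In the sketch below, each of the three comparisons in a single loop is treated in turn as the "direct" evidence and compared with the indirect estimate from the other two. The estimates and variances are hypothetical, and the helper function is an illustrative assumption rather than part of any published method.

```python
import math

def direct_vs_indirect(direct, indirect_parts):
    """z-statistic for a direct estimate against an indirect estimate.

    direct is (estimate, variance); indirect_parts is a list of
    (sign, estimate, variance) terms whose signed sum forms the indirect estimate.
    """
    d_dir, v_dir = direct
    d_ind = sum(sign * d for sign, d, _ in indirect_parts)
    v_ind = sum(v for _, _, v in indirect_parts)
    return (d_dir - d_ind) / math.sqrt(v_dir + v_ind)

# Hypothetical (estimate, variance) pairs; d_xy is the effect of y relative to x
ab = (-0.30, 0.010)
ac = (-0.10, 0.010)
bc = (0.05, 0.005)

z_ab = direct_vs_indirect(ab, [(+1, *ac), (-1, *bc)])  # AB against AC - BC
z_ac = direct_vs_indirect(ac, [(+1, *ab), (+1, *bc)])  # AC against AB + BC
z_bc = direct_vs_indirect(bc, [(+1, *ac), (-1, *ab)])  # BC against AC - AB

for name, z in [("AB", z_ab), ("AC", z_ac), ("BC", z_bc)]:
    print(f"{name}: z = {z:+.4f}, z^2 = {z * z:.4f}")
```

Whichever comparison is singled out, the squared statistic is identical, because the numerator is always the same loop discrepancy (up to sign) and the denominator is always the sum of the three variances. The data therefore carry no information about which comparison is "responsible", regardless of how much direct evidence each has.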
Within a network one (or more) deviating direct comparison(s) may affect the network estimates of other comparisons, producing hot spots of inconsistency, ie, treatment comparisons responsible for inconsistency in one or more treatment loops. 12 The very concept of a "hot spot" is not clearly defined by Krahn et al, 12 and the asymmetric nature of the net heat plot makes interpretation harder. In addition, Krahn et al 12 were unclear about how the intensity of color in the net heat plot relates to important, clinically meaningful inconsistency. For example, in Figure 5, the yellow colors indicate Q diff c,d > 0. However, for our lung cancer network, there are no areas of vibrant red, so it may be reasonable to conclude that there is no meaningful inconsistency in the lung cancer network, in contrast to Section 6.1.
Inconsistency is a loop property; it does not make sense at the level of an individual design. Further, it cannot be linked to a specific design in the loop unless at least one design is part of more than one loop. In other words, locating inconsistency within a network depends on the structure of the network, and no simple method works for all networks. Identifying inconsistency will depend to some extent on the network connectedness and the number of treatments and trial designs. Indeed, if more than one design deviates from the true effect, then it is possible that inconsistency might be masked. Similarly, inconsistency might be harder to spot in a fully connected network, where there are numerous pathways of indirect evidence, than in a network with fewer direct (and indirect) connections.
Unlike Q, Q het , and Q inc , which follow chi-squared distributions, Q diff c,d , as the difference between two approximately chi-squared distributed, correlated components, has a nonstandard distribution and is therefore hard to interpret. Complex calculations would be required to calculate the sampling distribution and obtain a p-value. One possibility would be to use bootstrapping, but since Q diff c,d does not have a natural interpretation, we did not pursue this. Ideally, what is needed is a way to combine the graphical approach utilized by the net heat plot with the results of the formal statistical tests implemented in the node-splitting and loop inconsistency approaches to produce a graphically accessible way for identifying inconsistency in networks.
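A toy simulation can illustrate why a difference of two correlated chi-squared components does not have a chi-squared reference distribution. The sketch below is not the Q diff c,d calculation itself: it simply draws two correlated chi-squared variables with 1 df each (the correlation parameter `rho`, the number of draws, and the seed are arbitrary assumptions) and examines their difference.

```python
import random

def simulate_q_difference(rho, n_draws, seed=1):
    """Draw differences X - Y of two correlated chi-squared (1 df) variables.

    X = Z1^2 and Y = (rho*Z1 + sqrt(1 - rho^2)*Z2)^2, with Z1, Z2 independent
    standard normals, so both X and Y are marginally chi-squared with 1 df.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = z1 ** 2
        y = (rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2) ** 2
        diffs.append(x - y)
    return diffs

diffs = simulate_q_difference(rho=0.6, n_draws=20000)
share_negative = sum(d < 0 for d in diffs) / len(diffs)
print(f"share of negative differences: {share_negative:.3f}")
print(f"smallest and largest difference: {min(diffs):.2f}, {max(diffs):.2f}")
```

In this toy setting, about half of the simulated differences are negative, which a chi-squared variable can never be, and the spread of the difference depends on the correlation between the two components. Any reference distribution for such a statistic would therefore have to be derived or simulated case by case.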

DISCUSSION
Inconsistency in a network can lead to biased treatment effect estimates; therefore, it is important that attempts are made to identify, understand, and adjust for inconsistency. There are many methods for assessing inconsistency in NMA. In this paper, we considered five of the most popular methods from the simplest method of loop inconsistency 9 to more complex models such as the inconsistency parameter approach 10 and the graphical net heat approach. 12 The net heat plot calculates the change in inconsistency across the network caused by relaxing the consistency assumption for each design. The change in inconsistency is known as Q diff c,d , and these values are displayed graphically in the net heat plot. We derived a formula for Q diff c,d , which could be applied to a network in which two treatments are both directly compared with other treatments to quantify the amount of inconsistency in the network using the net heat plot. We have shown that Q diff c,d can be difficult to interpret and, in some cases, a misleading measure of inconsistency. In the special case of three-treatment networks, it is approximately an arbitrary scaled version of the difference between the direct and the indirect evidence, which explains why, in the lung cancer example, the net heat plot did not identify the same possibility of inconsistency as the analyses in Section 6.1. We advise that the net heat plot is interpreted with caution.
The net heat plot uses Cochran's Q statistic 8 in a fixed effect framework and decomposes it into within-trial and between-trial heterogeneity. This reflects the fact that heterogeneity and inconsistency can be considered as different aspects of heterogeneity, where inconsistency is the discrepancy between results of single studies and predictions based on a consistency model. 12 The within-trial and between-trial heterogeneity statistics are assumed to follow chi-squared distributions. The lung cancer example showed little evidence of heterogeneity, and therefore, it was appropriate, for this example, to use a fixed effect model that assumed that there was no heterogeneity within designs. Although more complex, the calculations in Section 4.2 could be conducted using a random effects model, and this may be more appropriate when heterogeneity is present in a network. However, further investigation is required to determine how the net heat plot identifies inconsistency when heterogeneity is present.
In this paper, we have shown through simulation that inconsistency in larger networks may be hidden when using the net heat plot alone (Figure 4). We have also shown that the statistics on which the net heat plot is built are sensible in some scenarios but have a somewhat arbitrary weighting. In all scenarios, they are scaled versions of the loop inconsistency test statistic and as such have scaled chi-squared distributions. However, as Hoaglin 27 discusses, the Q statistics only approach the chi-squared distribution if the study sizes are large (mainly because the standard errors are generally not known but estimated), which may not be the case in many meta-analyses. While this can be important in applications, it does not invalidate our arguments in this paper. Therefore, in all situations, the statistics behind the net heat plot are unintuitive, awkward to interpret, and do not lend themselves to statistical testing. Furthermore, we have shown that the statistics underpinning the net heat plot can neither generally identify designs causing inconsistency nor are they necessarily relatively large if inconsistency is present. Hence, inconsistency in larger networks may be hidden when the net heat plot is used on its own to identify inconsistency. Therefore, it may be that no one method should be considered alone for assessing inconsistency and that a combination of approaches is the best way forward, although this introduces the challenge of interpreting potentially conflicting results from multiple tests.
Throughout this paper, except for the diabetes network, we assumed all networks contained two-arm trials only, and the indirect evidence for a design was assumed to come from pathways involving one additional treatment only. While this is unlikely to be true in larger networks, the weighting of the indirect evidence gets smaller as more additional treatments are involved, so the contribution of longer pathways to the indirect evidence is minimal. Furthermore, we have shown that the net heat approach can be misleading when only considering two-arm trials. Therefore, given the added complexity of including multiarm trials in a network, it is likely that interpreting the net heat plot will only become more problematic with increasing network complexity.
Using the loop inconsistency approach to test for inconsistency within each loop leads to problems with multiple testing and can be cumbersome in networks with many treatment loops. By contrast, the inconsistency parameter approach is straightforward to incorporate within most NMA models and quantifies inconsistency but does not provide a straightforward way of locating the inconsistency. In large networks, the net heat plot is straightforward to implement, and the provision of freely available user-friendly software is likely to increase the popularity of the approach. Previously, node-splitting was cumbersome in large networks as each comparison of interest requires a separate model. However, a decision rule that chooses which comparisons to split, only selecting comparisons in potentially inconsistent loops but ensuring that all potentially inconsistent loops in the network are investigated, has eliminated most of the manual work involved in using the node-splitting approach, even in large networks. 28 Furthermore, node-splitting has the added advantage over the net heat approach of being able to statistically test for evidence of inconsistency.
Other methods of assessing inconsistency which have not been considered in this paper include the design-by-treatment interaction model, 5,29 random inconsistency effects, 30-34 factorial analysis of variance, 35 generalized linear mixed models, 36,37 and the two-stage approach. 38 Furthermore, if covariates are distributed unevenly between trials, then inconsistency may be reduced by adjusting for covariates. 39,40 For a review of methods for assessing inconsistency in NMA, we recommend Donegan et al. 15 All methods to assess inconsistency should be interpreted cautiously, taking the clinical context into account.
In MA, forest plots can be used to check for outlying single studies and highly weighted studies, which can both be influential. In NMA, where evidence for a treatment comparison comes from several sources, a forest plot may not provide all the information necessary for assessing influential trials or designs. Additional complexity arises when a network includes multiarm trials. Therefore, careful exploratory work plus presenting the results as in Figure 2 are the key rather than the net heat plot. 41 Furthermore, recent work to reduce the cumbersome nature of using node-splitting in large networks 28 means that an accessible graphical display of node-splitting results may be the graphical representation of inconsistency that analysts need to identify inconsistency in their NMAs.
It is important that attempts are made to identify, understand, and adjust for inconsistency in a network. The net heat plot is an arbitrary weighting of the loop inconsistency statistics, which does not lend itself to statistical testing and can mask inconsistency in larger networks. We advise that the net heat plot is used with caution. Alternative graphical methods to the net heat plot, which appropriately assess the amount of inconsistency within a network and display the results graphically, clearly highlighting influential and inconsistent designs, are needed.