Assessing the consistency assumptions underlying network meta‐regression using aggregate data

When numerous treatments exist for a disease (Treatments 1, 2, 3, etc), network meta‐regression (NMR) examines whether each relative treatment effect (eg, mean difference for 2 vs 1, 3 vs 1, and 3 vs 2) differs according to a covariate (eg, disease severity). Two consistency assumptions underlie NMR: consistency of the treatment effects at the covariate value 0 and consistency of the regression coefficients for the treatment by covariate interaction. The NMR results may be unreliable when the assumptions do not hold. Furthermore, interactions may exist but are not found because inconsistency of the coefficients is masking them, for example, when the treatment effect increases as the covariate increases using direct evidence but the effect decreases with the increasing covariate using indirect evidence. We outline existing NMR models that incorporate different types of treatment by covariate interaction. We then introduce models that can be used to assess the consistency assumptions underlying NMR for aggregate data. We extend existing node‐splitting models, the unrelated mean effects inconsistency model, and the design by treatment inconsistency model to incorporate covariate interactions. We propose models for assessing both consistency assumptions simultaneously and models for assessing each of the assumptions in turn to gain a more thorough understanding of consistency. We apply the methods in a Bayesian framework to trial‐level data comparing antimalarial treatments using the covariate average age and to four fabricated data sets to demonstrate key scenarios. We discuss the pros and cons of the methods and important considerations when applying models to aggregated data.


| INTRODUCTION
Reviews often compare multiple treatments for the same condition. In such cases, network meta-analysis (NMA) can compare all treatments (eg, Treatments 1, 2, and 3) in a single analysis by estimating the relative treatment effects (eg, log odds ratios) for all treatment pairings (eg, 2 vs 1, 3 vs 1, and 3 vs 2) using direct and indirect evidence. [1][2][3] The key assumption underlying NMA is consistency of the treatments effects across direct and indirect evidence. 3 Many methods have been proposed to assess the consistency assumption underlying NMA, 4 including node-splitting models 5,6 and inconsistency models, such as the design by treatment (DBT) inconsistency model [7][8][9][10][11] and the unrelated mean effects (URM) inconsistency model. 12 Network meta-regression (NMR) is an extension of NMA that examines whether a covariate modifies each of the relative treatment effects. 13 A covariate may modify each relative treatment effect differently; that is, each treatment comparison may have a different covariate interaction. NMR is used to explore causes of heterogeneity or inconsistency or when known effect modifiers exist and we wish to present results for different patient groups. Covariates may be characteristics of patients (eg, weight), treatments (eg, additional therapy), studies (eg, location), or methods (eg, allocation concealment). [14][15][16] NMR results commonly consist of, for each comparison, one relative treatment effect estimated at the covariate value 0 (or at the mean covariate value when the NMR model is centered) and one regression coefficient for the treatment by covariate interaction. Consistency assumptions are required for both of these parameters. [17][18][19] For instance, for a three-treatment NMR, where Treatment 1 is taken as the reference, the consistency equation for the relative treatment effects can be written as, d 23 = d 13 − d 12 where, for example, d 23 is the relative treatment effect for 3 vs 2, and the consistency equation for the regression coefficients is β 23 = β 13 − β 12 where, for example, β 23 is the coefficient for 3 vs 2. 13,17,19 It is possible for neither assumption to hold (ie, inconsistent relative treatment effects and inconsistent coefficients) or for only one of the assumptions to hold (ie, either consistent relative FIGURE 1 Graphs showing how the relative treatment effect (eg, log odds ratio) for Treatment 3 vs Treatment 2 could change with a covariate value with separate lines representing direct evidence (from trials that allocated Treatments 2 and 3), indirect evidence (from the remaining trials), and all evidence in various scenarios: A, there is no treatment by covariate interaction based on all evidence, and the relative treatment effects at 0 covariate are consistent, and the regression coefficients for the treatment by covariate interaction are consistent; B, there is an interaction based on all evidence, and the relative treatment effects at 0 covariate are consistent, and the coefficients are consistent; C, there is no interaction based on all evidence, and the relative treatment effects at 0 covariate are consistent, and the coefficients are inconsistent; D, there is an interaction based on all evidence, and the relative treatment effects at 0 covariate are consistent, and the coefficients are inconsistent; E, there is no interaction based on all evidence, and the relative treatment effects at 0 covariate are inconsistent, and the coefficients are consistent; F, there is an interaction based on all evidence, and the relative treatment effects at 0 covariate are inconsistent, and the coefficients are consistent; G, there is no interaction based on all evidence, and the relative treatment effects at 0 covariate are inconsistent, and the coefficients are inconsistent; and H, there is an interaction based on all evidence, and the relative treatment effects at 0 covariate are inconsistent, and the coefficients are inconsistent. Direct, indirect, and all evidence is overlapping in plots (A) and (B) treatment effects or consistent coefficients), which would make the results of the NMR unreliable.
Theoretically, there are eight possible scenarios that can occur when assessing whether treatment by covariate interactions exist and the consistency assumptions. Examples of the scenarios are shown in Figure 1A-H. Each figure shows how the relative treatment effect for 3 vs 2 changes with an increasing covariate value; separate lines are displayed for direct, indirect, and all evidence. For a three-treatment network, the direct evidence for 3 vs 2 would be from trials that allocated Treatments 2 and 3 and the indirect evidence for 3 vs 2 would be from the remaining trials. Note that the lines have the same intercept when the relative treatment effects at the covariate value 0 are consistent (Figure 1 A-D) and the lines have the same slope when the coefficients are consistent ( Figure 1A, B, E, and F). In Figure 1A, no interaction is detected using NMR, and both consistency assumptions are satisfied; therefore, the NMR results are valid but would not be clinically useful. On the other hand, in Figure 1B, NMR shows an interaction and both assumptions hold; therefore, the NMR is reliable and could be used to draw clinical inferences. Figure 1C, E, and G show scenarios where no interaction is detected using NMR, but one or more of the assumptions are not satisfied; consequently, the NMR results are invalid; notably, in Figure 1C,G, an interaction exists when direct evidence, and indirect evidence are considered separately, but it is not seen when applying NMR because it is masked by the inconsistency. Lastly, in Figure 1D, F, and H, an interaction is found using NMR, but one or more of the assumptions do not hold, so the NMR results are unreliable. The cause of inconsistency should be considered when inconsistency is found ( Figures 1C-H).
In this paper, we introduce methods for assessing the consistency assumptions underlying NMR. We extend existing node-splitting models, 5,6 the DBT inconsistency model, [7][8][9][10][11] and the URM inconsistency model 12 to incorporate treatment by covariate interactions. In Section 2, we specify the NMR model and propose assessment methods that can be applied to aggregate trial-level data (ie, trial specific relative treatment effects relative to reference Arm 1 and their variances) with either continuous or categorical covariates. In Section 3, we apply the methods to a real data set and fabricated data sets illustrating key scenarios under a Bayesian framework. In Section 4, we discuss the proposed methods and highlight their pros and cons.

| METHODS
We outline NMR models and then introduce methods for assessing consistency using the node-splitting models and one type of inconsistency model (ie, URM model). New methods based on the alternative DBT inconsistency model are also presented in the supplementary material. All models are summarized in Table 1.
To set notation, let i denote the trial where i = 1, …, S and S is the number of independent trials and let k be the trial arm where k = 1, …, A i and A i is the number of arms in trial i. Let t ik denote the treatment given in trial i in arm k where t ik ∈ {1, ……, T} and T is the number of treatments in the network. Note that Treatment 1 is taken to be the reference treatment. Suppose we have trial-level outcome data, where y ik is the observed relative treatment effect (eg, log odds ratio or mean difference) for arm k vs Arm 1 (with k ≥ 2) in trial i and v ik is the corresponding variance. As the relative treatment effect is a continuous measure, we assume a normal likelihood y ik~N (θ ik, v ik ) where θ ik is the mean relative treatment effect in trial i (with k ≥ 2). Also, the data set would include a study-level covariate x i for each trial i that can be a continuous variable or an indicator variable to represent dichotomous data.

| Network meta-regression models
NMR models estimate the basic regression coefficients, which are the coefficients for each treatment vs Treatment 1 (ie, β 12 , β 13 , …, β 1T ), and then the remaining functional coefficients (ie, β 23 , β 24 , ….) are calculated as linear combinations of the basic coefficients using the consistency equations. Three NMR models have been proposed previously, each making different assumptions regarding the basic coefficients, 13,[17][18][19] that is, independent (model 1a), exchangeable (model 1b), and common coefficients (model 1c). The decision regarding which assumption to make can be based on model fit statistics and the estimated coefficients of the models but in practice is often determined by data availability.
Model 1a can be written as follows: Where β t i1 ;t ik =β 1;t ik -β 1;t i1 , β t i1 ;t ik is the difference in the relative treatment effect of t ik vs t i1 per unit increase in the covariate x i , or in other words, the regression coefficient for the treatment by covariate interaction. In a random-effects model, δ i,1k (with k ≥ 2) represents the trial-specific relative treatment effect of t ik vs t i1 when the covariate is 0 (x i = 0) and is assumed to be a realization from a normal distribution δ i;1k ∼N d t i1 ;t ik ; σ 2 À Á ;t ik is the mean relative treatment effect of t ik vs t i1 when the covariate is 0. In a fixed-effect model, we set σ 2 = 0 to obtain Model 1b is the same as model 1a, but now, β 1;t ik ∼Norm B; υ 2 ð Þ: Model 1c is formulated by setting β 1;t ik ¼ β in model 1a; note that in this model, the functional coefficients are 0 because of the consistency equations (eg, β 23 = β 13 − β 12 = β − β = 0). 17

| Assessing consistency by node splitting
The principle aim of node-splitting models is to assess whether there is evidence of "loop inconsistency," where loop inconsistency is defined as a difference between a result from direct and indirect evidence. Node-splitting models estimate relative treatment effects and/or regression coefficients for the interaction based on direct evidence and separate estimates from indirect evidence to explore whether they agree. Multiple node-splitting models need to be applied, one model for each comparison of interest.
To specify the node-splitting models, we extend the notation, such that the node being split is ( b t, t * ) where b t ≠ t * and b t < t * :For example, if one wants to split the node (3, 4), then b t ¼ 3 and t * = 4.
To assess both the consistency assumptions simultaneously, node-splitting models can split the relative treatment effect and coefficient to provide, for each comparison with both direct and indirect evidence, a relative treatment effect, and a coefficient estimated from direct evidence and an effect and coefficient based on indirect evidence. The model that splits the relative treatment effect and coefficient and includes independent interactions (model 2.1a) is an extension of model 1a as follows: Where β t i1 ;t ik =β 1;t ik -β 1;t i1 , β t i1 ;t ik represents the difference in the relative treatment effect of t ik vs t i1 per unit increase in the covariate estimated using indirect evidence, and β dir represents the difference in the relative treatment effect of t * vs b t per unit increase in the covariate estimated using direct evidence. In a random-effects model, if trial i allocated t * and b t, that is, t i1 = b t and t ik = t * , then δ i,1k~N (d dir , σ 2 ) where d dir represents the mean relative treatment effect of t * vs b t when the covariate value is 0 estimated using direct evidence; whereas if trial i did not allocate t * and b t, that is, t i1 ≠ b t and/or t ik ≠ t * , then ;t ik represents the mean relative treatment effect of t ik vs t i1 when the covariate value is 0 estimated using indirect evidence and d t i1 To assess only the consistency of the relative treatment effects, node-splitting models can split the relative treatment effect alone to produce a single coefficient that is estimated using all evidence and two relative treatment effects (ie, one estimated using direct evidence and the other estimated using the indirect evidence). The model that splits the relative treatment effect alone and includes independent interactions (model 2.2a) is where β t i1 ;t ik represents the difference in the relative treatment effect of t ik vs. t i1 per unit increase in the covariate estimated using all evidence. In this model, the trialspecific relative treatment effects, δ i,1k are distributed in the same way as in model 2.1a.
Likewise, to assess the consistency of the coefficients alone, a node-splitting model can split only the coefficient to estimate a single relative treatment effect using all evidence and two coefficients (e, one estimated from direct evidence and the other from indirect evidence). The model that splits only the coefficient and includes independent interactions (model 2.3a) is the same as model 2.1a except the trial-specific relative treatment effects; ;t ik represents the mean relative treatment effect of t ik vs t i1 when the covariate value is 0 estimated using all evidence.
Node-splitting models can be adapted to include exchangeable (models 2.1b, 2.2b, and 2.3b) or common (models 2.1c, 2.2c, and 2.3c) interactions as described in Section 2.1. Note that model 2.1c and 2.3c fix each functional coefficient based on indirect evidence (ie, β t i1 ;t ik when t i1 ≠ 1) to be 0 whereas the corresponding result from direct evidence (β dir ) is not.
The level of consistency can be assessed, by comparing the model fit of the NMR (model 1[a, b, or c]) with that of the node-splitting models (models 2.1[a, b, or c], 2.2[a, b, or c], and 2.3[a, b, or c]); inconsistency is indicated if a node-splitting model is an improved fit. Moreover, if the between trial variance is lower in the node-splitting models as compared with the NMR, inconsistency may exist. Also, for each treatment comparison, the size, direction, and precision of the relative treatment effect estimated using direct evidence can be compared with that estimated using indirect evidence. Such comparisons are subjective and when results are presented graphically and compared, care must be taken because the scale and shape of the plots can affect how different the results appear to be. Furthermore, when using Bayesian methods, for each comparison, the probability (prob) that the direct and indirect evidence differs can be calculated. For each treatment pairing, the inconsistency estimate (IE); that is, the difference between the relative treatment effect from direct evidence and indirect evidence can be calculated at each iteration of the chain, and the number of iterations for which IE ≥ 0 is counted. It is then possible to calculate the prob that the relative treatment effect from direct evidence exceeds the relative treatment effect from indirect evidence, by dividing the number of counted iterations by the total number of iterations of the chain. Lastly, assuming that the posterior distribution of the difference (IE) is symmetric and unimodal, the prob that the direct and indirect evidence agree is given by P = 2 × minimum(prob, 1 − prob). 5,26 Likewise, the regression coefficients from direct and indirect evidence can be compared in the same way.

| Assessing consistency using URM models
URM models assess global consistency that is inconsistency somewhere in the treatment network, by comparing the results from an NMR model with those from an URM model. 12 The URM model that assesses the consistency of the relative treatment effects and coefficients and includes independent interactions (model 3.1a) is the same as the NMR model (model 1a), but it does not incorporate the consistency equations (i.e. d t i1 ;t ik ¼ d 1;t ik − d 1;t i1 and β t i1 ;t ik =β 1;t ik -β 1;t i1 ), and as such, the model parameters are estimated using direct evidence only. Model 3.1a is equivalent to fitting separate pair-wise meta-regressions, except, model 3.1a assumes the between trial variance (σ 2 ) is equal across comparisons but the pair-wise metaregressions would not.
The URM model that assesses only consistency of the relative treatment effects and includes independent interactions (model 3.2a) is the same as model 3.1a but incorporates the consistency equation for the coefficients. Likewise, the UMR model that assesses only consistency of the coefficients with independent interactions (model 3.3a) is same as model 3.1a but includes the consistency equation for the relative treatment effects.
Exchangeable (models 3.1b, 3.2b, and 3.3b) or common (models 3.1c, 3.2c, and 3.3c) interactions can be included. However, it is worth noting that the independent, exchangeable, or common assumptions are slightly different to those specified for the NMR models (models 1a, 1b, and 1c). In the NMR models, we assume the basic regression coefficients (ie, β 12 , β 13 , …, β 1T ) are independent, exchangeable, or common. However, when the consistency equation for the coefficients is not used in the URM model (ie, models 3.1[a, b, or c] and 3.3[a, b, or c) we can assume that all regression coefficients, that is basic and functional coefficients, are independent, exchangeable (ie, β t i1 ;t ik ∼Norm B; υ 2 ð Þ) or common (ie, β t i1 ;t ik ¼ βÞ. In particular, this means that when including common interactions, the functional coefficients in the NMR model (model 1c) are forced to be 0, but this is not so in the URM model (models 3.1c and 3.3c).
To determine consistency, the model fit of the NMR model (model 1[a, b, or c]) and the fit of the URM models (models 3.1(a, b, or c), 3.2(a, b, or c), and 3.3[(a, b, or c]) can be compared; when an URM model is an improved fit, inconsistency may be present. Also, differences between the relative treatment effects and regression coefficients produced from the NMR model and those from the URM models may suggest inconsistency.

| Including multi-arm trials
The models can be applied to data sets including multiarm trials providing that the correlation between the observed relative treatment effect (y ik ) and the trialspecific relative treatment effects (δ i,1k ) is taken into account. For each multi-arm trial i with m arms, the observed relative treatment effects and the trial-specific relative treatment effects are assumed to follow multivariate normal distributions Furthermore, there is an extra consideration when fitting node-splitting models. 5,6 If one wants to split node (t i1 , t ik ), then a multi-arm trial will contribute direct evidence to the relative treatment effect (d dir ) as required because b t ¼ t i1 . However, the multi-arm trial would not contribute direct evidence to the estimation of the relative treatment effect, d dir , if one splits another node (eg, t i2 , t i3 ) because b t ≠ t i1 . Therefore, to overcome this problem, when a multi-arm trial compared the two treatments, t * and b t, in addition to other treatments, treatment b t is taken to be the baseline treatment t i1 for that study.
Note that for URM models including multi-arm trial data, the URM model is not the same as fitting separate pair-wise meta-regressions because the correlation in multi-arm trials is taken into account but would not be in pair-wise analyses; also, the URM model only uses t i1 as the baseline treatment so direct evidence for some pairwise comparisons would not be used whereas pairwise meta-regression could utilize all direct evidence.

| Data sets
Here, the methods proposed in Section 2 are applied to a real data set and four fabricated data sets that have been manipulated to demonstrate specific scenarios.

| Malaria data set
Two Cochrane reviews and the corresponding trials were used to construct the malaria data set; reviews compared artemether (AR), quinine (QU), and artesunate (AS). 27,28 Randomised controlled trials including patients with severe malaria were eligible. Age was considered to be an effect modifier because the clinical features of malaria differ by age and thus, all treatment recommendations are stratified by age in the reviews and World Health Organization treatment guidelines. 29 Event rates for the primary outcome, death, and the covariate, average age of patients in each trial were extracted. Two studies with missing covariate data were deleted from the data set. Using the event rates, trial-specific log odds ratios and their standard deviations were calculated in R. Table S1 displays the data. Figure 2 shows the network diagram.

| Fabricated data sets
Four fabricated data sets were constructed by manipulating the malaria data set to illustrate key scenarios: (a) no interaction is present and the relative treatment effects and regression coefficients are consistent ( Figure 1A); (b) interaction exists and the relative treatment effects and coefficients are consistent ( Figure 1B); (c) interaction exists, and the relative treatment effects are consistent, but the coefficients are inconsistent ( Figure 1D); (4) no interaction is present, and the relative treatment effects are consistent, but the coefficients are inconsistent ( Figure 1G). Example R code to generate the data sets is given in the supplementary material.
Analogous to the malaria data set, each data set compared three treatments (AS, AR, QU): there was direct evidence for each possible comparison; no multi-arm trials contributed; and a dichotomous outcome and continuous covariate was of interest. Ten trials contributed direct evidence to each comparison. For each study, a continuous covariate was taken to be a realization from normal distribution (ie, N(17, 10 2 )) truncated at 0 to ensure the covariate values were similar to those observed in the malaria data set.
The log odds ratios and regression coefficients were chosen to be similar to those estimated in the original data set. For each data set, the log odd ratio at 0 covariate of trials comparing treatments AR and AS was 0.2, trials comparing treatments QU and AS was 0.23, and trials of treatments QU and AR was 0.03. For data set one, the coefficient for each comparison was 0. For data set two, the coefficient for trials comparing treatments AR and AS was 0.02, trials comparing treatments QU and AS was 0.02, and trials of treatments QU and AR was 0. For data set three, the coefficient for trials comparing treatments AR and AS was 0.01, trials of treatments QU and AS was 0.04, and trials comparing treatments QU and AR was 0. For data set four, the coefficient for trials comparing treatments AR and AS was −0.04, trials of treatments QU and AS was 0.04, and trials of treatments QU and AR was 0.
The trial-specific observed log odds ratios were estimated from the values of log odds ratio at 0 covariate, the coefficients, and the covariates. The between-trial variance was 0. The standard error of the observed log odds ratio was 0.2 for each trial.

| Implementation
All models were fitted to the data sets using WinBUGS 1.4.3 and the R2WinBUGS package in R. Example code is provided as supplementary material. For the malaria data set, all models in Table 1 were fitted. For the fabricated data sets, only fixed-effect versions of models 1a, 2.1a, 3.1a, and 4.1a were applied because the between trial variance was 0 and the coefficients differed across comparisons. See Table S2 for the parameterization of the DBT models. The covariates were centered at their mean. All parameters were given noninformative normal prior distributions (ie, N(0, 100000)) except the between-trial standard deviation that was assumed to follow a noninformative uniform distribution (ie, Uni(0, 10)) and a weakly informative prior distribution (ie, uniform (0, 2)) was specified for the standard deviation of the exchangeable regression coefficients. Three chains with different initial values were run for 300 000 iterations. The initial 100 000 draws were discarded and chains were thinned such that every fifth iteration was retained. Convergence of the chains was assessed by inspecting trace plots of the draws.
Model fit and complexity of models was assessed using the deviance information criterion (DIC) defined as DIC ¼ D þ p D where D is the posterior mean of the residual deviance and p D is the effective number of parameters. 30 A model with a smaller DIC was preferable to a model with a larger DIC, but differences of less than three units were not considered meaningful. When models had little difference in DIC, the simplest model was chosen.

| Results
Results from NMR, node-splitting and URM models are presented here. The results from DBT models are presented in supplementary material.

| Malaria data set NMR models
Comparing fixed-effect and random-effect NMR models (models 1a, 1b, 1c), the DICs from all NMR models variations are similar (DICs 24.93-26.76 in Table S3). Also, the estimated regression coefficients for the treatment by average age interactions were quite similar for each model variation (Table S4). Therefore, results from the simplest model to the fixed-effect NMR with common interactions (model 1c) are presented.
The results of model 1c show that there is evidence of a small interaction between relative treatment effect and average age for AR vs AS and QU vs AS; the posterior median of the common regression coefficient for AR vs AS and QU vs. AS is 0.0132 with 95% credibility interval (CrI), 0.0018-0.0244, (Table S4). There is no interaction for QU vs AR because the model fixes the coefficient to be 0. However, before using these results to draw clinical inferences, the underlying consistency assumptions must be assessed.    The results from node splitting are displayed in Table 3. In the model that assesses consistency of both the log odds ratio and the coefficient (model 2.1c), the log odds ratios for AR vs AS (−2.3540, 95% CrI, −6.7650 to 2.0530) and QU vs AS (0.4316, 95% CrI, 0.2833-0.5797) based on direct evidence differs with those from indirect evidence (ie, 0.1985, 95% CrI, −0.0815 to 0.4782, and −2.1000, 95% CrI, −6.4180 to 2.4430, respectively) because only two trials contribute direct evidence for AR vs.AS and, therefore, the results are influenced by the vague prior distribution. A similar but less pronounced inconsistency is also seen for the corresponding coefficients. Yet the prob of agreement between direct and indirect evidence is low for the coefficient for QU vs AR (P = 0.06) but not remarkably low for other comparisons or the log odds ratios (Ps 0.24-0.77). Similar conclusions are drawn from models that split either the log odds ratio or the regression coefficient only (models 2.2c FIGURE 3 Posterior distributions for the log odds ratios (centered) and regression coefficients for the interaction from fixed-effect nodesplitting models with common treatment by average age interactions for the malaria data set. Results in Figure   and 2.3c). The consistency of the direct and indirect evidence is also supported graphically in Figure 3, which displays the posterior distributions of the centered log odds ratios and regression coefficients and in Figure 4, where the log odds ratio versus average age is plotted. Table 2 also displays model fit assessment results for fixed-effect URM models with common interactions (models 3.1c, 3.2c, 3.3c). The DIC of the NMR model (DIC = 25.29) is similar to those from the URM models the assess consistency of both the log odds ratio and coefficient (DIC = 23.94) or the log odds ratio alone (DIC = 27.27) (models 3.1c and 3.2c) but is slightly higher than that from the model that assesses the coefficient alone (DIC = 21.96) (model 3.3c) indicating a possible inconsistency on a coefficient. See Table 4 for the results from the NMR model and URM models. The results from the URM models are quite similar to those from the NMR model with the exception of the regression coefficient for QU vs AR. This difference in the coefficient for QU vs AR is because of the different assumptions underlying the two models; the NMR model sets the regression coefficients for AR vs AS and QU vs AS to be identical (ie, 0.0132, 95% CrI, 0.0018-0.0244) and the coefficient for QU vs AR to be 0, whereas all three coefficients are set to be identical in the URM model (ie, 0.0145, 95% CrI, 0.0044-0.0247).

URM models
Overall, there is not only evidence of an interaction from the NMR but also evidence of inconsistency; the node-splitting models show evidence of loop inconsistency for the coefficient of QU vs AR, and the URM models support this showing a possible inconsistency of the coefficients.

| Fabricated data sets Data set 1: No interaction and consistency
The DICs from each model (models 1a, 2.1a, and 3.1a) are similar (8.01-12.00); therefore, there is no obvious sign of inconsistency (Table 5). Using the results from node splitting (model 2.1a), the log odds ratios and coefficients based on direct and indirect evidence are very similar, and the probabilities of agreement between direct and indirect evidence are practically one ( Table 6). The results from the NMR model are also similar to those from the URM model (model 3.1a) ( Table 7) indicating consistency. Overall, the NMR model does not show that a treatment by average age interaction exists (Table 7) and there is no evidence of loop inconsistency using node splitting or global inconsistency using the URM model. Figure 5, which shows the results from the NMR model and node-splitting models, supports this conclusion.

Data set 2: Interaction and consistency
The DICs from the models (models 1a, 2.1a, and 3.1a) are again similar (8.00-11.99) indicating consistent evidence (Table 5). From node splitting (model 2.1a), the log odds ratios and the coefficients based on direct and indirect evidence are almost identical, and the probabilities of agreement of direct and indirect evidence are practically one (Table 6); Figure 5 shows the results graphically.
The URM model (model 3.1a) also gives comparable results to the NMR model (Table 7). In conclusion, the NMR model shows that an interaction exists for AR vs AS (0.0200, 95% CrI, 0.0074-0.0327) and QU vs AS (0.0200, 95% CrI, 0.0080-0.0321) ( Table 7) and there is no loop inconsistency using node splitting, or global inconsistency using the URM model.

Data set 3: Interaction and inconsistency
The DIC from the NMR model (model 1a) (DIC = 47.14) is much higher than those from node splitting (model 2.1a) and the URM model (model 3.1a) (11.97-11.99) suggesting inconsistency (Table 5). From node splitting, the log odds ratios based on direct and indirect evidence are comparable, but the coefficients for AR vs AS (0.0100, 95% CrI, −0.0039 to 0.0241) and QU vs AS (0.0400, 95% CrI, 0.0298 to 0.0503) and QU vs AR (0.0000, 95% CrI, −0.0125 to 0.0126) from direct evidence differ from those from indirect evidence (ie, 0.0400, 95% CrI, 0.0237-0.0562, 0.0099, 95% CrI, −0.0088 to 0.0289, and 0.0300, 95% CrI, 0.0127-0.0474, respectively); the probabilities of agreement of direct and indirect evidence are very high (Ps 0.9982-0.9990) for the log odds ratios and very low for the coefficients (Ps 0.0057-0.0062) ( Table 6). The URM model also gives results that differ somewhat from those of the NMR model (see Table 7). To summarise, the NMR model shows that an interaction exists for AR vs AS (0.0187, 95% CrI, 0.0082-0.0292), QU vs AS (0.0335, 95% CrI, 0.0244-0.0425), and QU vs AR (0.0147, 95% CrI, 0.0047-0.0248) ( Table 7) but there is also loop inconsistency in the size of the underlying coefficients based on direct and indirect evidence that is seen using node splitting ( Figure 5); the URM model identifies global inconsistency.

Data set 4: No interaction and inconsistency
The DIC from the NMR model (model 1a) (DIC = 188.36) is much higher than those from node splitting (model 2.1a) and the URM model (model 3.1a) (11.99-12.00) indicating inconsistency (Table 5). Similar to data set 3, in node-splitting models, the log odds ratios based on direct and indirect evidence are comparable, but the coefficients for AR vs. AS (−0.0400, 95% CrI, −0.0553 to −0.0246) and QU vs. AS (0.0400, 95% CrI, 0.0273-0.0529) and QU vs AR   (0.0000, 95% CrI, −0.0115 to 0.0116) from direct evidence differ from those from indirect evidence (ie, 0.0399, 95% CrI, 0.0227-0.0574, −0.0400, 95% CrI, −0.0591 to −0.0208, and 0.0800, 95% CrI, 0.0600-0.1000, respectively); the probabilities of agreement of direct and indirect evidence are very high for log odds ratios (Ps 0.9976-1.000) and 0 for the coefficients (Table 6). Also, results from the URM model are different from those of the NMR model (see Table 7). Overall, the NMR model shows that no interaction exists (Table 7) but there is inconsistency in the direction of the underlying coefficients based on direct and indirect evidence and this trend can be seen using node splitting ( Figure 5); the URM model suggests global inconsistency respectively, but these models cannot show the underlying trend.

| DISCUSSION
We have shown that node-splitting and inconsistency models can be useful for assessing the underlying consistency assumptions of NMR when using aggregate data. Once consistency has been assessed, the analyst must decide which results to present. If the direct and indirect evidence are consistent, the results from the NMR should be reliable. However, the level of heterogeneity (from the NMR or standard pairwise analyses) and goodness of fit of the NMR should be considered when drawing conclusions from the results. If there is inconsistency, the results from the NMR are questionable and the causes of inconsistency should be considered. In some scenarios, for example, when inconsistency masks an interaction, as shown in Figure 1C,G, the results would not be useable. If the original purpose of the NMR was to explore causes of heterogeneity or inconsistency in an NMA and there is no interaction and no inconsistency masking interactions in the NMR, then analysts could proceed by exploring other potentially relative treatment effect modifying covariates or reconsidering the eligibility criteria. Each of the proposed methods has different pros and cons. DBT models assess design and loop consistency and can assess global inconsistency, while node splitting assesses loop consistency and URM models assess global inconsistency; loop inconsistency is well recognized in the methodological literature but design consistency is a newer concept. 7,11 Furthermore, the DBT model requires parameterization by the analyst; therefore, the analyst needs to have a good understanding of the model and parameters. Key advantages of the DBT model and node splitting is that IEs and the prob that direct and indirect evidence agree can be obtained; however, the URM model does not provide such results. Moreover, concerns regarding multiple testing may apply to node-splitting and the DBT models where probabilities are calculated, particularly when a Frequentist approach is taken; therefore, it is important to compare model fit statistics across models, and also to be cautious in interpreting "P values" making sure to allow for multiple testing. One disadvantage of node splitting is that, as one model is fitted for every comparison with contributing direct and indirect evidence, many models may need to be fitted, which is computationally demanding, whereas only one inconsistency model would need to be applied.
Ideally, all three approaches (ie, node-splitting model, DBT model, and URM model) would be applied to provide a thorough assessment of consistency. However, in practice, the reviewer may select their preferred approach depending on the ease of application in software etc. We recommend that at least one of the global tests (ie, inconsistency models) and also node splitting are performed. Our preference is node splitting because estimates from direct and indirect evidence can be found.
We proposed and applied methods to trial-level aggregated data in this article. However, it is straightforward to adapt the models to accommodate any type of arm-level outcome data, that is, a summary of the outcome data for each arm of each trial and a covariate value for each trial. To adapt the models, a suitable link function would be chosen and nuisance parameters are included in the model to represent the effect of the baseline treatment in Arm 1 of trial i. Further details regarding arm-level NMA models are given by Dias, Sutton, Ades, and Welton 31 Moreover, collection and use of individual patient data is generally advantageous over aggregate data when studying patient-level covariates because they avoid ecological biases. 32,33 Yet it is more common to explore patient-level covariates (eg, patient age) using study-level covariate summaries (eg, average age of patients) in metaregression such as in the malaria data set. However, when using aggregate data, the possibility of confounding and ecological biases should be considered when patientlevel covariates are explored.
There are a number of issues that can arise when applying the methods, particularly with aggregate data. Parameter estimation can be a problem with limited data, such that models cannot be fitted at all, interactions exist but cannot be detected, or inconsistency exists but is not found. For instance, when all the trials that contribute to the estimation of a regression coefficient have the same covariate value or when only one trial contributes to a coefficient, this would preclude the use of models with independent interactions, but analysts may be able to apply a model with exchangeable or common interactions providing studies that contribute to another basic coefficient that has different covariate values. For example, when exploring an interaction between relative treatment effect and study location (ie, continent), studies that contribute to results for Comparison 2 vs 1 may all be carried out on the same continent provided that studies that contribute to Comparison 3 vs 1 are located on different continents. Parameter estimation may particularly be a problem when fitting the DBT model because the IEs would be imprecise when the number of trials in one or more designs is limited; to overcome this one could assume exchangeability of the inconsistency factors or use informative prior distributions. Similarly, if direct evidence is limited for some comparisons (ie, few trials or covariate values), the URM model and node-splitting models would produce imprecise results, and informative prior distributions may need to be used. Ideally, any informative prior distributions would be evidence based by eliciting them from similar meta-analyses or experts' beliefs. Finally, it is also worth emphasising that no evidence of inconsistency does not automatically imply there is consistency; inconsistency may exist but cannot be detected when data are limited and results are imprecise, and therefore, arguably the consistency assumptions and the NMR results are questionable. In the same way, in such cases, no evidence of a treatment by covariate interaction does not imply there is truly no interaction.
Conversely, with abundant data, additional modelling extensions may be feasible. For example, in nodesplitting models, we have assumed the between-trial variance is the same for direct evidence and indirect evidence, yet it is possible to incorporate two variances, one of each type of evidence. Also, the models could be adapted to include more than one covariate or other variance structures. 34 In conclusion, consistency of the assumptions underlying NMR must be assessed when NMR is applied, even when no treatment by covariate interactions are detected. It is possible that inconsistency is masking an interaction. Furthermore, results of an NMR should not be reported without assessing the underlying assumptions to determine whether the results are valid and reliable.