A forward search algorithm for detecting extreme study effects in network meta-analysis

In a quantitative synthesis of studies via meta-analysis, it is possible that some studies provide a markedly different relative treatment effect or have a large impact on the summary estimate and/or heterogeneity. Extreme study effects (outliers) can be detected visually with forest/funnel plots and by using statistical outlying detection methods. A forward search (FS) algorithm is a common outlying diagnostic tool recently extended to meta-analysis. FS starts by fitting the assumed model to a subset of the data which is gradually incremented by adding the remaining studies according to their closeness to the postulated data-generating model. At each step of the algorithm, parameter estimates, measures of fit (residuals, likelihood contributions), and test statistics are being monitored and their sharp changes are used as an indication for outliers. In this article, we extend the FS algorithm to network meta-analysis (NMA). In NMA, visualization of outliers is more challenging due to the multivariate nature of the data and the fact that studies contribute both directly and indirectly to the network estimates. Outliers are expected to contribute not only to heterogeneity but also to inconsistency, compromising the NMA results. The FS algorithm was applied to real and artificial networks of interventions that include outliers. We developed an R package (NMAoutlier) to allow replication and dissemination of the proposed method. We conclude that the FS algorithm is a visual diagnostic tool that helps to identify studies that are a potential source of heterogeneity and inconsistency.

An outlier is defined as a study with a markedly different intervention effect estimate for a given treatment comparison. 6 A study that influences aspects of the model such as parameter estimates, heterogeneity, inconsistency is defined as an influential study. A study that is an outlier is not necessarily an influential one (eg, an extremely large effect from a very small study has little influence on the results of the model) and vice versa.
In a pairwise meta-analysis, we can visually detect extreme study effects through forest and funnel plots. Several statistical methods have been suggested to accommodate the results from outliers within a meta-analysis by allowing for flexible distributions of the random effects. Lee and Thompson argued that normality might be a restrictive assumption for the random-effects model and they provided alternative distributions with heavier tails. 7 Baker and Jackson also suggested alternative distributions that downweigh outlying studies, such as long-tailed distributions 8 and marginal distributions with additional parameters to model skewness and heavier tails. 9 A random-effects variance shift outlier model is also capable of identifying and downweighing outliers. 10 Beath proposed a method that considers a mixture of outlying and nonoutlying studies and downweighs the former. 11 Most of the outlier detection techniques are extensions of methods that have been applied to regression models. Alternative heterogeneity measures in meta-analysis have recently been proposed by Lin et al that are robust in the presence of outliers. 12 Viechtbauer and Cheung extended standard outlier deletion diagnostic measures in the context of meta-analysis 13 and included them in the R package metafor. 14,15 In NMA, the extreme study effect can be visualized with a comparison-adjusted funnel plot 16 (eg, if the study markedly differs from the others for a given treatment comparison). The effect size can be rendered as aberrant not only by its mere magnitude but also by its size conditional on the comparison of the study and/or the corresponding effect derived from indirect evidence. For example, a null effect might be aberrant if all other studies in the same comparison have large effects or if the indirect evidence for that comparison suggests a large effect. Outlying and influential studies may be responsible for large heterogeneity and/or inconsistency in NMA compromising the validity of results.
Within the Bayesian NMA framework, Lu and Ades proposed the use of residual deviance, 17 Zhang et al 18 provided four measures for the detection of outlying studies by fitting the Bayesian hierarchical NMA model, and Zhao et al 19 extended several outlier detection measures for generalized hierarchical models to detect influential and outlying studies in NMA. Within a frequentist framework, Noma et al recently provided outlier diagnostics for the NMA model using multivariate random-effects meta-regression. 20 Backward algorithms are widely used to detect outlying observations and can be potentially used in NMA. They start by removing observations according to some criterion (eg, largest residual) and stop when some other criterion is met (eg, all residuals are smaller than a threshold value). 21 The main drawback of backward methods is that in the presence of a cluster of outliers it is likely that results would be affected to such a degree that outliers will not be identified as such (masking). According to Atkinson, there are several deletion methods employed in backward methods that fail to detect outlying observations due to masking. 22 In this article, we propose a forward search (FS) algorithm to detect studies with extreme results in the NMA model. The FS algorithm was initially developed as an outlier detection tool for the estimation of covariance matrices 23 and regression models. 24,25 It was subsequently extended to standard multivariate methods, 26 factor analysis, 27 and item response theory models 28 and was recently applied in meta-regression. 29 FS starts by fitting the hypothesized data generating model to a subset of the data which is gradually incremented by adding the remaining studies according to their closeness to the postulated model. In each step of the FS algorithm, parameter estimates, measures of fit, and goodness-of-fit test statistics are monitored, and sharp changes indicate the outlying behavior of the studies or observations entering the initial subset. An R package (NMAoutlier) 30 has been developed that allows the reproduction of our results and the application of the method to other data.
The article is organized as follows: Section 2 discusses motivating examples; Section 3 discusses the random effects NMA model using graph-theoretical methods as introduced by Rücker 31 ; Section 4 outlines the methodological extension of the FS algorithm to the NMA model; Section 5 presents an application of the proposed methodology in published NMAs and simulated datasets; Section 6 discusses the main findings and provides directions for using the proposed diagnostic methodology for NMA; and Section 7 contains our conclusion.

MOTIVATING EXAMPLES
The first example comprises four interventions to aid smoking cessation. 17,32 Twenty-four studies (N = 24), including 22 two-arm trials and two three-arm trials, compared the relative effects of four smoking cessation counseling programs (n = 4): defined as no contact (A), self-help (B), individual counselling (C), and group counselling (D). The outcome was whether an individual successfully stopped smoking at 6 to 12 months (binary) and the odds ratio was used as a summary measure. The dataset with arm-level data is included in the R package netmeta 33 and the corresponding R code to calculate odds ratios with the pairwise function is provided in Appendix A1. The study-level data and the odds ratios are provided in Table A1. Figure 1 (left side) shows the comparison-adjusted funnel plot 16 with interventions within comparisons ordered according to effectiveness: (1) no contact (A), (2) self-help (B), (3) group counseling (D), and (4) individual counseling (C). We can see that studies 3 and 7 lie far away from the bulk of the data judging from the large effect sizes given their sizes. However, these deviations could be genuine or due to chance and heterogeneity. Figure 2 (left side) provides the network plot for smoking cessation data.
In the second example, Gupta and Paquet 34 compared placebo and eight active interventions (denoted as treatments 1-9) for actinic keratosis. Thirty-five studies (N = 35), including three three-arm trials, compared the relative effects of placebo and eight active interventions. The outcome was participant complete clearance or equivalent and the odds ratio was used as a summary measure. The dataset and the actual treatments are provided in Table A2. Figure 1 (right side) provides the comparison-adjusted funnel plot 16 with interventions within comparisons ordered from treatment 1 to 9. We can see that study 28 with treatment comparisons 1 vs 6 vs 8 has a large effect size given its size for the treatment comparison 1 vs 6. Figure 2 (right side) provides the network plot for the actinic keratosis data. F I G U R E 1 Comparison-adjusted funnel plot 16 for smoking cessation data (left) 17,32 and actinic keratosis 34 (right side).
Comparison-adjusted funnel plot produced in R 15 from netmeta package. 33 The y-axis provides the SE, and the x-axis provides the odds ratio centered at comparison-specific effect [Colour figure can be viewed at wileyonlinelibrary.com] F I G U R E 2 Network plot for smoking cessation data 17,32 (left side) and actinic keratosis 34 ,stand at iteration j to iteration (j − 1) Abbreviations: FS, forward search; NMA, network meta-analysis.
In the third example, Sciarretta et al 35

NMA MODEL
We use the frequentist random-effects NMA model as presented by Rücker, 31 which uses all pairwise comparisons within multiarm trials by reducing their weight in the NMA. 36 We briefly describe the approach that has been implemented in the R package netmeta 33 ; for more details see articles. 31,36,37 The notation used is summarized in Table 1. Suppose that we have N studies and each study has d i arms, i = 1, … , N. Let m denote the number of observed pair- . Let us denote with n the total number of treatments.
Let represent the vector with the n absolute treatment effects. Let y = (y 1 , y 2 , … , y m ) ′ be the vector with the observed effect sizes from the N studies and s = (s 2 1 , s 2 2 , … , s 2 m ) ′ the vector with the corresponding observed standard errors. Assuming a common heterogeneity variance 2 across pairwise comparisons, the random effects NMA model is written as where S is a block diagonal within-study variance-covariance matrix with data entries s 2 1 , … ., s 2 m in the diagonal and X is the m × n design matrix that describes the structure of the network, with rows denoting the observed pairwise comparisons and columns the treatments being compared within each comparison. 31,36 We consider the true variances to be equal to the observed sample variances, an assumption that holds when sample sizes are reasonably large. denotes a block diagonal between-study variance-covariance matrix with the heterogeneity variances 2 in the diagonal and is estimated from the data. The between-study variance is estimated using a special case of the generalized DerSimonian-Laird estimator. 38,39 Let W be a m × m diagonal weight matrix with a vector of weights in its diagonal to be the observed inverse study variance of all existing comparisons. The Laplacian n × n matrix is given by L = X ′ WX. 31,36 To estimate treatment effects, the Moore Penrose pseudoinverse n × n matrix L + of the Laplacian matrix L is constructed. 31,36 In the case of multiarm studies (d i > 2), standard errors are recalculated (increased) with a back-calculation adjustment as described in Rücker and Schwarzer, 36 s 2 adjusted,i , and new reduced weights are derived. In that case, the Laplacian matrix is given with L + = − 1 2d 2 i X ′ XVX ′ X having V to be a d i × d i symmetric matrix with the observed variances of all comparisons.
We define the vector̂of dimension n that represents the effects of the interventions and a vector̃of dimension n − 1 that represents the relative effects of the interventions to a reference treatment. The (n − 1) × (n − 1) variance-covariance matrix of̃isXL +X ′ where L + is the Moore-Penrose pseudoinverse n × n matrix of L andX is the reduced design matrix of dimensions (n − 1) × n referring to the interventions reported iñ(all but the reference one). Table 1 provides the notation for the FS algorithm in NMA.

EXTENSION OF THE FS ALGORITHM TO NMA
Most methods used for outlier detection opt to divide the data into two parts: a large clean part and the outliers. FS starts by selecting candidate subsets of likely outlier-free studies and proceeds by adding one-by-one studies until all are included. FS consists of three stages (choice of the initial subset, progression of the search, monitoring of the search).
In the first stage, FS chooses the initial subset of studies by selecting a candidate subset of likely outlier-free studies. We conventionally refer to this subset as the initial subset or the "basic" set at the beginning of the search. Studies not included in this basic set constitute the "nonbasic" set. A data generating (hypothesized) model is assumed to fit the data in the initial subset.
In the stage of progression of the search, the method gradually adds studies, one-by-one, from the nonbasic to the basic subset based on how close the study in the nonbasic set is to the hypothesized model in the basic set using some objective functions. This process is repeated until all studies are included in the basic set.
In the monitoring stage, estimated model parameters, measures of model fit, and goodness-of-fit test statistics are monitored in each step/iteration. A sharp change in the monitoring measures can be an indication of an outlying study. Moreover, ordering the studies based on how close they are to the basic set makes outlying studies more likely to be entered in the last iterations.
Below we present each step of the algorithm in detail.

Choice of the initial subset
When selecting studies for the initial subset we need to ensure that all n treatments are included and that the resulting network is connected. The requirement of network connectivity for each candidate subset of studies is evaluated with the netconnection function in the netmeta package. 33 Selecting the size l of the initial subset. The number of parameters in a NMA with n treatments is n (n − 1 relative treatment effects estimates and a single heterogeneity parameter). We require the initial subset to include all n treatments. Inclusion of the number of studies equal to the number of treatments or the number of treatments minus 1 suffices if there are only two-arm studies included. The requirement can be satisfied with fewer studies in the case of multiarm studies and for some network structures with two-arm studies (eg, consider a network of studies that compare the treatments A, B, C with study comparisons A vs B and A vs C). Large initial sets can save computation time and prevent large fluctuations in the parameter estimation during the first steps of the search, but at the same time increase the chance of including outliers in the initial subset. This is not necessarily a drawback but, in such cases, it is useful to repeat the search a couple of times from random starting points. We choose to set the size equal to the maximum of the number of treatments and 20% of the total number of studies; that is, l = max (n, 0.2 × N). Other rules can be adopted.
Selecting the studies to include in the initial subset. We start with a subset of studies that ideally is outlier-free to use as the initial subset. We consider a large number of potential sets (P) of randomly chosen initial subsets of studies each of size l. We require each chosen initial subset of studies to be a connected subnetwork including all comparative interventions. If the total number of potential subsets is not very large, we can provide an exhaustive search of all subsets of studies aiming to identify the subset that is the most likely subset to be outlier-free. Alternatively, for large networks, an exhaustive analysis is prohibitive and practically unnecessary. In such cases, we may explore a large number of initial subsets (the larger the network, the larger the number of subsets to investigate for example, 100). We can measure the fit of the NMA model for each candidate initial subset of studies using an objective function. The objective function evaluates candidate subsets and returns a measure of their fit. The better the fit of a subset, the more likely it is outlier-free. Let us denote with D l p each candidate initial subset p = 1, … , P of size l. We obtain the subset-specific estimates (̂Dl p ,̂2 D l p ) of each subset D l p and calculate the objective function median(f (y i , Examples of objective functions can be defined as the median of the absolute standardized residuals or the median of the absolute log-likelihood contributions given by the median is the weight for each comparison in each study and s 2 i is adjusted to take account of a multiarm study (d i > 2). Alternatively, we may consider the mean or some other quantile of f . We considered the median because it resembles the median least of squares regression suggested by Rousseeuw 40 (and it is a robust alternative to the classical least squares estimator) and it was also considered by Atkinson and Riani 24 in the FS development. Either way, our goal is to optimize the objective function defined.
The standardized residual of a pairwise comparison for a two-arm study is given by.
For a multiarm study, we take the arithmetic mean of the standardized residuals or the log-likelihood contributions of all pairwise comparisons in this study, that is, for standardized residuals denoting the vector of all standardized residual terms within a d i -arm study. For log-likelihood contributions, in the case of a multiarm study, we take log( Among the P candidate subsets D l p , the subset that optimizes the objective function is considered as the initial subset (eg, minimize the median of Equation 1 or maximize the median of Equation 2).

Progressing in the search
For brevity, we drop the subindex p from the initial set D l p and we denote the initial basic set with D l+j and the complementary nonbasic set with (D l+j ) c at iteration j = 1, 2, … .N − l. In the first step of the algorithm (j = 1), we calculate the objective function median(f (y i , s i , X i ,̂Dl+1,̂2 D l+1 )) for each study in the initial nonbasic set y i , s i , X i (D l+1 ) c usinĝDl+1,̂2 D l+1 estimated from the basic set D l+1 . This measures the closeness between the basic set D l+1 and each study of the nonbasic set that is a candidate for addition to the basic set. The study optimizing the objective function (the median of Equation 1 or the median of Equation 2) is added to the basic set.
We proceed with the algorithm for j = 2, … .N − l until all studies are included in the basic set. At iteration j, there are l + j studies in the enlarged basic set denoted as D l+j and N − l − j studies in the nonbasic set denoted by (D l+j ) c . For the basic set D l+j , the subset-specific estimates are denoted by (̂Dl+j,̂2 D l+j ). For each iteration j, we compute the objective function median(f (y i , s i , X i ,̂Dl+j,̂2 D l+j )), the median of absolute standardized residuals (1) or the median of absolute log-likelihood contributions (2), for each observation y i , s i , X i (D l+j ) c .

Monitoring the search
In each iteration, parameter estimates, model diagnostic statistics, ranking metrics that provide treatment hierarchy, heterogeneity, and inconsistency are monitored using a plot (forward plot). Forward plots visually convey the influence of each study.

Outlier case diagnostics measures
The standardized residual for the pairwise comparison of a two-arm study i is given bŷs tand In the case of a multiarm study with d i (> 2) arms, i = 1, … ., N, the standardized residual is calculated as an arithmetic mean of the standardized residuals of all pairwise treatment comparisons,̂s tand ,stand denoting the vector of all standardized residual terms within a d i -arm study.
To explore the impact of adding a study on summary relative treatment estimates we define modified Cook's statistics for NMA (in analogy to those described in pairwise meta-analysis 13 ) as wherẽDl+j and̃Dl+j−1 are the relative treatment estimates at iteration j, j − 1, respectively. A general rule provided in the bibliography for a cut-off value of Cook's statistic is that the study j is considered an outlier and/or influential if C j > 1. 41,42 The influence of a study can also be assessed by the change that incurs to model fitting. We can compute the ratio of the determinants of the variance-covariance matrix of relative treatment estimates at iteration j to iteration (j − 1) 13 for NMA as .
A proof showing that these definitions (Cook's distance, ratio of determinants of the variance-covariance matrix of treatment estimates) generalize the classical measures to NMA is given in Appendix B2.

Heterogeneity and inconsistency measures
Based on the fixed effects model and assuming homogeneity and consistency in the whole network, the generalized Cochran's Q statistic is given by Krahn et al 43 Q total can be decomposed into two parts 43 : • a part coming from within designs (heterogeneity between studies that compare the same set of treatments), Q het • a part coming from between designs (inconsistency between studies that compare different sets of treatments), Q inc FE where the design of a study is called the set of treatments compared within the study in the context of NMA. 2,44 For the FS procedure, we monitor generalized Cochran's Q (Q total ) and the Q statistic within designs (Q het ). Moreover, the Q statistic (Q inc ) is monitored to assess consistency under the assumption of a full design-by-treatment interaction model with random effects. 45 The assumption of consistency can also be tested by comparison of direct and indirect estimates of the relative treatment effects. 46 We monitor the z − values of disagreement between direct and indirect evidence for each comparison to derive indirect estimates. 3

Backward search
We briefly describe the backward search method, which is compared with the FS method in the examples. The backward search starts by fitting the complete network and gradually deletes studies until some criterion is met. For instance, it starts by fitting the hypothesized model to all studies, calculates an objective function given by the median of (Equation 1) or (Equation 2) (eg, median absolute standardized residuals) and the study with the worst value (maximum of the median absolute standardized residual) is deleted. We proceed until some criterion is met (eg, all absolute standardized residuals are less than 2).

ILLUSTRATIVE EXAMPLES
We study the performance of the FS in detecting outlying studies using a simulated dataset as well as two real data examples.

Simulated dataset
We simulate a single NMA dataset with n = 4 treatments (A, B, C, and D) and N = m = 8 two-arm studies (Table A3). Treatment A is chosen as the reference treatment, the true relative effects are set , and according to the assumption of consistency, that is, XY = AY − AX . We then create a study with extreme effect size that compares the treatments C and D, y 8,CD ∼ N( CD + 4SD(y), 2 8 + 2 ), where SD(y) is the sample SD of the effect sizes from the first seven studies y = (y 1,AB , … , y 7,CD ). [49][50][51] The FS is conducted using R function NMAoutlier in R package NMAoutlier. 30 The median of absolute standardized residuals and the absolute standardized residuals (Equation 1) are used for choosing the initial basic subset and for progressing in the FS, respectively. The initial basic subset was selected among P = 100 candidate subsets of size l each, equal to the number of treatments, l = max (4, 0.2 × 8) = 4 studies. The initial subset consisting of studies 1, 3, 5, and 7, gave the lowest median absolute standardized residual. Table 2 gives the steps of the FS until all studies are included in the basic set. Based on the absolute value of the residuals, the studies entered in the following order: study 6 with an absolute residual of 2.64, study 2, study 4, and finally study 8. Figure A2 (left side) in Appendix provides the forward plot of standardized residuals for each iteration produced with fwdplot(). Study 8 has a large, standardized residual compared with the other studies and, thus, was detected as outlying. The backward search was also conducted and study 8 was the only one deleted.
We also added two more studies with extreme effect sizes (studies: 9, 10) which were generated with y 9,AB ∼ N( AB + 4SD(y), 2 9 + 2 ) and y 10,CD ∼ N( CD + 6SD(y), 2 10 + 2 ) with 2 9 , 2 10 ∼ X 2 1 ∕4 restricted to the interval (0.009, 0.6). For this simulation scenario (artificial extreme studies 8, 9, and 10 included in the data), FS was conducted using the same criteria with the case only one artificially outlier was included. Study 8 entered at iteration 5, study 10 at iteration 6, and study 9 at the last iteration. Moreover, studies 8, 9, and 10 provide large, standardized residuals compared with the other studies ( Figure A2, right side) and, thus, were detected as outliers.

Application 1: Interventions to aid smoking cessation
We applied the proposed FS to the network comparing interventions to aid smoking cessation. 17,32 The corresponding R code with the NMAoutlier 30 package is provided in Appendix A2 allowing the reproducibility of results. The initial basic subset was selected among P = 100 possible subsets of size l = 5 each using the absolute residual criterion. The FS steps were completed in 27 seconds*. Table 3 summarizes which studies were part of the initial basic subset (studies: 18,21,9,20,15) and the progression steps. The FS method was completed in 20 iterations and study 3 entered in the last iteration.
Confidence intervals of summary relative treatment effects between treatments B and C broaden in the last iteration ( Figure A3) due to the estimated 2 , which increased substantially in this iteration ( Table 3). The forward plot (Figure 3, right side) shows that the ratio of variances increased rapidly in the last iteration. However, the full interaction model does not provide evidence for inconsistency (Q inc = 4.66, p = 0.7). We monitored a large increase in estimated 2 , Q het , and Q net , but a reduction in Q inc in the final iteration (Table 3); inconsistency in the whole network is masked due to the large heterogeneity.   91 (1.12, 3.28) when study 3 is not included (iteration 19 of the FS algorithm). We observed sharp changes in the monitoring statistics through the FS search for study 3 (Figure 3).
Although the overall Q inc statistic did not suggest any inconsistency in the whole network, we noticed a sharp increase in Q inc when study 1, which compares A, C, and D, enters the basic set at iteration 15 (Table 3). A sharp change in Cook's distance was detected when study 1 entered at iteration 15 (Figure 3, left-hand side). The forward plot of TA B L E 3 Initial set and progression of the FS algorithm for smoking cessation data 17,32

Iterations
Study entering Q total Q inc Q het̂2 1 18, 21, 9, 20, 15 ( z − values (Figure 4) shows that study 1 is associated with large differences between direct and indirect evidence for "A vs D" and "C vs D" comparisons (z A vs D = 1.50, z C vs D = 2.20, at iteration 15). We conclude that study 1 influences the model substantially as it is responsible for design inconsistency in "A vs D" and "C vs D" effect sizes between the two-arm and three-arm studies. We observed negligible changes in inconsistency measures when the other three-arm study, study 2 with treatment arms B, C, and D, entered (iteration 17). This agrees with the conclusion given by Higgins et al 2 that there is a design inconsistency in effect sizes between two-arm and three-arm studies.
The changes incurred by studies 1 and 3 in the monitoring measures differ substantially from the changes incurred in the FS process by the other studies in the smoking cessation data. We also conducted a backward search method which completed within one iteration by deleting study 3. Study 1 was not identified as an aberrant study by backward methods. This gives a nice example of how the aberrant studies can be identified even if they do not have an extreme effect size or do not enter in the last iterations of the FS. It is common practice in the FS literature to check the robustness of results by repeating the FS search from random starting points (initial subsets). We repeated the FS 100 times from random starting points using P = 1 for each run. During monitoring, we noticed that study 3 entered in the last iteration of the FS 82 times, it was included in the initial subset 15 times and entered in an intermediate iteration 3 times. In these three instances, we noticed sharp changes in the monitoring measures when study 3 entered the search. When study 3 is included in the initial subset, we observed peculiar patterns in the monitoring statistics (such as the heterogeneity estimator) in the FS procedure. For example, the estimated heterogeneity for the initial subset was large and was subsequently reduced as the FS progresses ( Figure A4, left side). Moreover, Figure A4 right side shows that the standardized residual for study 3 decreased and got far away as other studies entered the search. For completeness, we employed variations of the FS algorithm (different methods for selecting the initial subset, progressing and statistics monitored) but all methods led to the same conclusions. In addition, repeating the FS whilst including study 1 in the initial set did not affect the outlying diagnosis for study 3.

Application 2: Interventions for actinic keratosis
The FS is also applied to the network of 35 studies for actinic keratosis. 34 The design-by-treatment interaction model (Q inc = 23.05, df = 7, p = 0.001) showed statistically significant inconsistency. The between-designs Q inc statistic indicated that the dataset provides evidence of consistency when the design including treatments 1 vs 6 vs 8 (observed only in study 28) was detached (Q inc = 10.18, df = 5, p = 0.07) ( Table A4). The initial subset was selected among P = 100 subsets of size l = 9 each using the smallest absolute residual criterion. The FS was completed after 27 iterations at 59 seconds. A sharp increase in the Q inc statistic (from 3.68 to 23.05) occurred when study 28 entered in the last iteration ( Figure A5) indicating that study 28 is a potential source of inconsistency. Sharp changes occurred in the forward plots for Cook's distance and the ratio of variances when study 28 entered the search ( Figure A6). After removing study 28 from the dataset, the design-by-treatment interaction model (Q inc = 3.68, df = 5, p = 0.59) indicated no statistically significant inconsistency. The FS and the backward search led to different conclusions this time. The backward search removed studies 24, 23, 22, and 21 in turn until all included trials had absolute standardized residuals less than 2. This is an example of a case where forward and backward methods give different results. Study 28 is mainly responsible for inconsistency and a study with effects different than those estimated indirectly for the respective comparisons does not necessarily have a large residual and cannot, therefore, be detected by backward methods.

Application 3: Antihypertensive strategies for heart failure
We applied the FS method to the network of 26 studies comparing antihypertensive strategies for heart failure. 35 Figure A1, right side). Therefore, studies 23 and 26 have an impact as they increase the variance and influence the model parameters by giving less precise results.

DISCUSSION
In NMA, there is a lack of visual tools to identify extreme study effects. Overall, the FS provides a practical set of visual diagnostic tools that can help us not only identify outliers but also those studies responsible for differences between direct and indirect estimates. For transparency and reproducibility purposes, we developed the R package NMAoutlier. 30 Many decisions are required to apply the algorithm: the size of the initial subset, the number of initial subsets to be examined, criteria for choosing the initial subset, criteria of progressing in the search, and the statistics to be monitored. We ran many FSs for the examples presented in the article, using several combinations of the methods available, and we found results to be robust. Three common criticisms to the method are: (i) why not employ the more popular backward selection methods, (ii) what happens if an outlier is included in the initial subset or does not enter in the last iterations, and (iii) how do we know a change in the monitored statistics is not due to chance?
Regarding the first criticism, backward methods are known to behave poorly in the presence of multiple outliers that may affect mean values to such a degree that they do not seem to be outliers anymore (masking effect). In the meta-analysis literature, a common strategy is to exclude all studies, one at a time, and see the impact on results (parameter estimates, heterogeneity, inconsistency, and so on). Hence, the computational burden is bigger than the FSs, the method is sensitive to masking, and it is not known whether monitored changes are due to chance or not. In addition, we saw in the actinic keratosis example that the backward method failed to identify the study responsible for the inconsistency in the network. Deletion diagnostics that are not based on residual values could have potentially located the problematic study, but it is overall time-consuming and complicated to apply several deletion strategies. The FS provides a breadth of information regarding the structure of the data and the impact of each study on various aspects of the NMA model. A careful investigation of the search and its repetition from random starting points can help identify atypical studies irrespectively of the stage at which they enter the search.
Regarding the second criticism, we argue that even if outlying studies do not enter towards the last iterations of the search, we may be still able to spot them through sharp changes in the forward plots of estimated heterogeneity and standardized residuals. For example, when an outlier enters the initial subset, it is common to start with large heterogeneity estimates that gradually decrease. Another benchmarking technique we can use concerning this criticism is to run the FS several times from random starting points as a sensitivity analysis. The running time for the FS algorithm depends on how large the network and the initial subset are. We also suggest that, once we consider some studies to be aberrant, the FS should be repeated but this time forcing the aberrant studies to be included in the initial subset. When outliers are included in the initial subset, it is typical to see some "undesirable" statistics at the beginning (eg, large heterogeneity/inconsistency) that then improve as studies are included in the "basic" set.
The most serious concern is the third: how do we know that changes in the forward plots are not due to chance? One method suggested in the literature is the construction of simulation envelopes that give, at each iteration of the FS, the bands within which we expect statistics to lie if there are no outliers in the data. To construct these bands through simulation, we would need to run the FS hundreds of times, increasing the computational time. There is currently a lot of work in constructing these bands without simulation and we aim to equip the FS for NMA with this capability in the future. Johansen et al provided the asymptotic distribution of scaled forward residuals 52 offered in R package ForwardSearch. 53 Of course, the same problems apply also to the backward methods.
A challenging issue is to delineate the relationship of outlying studies with heterogeneity and/or inconsistency. Outliers may cause heterogeneity/inconsistency, but they can also be masked by them. If a comparison is not informed by both direct and indirect evidence, then it is judged merely by the magnitude of the reported effect sizes.
Throughout this article, we have focused on detecting outliers at the study level. Studies give aggregate measures, which may have been influenced by the presence of outliers or data extraction errors within the study. The FS algorithm can also be extended to meta-analysis with multiple outcomes, a meta-analysis of diagnostic accuracy studies, or individual participants' data meta-analysis.

Recommendations
The proposed method aims to detect outlying and influential studies and should be used cautiously, recognizing that it is a diagnostic method and not to be used for throwing out studies depending on results. We give some guidance on how to interpret results from applying the FS algorithm.
Make a priori decisions for technical aspects of the FS algorithm. Technical aspects of the FS algorithm should be defined in advance. These are: the size of the initial subset, the number of initial subsets P to be examined, criteria for choosing the initial subset, criteria of progressing in the search, and the statistics to be monitored.
Run the FS algorithm and monitor the forward plots. Proceed in applying the FS algorithm with the a priori decisions taken. NMAoutlier package has sensible defaults for some of these technical settings (these are: the number of initial subsets to be P = 100, the median of the absolute standardized residuals for the criteria for choosing the initial subset and the criteria of progressing in the search). The main results for the FS algorithm are the ordering of the studies that enter the search and several monitoring statistics. Depending on the technical aspects of the search, it is likely that the method can produce different output results. But even if we proceed with the same criteria (those for choosing the initial subset, criteria of progressing in the search), FS can provide different ordering of the studies entering at each iteration and accordingly different results of the monitoring measures in each run. We have observed that in most cases outliers enter at the last iterations. However, there are exceptions and outliers may enter at any iteration (even included in the initial subset). By monitoring various statistics (eg, Cook distance, the ratio of variance, residuals, Q statistics, model parameters), we can look for sharp changes or other patterns that will be indicative of the same potential outliers at any FS run at any step of the search. We can also look at forward plots of residual values. If an outlier enters early in the search or is included in the initial subset, we expect that its residual values will keep increasing as studies that are very different (and not outlying) enter the search. We also expect that heterogeneity will keep decreasing in such a case.
Run the FS algorithm from different random starting points and compare the results. We suggest rerunning the FS 5 to 10 times from random starting points (initial subsets) to explore the robustness of the ordering and to avoid results driven by chance. Even when outliers are included in the initial subset, it is typical to see large changes in statistics at the beginning of the search (eg, large heterogeneity/inconsistency) that then improve as studies are included in the "basic" set.
Run the FS algorithm using different criteria. We suggest rerunning the FS with smaller or larger initial sample sizes or using a different method for progressing in the search to explore the robustness of results.
Have a closer look at potentially outlying studies that the FS algorithm identified and interpret the results. There is not a definite "decision pathway" for the interpretation of the results; instead, one should critically review the potentially outlying studies as indicated by the FS algorithm and have a look at the original data and/or the graphical tools (eg, comparison-adjusted funnel plot). One should think of a broad range of possible explanations, from a possible data extraction error (for data or eligibility) to factors that could introduce heterogeneity or inconsistency. We can further use inconsistency diagnostics (eg, node/side-splitting or net heat plot) and explore whether outliers are responsible for inconsistency.

CONCLUSION
In conclusion, we argue that the method can be employed as a diagnostic tool to provide a comprehensive outlier detection analysis as it can offer information about the data and detect extreme study effects responsible for heterogeneity and inconsistency.