Comparison and validation of metadta for meta-analysis of diagnostic test accuracy studies

We developed metadta, a flexible, robust, and user-friendly statistical procedure that fuses established and innovative statistical methods for meta-analysis, meta-regression, and network meta-analysis of diagnostic test accuracy studies in Stata. Using data from published meta-analyses, we validate metadta by comparing and contrasting its features and output with those of popular procedures dedicated to the meta-analysis of diagnostic test accuracy studies (midas [Stata], metandi [Stata], MetaDTA [web application], mada [R], and MetaDAS [SAS]). We also demonstrate how to perform network meta-analysis with metadta, a task for which no other dedicated procedure exists in the frequentist framework. metadta generated consistent estimates in simple and complex diagnostic test accuracy data sets. We expect its availability to stimulate better statistical practice in the evidence synthesis of diagnostic test accuracy studies.


What is New
We validated metadta by comparing and contrasting its functionality, robustness, accuracy, and presentation of results with existing procedures: mada, MetaDTA, metandi, midas, and MetaDAS. metadta replicated analyses performed by the existing procedures. We demonstrated metadta's ability to perform network meta-analysis of DTA studies without imputation of data, for which no alternative procedure in Stata exists.

Potential Impact
The merits of metadta are efficient operation with minimal user intervention, flexibility to deal with complex data sets, and enhanced presentation of the results in rich reports and graphics. We expect its availability to the scientific community to stimulate better practice in the meta-analysis, network meta-analysis, and meta-regression of DTA.

| INTRODUCTION
The bivariate random-effects model for meta-analysis (BRMA), 1 a generalized linear mixed model (GLMM), 2 has been recommended 3 for meta-analysis of diagnostic test accuracy (DTA) studies. Fitting GLMMs properly and processing the model estimates is a relatively complex task for applied statisticians, general researchers, and systematic reviewers.
Several statistical procedures are dedicated to the meta-analysis of DTA using the BRMA. However, they have limited capabilities and functionality: MetaDTA, 4 metandi, 5 and midas 6 synthesize data from one diagnostic test, and MetaDAS 7 and mada 8 accept one covariate in meta-regression. Furthermore, the default options in some of these procedures could be more robust for them to be used reliably with minimal effort.
There are many studies where more than two tests are evaluated simultaneously. Network meta-analysis (NMA) extends and improves the models in a conventional meta-analysis by accounting for the potential correlation between the multiple diagnostic tests assessed in a given study. There are two classes of models for diagnostic test accuracy network meta-analysis (DTA-NMA): contrast-based (CB) and arm-based (AB). They mainly differ in the basic parameters being modeled.
Most models for DTA-NMA were developed in the Bayesian setting. 9 A few models have been translated into general statistical procedures that can also be applied to DTA-NMA within the frequentist setting. The mixmeta 10 procedure offers a flexible platform for meta-analysis based on linear mixed-effects models (LMM) in R. mvmeta 11,12 is another procedure in R and Stata implementing a generic multivariate normal meta-analysis model. However, no procedures dedicated to DTA-NMA in the frequentist setting based on the GLMM exist.
We built further on the BRMA and developed metadta, 13,14 a procedure in Stata, to efficiently perform simple and advanced statistical tasks in synthesizing DTA data.
Our aim in this paper is to compare and contrast metadta's features and output with existing procedures dedicated to the meta-analysis of DTA studies. We also demonstrate the use of metadta in DTA-NMA.
The rest of the paper is organized as follows. To set the stage, we start with a discussion of the different data structures in the meta-analysis of DTA studies. This is followed by a brief description of the features of metadta. We then present the key properties of the existing statistical procedures used to validate metadta. Afterward, we present the output from applying metadta and the other procedures to previously published meta-analyses. The last section concludes with some discussion.
Procedures developed for meta-analysis of DTA studies in the Bayesian framework are outside the scope of this paper.

| DATA STRUCTURES IN A META-ANALYSIS OF DTA STUDIES
Data from DTA studies result from a 2 × 2 cross-tabulation of the index test versus the reference standard results. The data in the four cells represent the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sum of TP and FN represents the number with a condition, and the sum of TN and FP represents the number without the condition.
This information is stored in a data set with at least five columns. Four columns contain data on each of the four cells. A fifth column contains the study identifier.
When present, the information on potential sources of heterogeneity is stored in additional columns.
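The data layout described above can be illustrated with a minimal sketch. The class name, field names, and the counts are hypothetical and serve only to show how one row of such a data set maps to the number with and without the condition and to the study-specific sensitivity and specificity.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """One row of a DTA data set: a study identifier plus the four cell counts."""
    study: str
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def with_condition(self) -> int:
        # TP + FN: participants who have the condition
        return self.tp + self.fn

    @property
    def without_condition(self) -> int:
        # TN + FP: participants who do not have the condition
        return self.tn + self.fp

    def sensitivity(self) -> float:
        return self.tp / self.with_condition

    def specificity(self) -> float:
        return self.tn / self.without_condition


# Hypothetical counts, not taken from any of the re-analyzed reviews.
row = TwoByTwo(study="Hypothetical 2020", tp=45, fp=10, fn=5, tn=90)
print(row.study, row.sensitivity(), row.specificity())
```

Columns holding potential sources of heterogeneity would simply be added as further fields of the row.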
We distinguish five important terms based on the data format required by metadta and the type of inference: general meta-analysis, comparative meta-analysis, AB-NMA, CB-NMA, and stratified meta-analysis.
There are several flavors of general meta-analysis. The classic and the most basic meta-analysis uses data from independent studies to assess the diagnostic accuracy of a single test. In this case, each study contributes one row to the data set. The accuracy of the test can also be evaluated in different patient populations or at different threshold values. In this case, a study can contribute more than one row to the data set. The different results define a cluster of "repeated measurements" nested in a given study. In this setting, it is also possible to compare multiple tests even if the included studies evaluated only one of the index tests.
In this paper, we use the term "comparative meta-analysis" specifically to refer to the situation where two diagnostic tests (or methods, etc.) are evaluated in each study. Each study contributes two rows to the data set if the data are entered in the long format. The pairing of the data is essential in computing the study-specific relative sensitivity and relative specificity.
To perform an NMA, there need to be at least three competing tests evaluated. To fit an AB-NMA model, the data for each competing test are entered in a long format. Each study contributes at least two rows to the data set. To fit a CB-NMA model, the data are entered in the wide format: four columns for the index/candidate test and the following four columns for the comparator test. Here, each study contributes at least one row to the data set for each index versus comparator test comparison.
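The difference between the long and the wide format can be sketched as follows. The column names (`tp1`, `tp2`, and so on), the test labels, and the counts are assumptions made for illustration; they are not the variable names that metadta requires.

```python
from collections import defaultdict

# Hypothetical long-format data: one row per test per study.
long_rows = [
    {"study": "A", "test": "index", "tp": 40, "fp": 8, "fn": 10, "tn": 92},
    {"study": "A", "test": "comparator", "tp": 30, "fp": 4, "fn": 20, "tn": 96},
    {"study": "B", "test": "index", "tp": 18, "fp": 5, "fn": 2, "tn": 75},
    {"study": "B", "test": "comparator", "tp": 15, "fp": 3, "fn": 5, "tn": 77},
]

CELLS = ("tp", "fp", "fn", "tn")


def long_to_wide(rows, index="index", comparator="comparator"):
    """Pair the two rows of each study into one wide row (index cells, then comparator cells)."""
    by_study = defaultdict(dict)
    for row in rows:
        by_study[row["study"]][row["test"]] = row
    wide = []
    for study, tests in by_study.items():
        if index not in tests or comparator not in tests:
            continue  # an unpaired study cannot contribute a contrast
        record = {"study": study}
        for cell in CELLS:
            record[cell + "1"] = tests[index][cell]       # index test
            record[cell + "2"] = tests[comparator][cell]  # comparator test
        wide.append(record)
    return wide


def relative_sensitivity(record):
    """Study-specific sensitivity of the index test divided by that of the comparator."""
    sens_index = record["tp1"] / (record["tp1"] + record["fn1"])
    sens_comparator = record["tp2"] / (record["tp2"] + record["fn2"])
    return sens_index / sens_comparator


wide_rows = long_to_wide(long_rows)
```

The sketch also shows why the pairing matters: the study-specific relative sensitivity can only be formed once both tests from the same study sit in one record.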
When separate models are fitted to each subgroup of studies, for example, one model per test, we refer to this as stratified meta-analysis.

| FEATURES OF METADTA
By default, metadta fits a logistic-normal regression model (a modified or extended BRMA). Whenever there are fewer than three studies in an analysis, the procedure fits the logistic regression model by leaving out the random components from the predictor equation.
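For orientation, the standard BRMA without covariates can be written as below. The notation is ours and is only a sketch of the usual binomial-normal formulation; the exact specification used by metadta, including its extensions, is the one given in the Supporting Information. Here $n_{i1} = TP_i + FN_i$ and $n_{i2} = TN_i + FP_i$ for study $i$.

```latex
\begin{aligned}
TP_i \mid b_{i1} &\sim \mathrm{Binomial}(n_{i1},\, Se_i), &
\operatorname{logit}(Se_i) &= \mu_1 + b_{i1}, \\
TN_i \mid b_{i2} &\sim \mathrm{Binomial}(n_{i2},\, Sp_i), &
\operatorname{logit}(Sp_i) &= \mu_2 + b_{i2}, \\
\begin{pmatrix} b_{i1} \\ b_{i2} \end{pmatrix} &\sim
N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} \tau_1^2 & \rho\,\tau_1\tau_2 \\ \rho\,\tau_1\tau_2 & \tau_2^2 \end{pmatrix} \right).
\end{aligned}
```

Leaving out the random components, as is done with fewer than three studies, corresponds to setting $b_{i1} = b_{i2} = 0$, which reduces the model to two ordinary logistic regressions with intercepts $\mu_1$ and $\mu_2$.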
The unstructured variance-covariance structure is typically imposed on the BRMA to quantify the unexplained between-study variation and correlation. metadta also allows other restrictive but reasonable variance-covariance structures: identity, exchangeable, and independence.
The procedure also allows stratified meta-analysis. In meta-regression, both categorical and continuous study-level covariates are permitted. There is no limit on the number of covariates allowed, but only the interaction between the first covariate and the rest is permitted.
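The four structures can be sketched for the 2 × 2 between-study matrix of logit sensitivity and logit specificity. The definitions below follow the usual mixed-model conventions (for example, those of Stata's mixed-effects commands); we assume, but have not verified here, that metadta attaches the same meaning to these names. The function and argument names are hypothetical.

```python
def covariance_matrix(structure, var_sens, var_spec=None, cov=0.0):
    """Return the 2 x 2 between-study variance-covariance matrix for a named structure."""
    if structure == "unstructured":
        # distinct variances and a free covariance
        return [[var_sens, cov], [cov, var_spec]]
    if structure == "independence":
        # distinct variances, covariance fixed at zero
        return [[var_sens, 0.0], [0.0, var_spec]]
    if structure == "exchangeable":
        # one common variance and a free covariance
        return [[var_sens, cov], [cov, var_sens]]
    if structure == "identity":
        # one common variance, covariance fixed at zero
        return [[var_sens, 0.0], [0.0, var_sens]]
    raise ValueError(f"unknown structure: {structure}")


# Number of variance-covariance parameters each structure estimates.
FREE_PARAMETERS = {
    "unstructured": 3,
    "independence": 2,
    "exchangeable": 2,
    "identity": 1,
}
```

The more restrictive structures estimate fewer parameters, which is what makes them attractive when the number of studies is small.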
The implemented AB-NMA model is a frequentist translation of the model by Nyaga, Arbyn and Aerts 15 developed in a Bayesian setting. This parameterization of the AB-NMA model is logical and unconstrained. Furthermore, obtaining the marginal absolute and relative sensitivity and specificity as functions of the model parameters is easy.
When the competing tests are a set of index/candidate tests and a set of equivalent comparator tests, the procedure can switch to CB-NMA. Here, we propose to model the differences in sensitivity and specificity between the index tests, a common fixed/random relative sensitivity and specificity of the index test relative to the comparator test, and the sensitivity and specificity of the comparator test, all in the logit scale. The CB-NMA model fitted by metadta differs from previously proposed CB-NMA models. 16,17 The proposed model specification and a description of how it differs from other CB-NMA models can be found in the Supporting Information.
After fitting the specified regression model, metadta computes the summary absolute and relative sensitivity and specificity using marginal standardization. 18-20 The results are presented in tables, a forest plot, and SROC or crosshair plots.
The SROC plot displays the SROC curve and the confidence and prediction regions. Cross-hairs are displayed to indicate the confidence intervals (CI) of the summary point estimate of the sensitivity and specificity. They are shown whenever the confidence region cannot be computed due to insufficient data, that is, when there are fewer than three studies. If the BRMA model is fitted without covariates, the I² statistic by Zhou and Dendukuri 21 is computed and presented. We will refer to this statistic as Zhou I². The statistic accounts appropriately for the variability of the within-study variance across studies, the mean-variance relationship across studies, and the correlation between sensitivity and specificity.
If data were entered in the wide format or the covariate of interest only has two levels and data were entered in the long format, metadta can produce a forest plot of the study-specific relative sensitivity and specificity. In an AB-NMA, the procedure can generate a forest plot of the relative sensitivity and relative specificity of each index test relative to a specified baseline index test, which is automatically assigned by the procedure or explicitly specified by the analyst.
In meta-regression, model comparison is performed by leaving out some (one covariate or some interaction terms) and all terms from the model. Once the model fitting is completed, metadta stores the model estimates for further hypothesis testing and model selection using likelihood-ratio tests and Akaike information criteria (AIC) or Bayesian information criteria.
metadta was developed for users of Stata 14 onward. The computational engines that power the random-effects and the fixed-effects models are the native Stata commands meqrlogit and binreg, respectively. meqrlogit uses the QR decomposition (decomposition of a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R) of the variance components. It implements the Newton-Raphson algorithm (a second-order optimization [SDO]) as the default algorithm to maximize the likelihood function. This algorithm uses the Hessian matrix rather than its approximation. The two alternatives are both first-order optimization (FDO) algorithms. In binreg, the likelihood function is also maximized using the Newton-Raphson algorithm by default. The alternatives are an SDO and two FDO algorithms.
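To make the notion of a second-order optimization concrete, the sketch below applies Newton-Raphson to the simplest possible case: a fixed-effects binomial model with a single logit-scale intercept. It is an illustration only, not the code used by meqrlogit or binreg; the function name and the counts are hypothetical.

```python
import math


def newton_raphson_logit(events, totals, tol=1e-10, max_iter=50):
    """Maximize the binomial log-likelihood of a common logit-scale intercept."""
    y = sum(events)   # total number of events across studies
    n = sum(totals)   # total number of participants across studies
    theta = 0.0       # starting value on the logit scale
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-theta))
        gradient = y - n * p            # first derivative of the log-likelihood
        hessian = -n * p * (1.0 - p)    # exact second derivative, not an approximation
        step = gradient / hessian
        theta -= step                   # Newton-Raphson update
        if abs(step) < tol:
            break
    return theta


# Hypothetical true positives and numbers with the condition in three studies.
theta_hat = newton_raphson_logit(events=[45, 30, 18], totals=[50, 40, 20])
pooled_sensitivity = 1.0 / (1.0 + math.exp(-theta_hat))
```

In this special case the maximum has a closed form, the logit of the pooled proportion, which gives a convenient check on the iterations. First-order methods would replace the exact `hessian` with an approximation built from gradients.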
A detailed description of the methods implemented in metadta may be found in the Supporting Information. Table 1 summarizes the key properties of the five statistical packages in this comparison.

| EXISTING PACKAGES DEDICATED TO THE META-ANALYSIS OF DTA STUDIES
MetaDAS 7 fits the hierarchical summary receiver operating characteristic (HSROC) model 22 and the BRMA model in SAS. The HSROC model focuses on estimation of an SROC curve across different test thresholds. The procedure allows stratified analysis and meta-regression with one covariate. The outputs are tables of the parameter estimates, and summary estimates of sensitivity and specificity along with their 95% CIs. Neither the forest plot nor the SROC plot is generated. In meta-regression, the summary estimates of relative sensitivity and specificity are also presented. MetaDAS is powered by PROC NLMIXED, which implements seven different optimization techniques. The quasi-Newton (FDO) algorithm is the default. The alternative algorithms are two FDOs, three SDOs, and a derivative-free optimization (DFO).
MetaDTA 4 is an interactive web application that fits the BRMA. The application displays a table of the parameter estimates, a forest plot, and the SROC plot. It can perform sensitivity analysis to explore the effect of outlying studies but not meta-regression. When provided, the covariate information is used for graphical esthetics only. MetaDTA is powered by lme4, 23 which implements a DFO algorithm.
mada 8 is an R procedure that fits the bivariate normal-normal model by Reitsma et al., 24 the predecessor to the BRMA. Its implementation models logit sensitivity and logit(1 − specificity). The normal-normal model requires a continuity correction when any TP, FP, FN, or TN is zero. Unlike the BRMA, the likelihood function of the normal-normal model has a closed-form solution and hence does not need to be approximated. The outputs of the procedure are tables of the model coefficients, a forest plot, and an SROC plot. The SROC plot does not have the prediction region. The procedure allows one covariate in meta-regression and stores the estimates to facilitate model comparison. mada's computational engine, mixmeta, 10 has implemented a quasi-Newton algorithm (an FDO); the penalized iteratively reweighted least squares algorithm. 25
meta4diag 26 and bamdit 27 also fit the BRMA model in R in the Bayesian setting, which is beyond the scope of this paper.
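The need for a continuity correction in the normal-normal model can be seen from the study-level quantities it works with. The sketch below is a generic illustration and not mada's code; the function name, the `always` switch, and the counts are hypothetical. It uses the usual large-sample variance of an empirical logit, 1/events + 1/non-events.

```python
import math


def logit_with_correction(events, non_events, correction=0.5, always=True):
    """Return the empirical logit and its approximate variance.

    With always=True the correction is added to both cells regardless of zeros;
    with always=False it is added only when one of the two cells is zero.
    """
    if always or events == 0 or non_events == 0:
        events += correction
        non_events += correction
    logit = math.log(events / non_events)
    variance = 1.0 / events + 1.0 / non_events
    return logit, variance


# Hypothetical study with no false negatives: the uncorrected logit sensitivity
# log(20 / 0) is undefined, so the normal-normal model cannot use it as is.
logit_sens, var_sens = logit_with_correction(events=20, non_events=0)

# logit(1 - specificity) uses false positives as events and true negatives as non-events.
logit_fpr, var_fpr = logit_with_correction(events=3, non_events=77)
```

The BRMA avoids this step because the binomial likelihood is defined for zero counts.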
The computational engines in MetaDAS (sometimes), metadta, and mada use the Cholesky parameterization of the BRMA to ensure that the negative of the Hessian or the approximate Hessian is well-conditioned. mada uses the QR decomposition of the variance components.
In Stata, the metandi 5 procedure fits the BRMA. It displays a table of the summary diagnostic measures and an SROC plot. The computational engine in metandi in Stata 8 and 9 is gllamm, 28 which implements a modified Newton-Raphson algorithm (SDO). A second computational engine, xtmelogit, was introduced in Stata 10. In Stata 10 and above, the default engine is xtmelogit. In Stata 13, the xtmelogit procedure was renamed to meqrlogit. As of Stata 16, meqrlogit is no longer an official part of Stata but continues to work.
midas 6 is another Stata procedure implementing the BRMA. It generates more graphical output to explore the goodness of fit, publication, and other precision-related biases. The analysis outputs are tables, a forest plot, and an SROC plot. The procedure allows for univariate and univariable meta-regression and quantification of heterogeneity using the I² statistic by Higgins et al. 29 We will refer to this statistic as Higgins I². Each covariate provided to the procedure should be continuous and mean-centered or otherwise coded 0/1. The computational engine in midas is xtmelogit.
We chose to use this data set because it has been used elsewhere 24 to demonstrate convergence issues in the BRMA. Non-convergence is a known issue with the BRMA when the between-study correlation is on the boundary of its parameter space, as data become sparse (sensitivity or specificity close/equal to 0/1), or as the within-study variation increases. 31
The parameter estimates from the six procedures are presented in Table 2. The correlation parameter was estimated as −1. Using the QR decomposition of the variance-covariance matrix may aid in convergence when the between-study correlation is on the boundary of its parameter space. 10 This may explain why metadta and midas had no problem generating the results. Except for mada, the other procedures had difficulties generating results with the default options.
At first, metandi failed to yield results because "Hessian became unstable or asymmetric". Specifying a small number of quadrature integration points also leads to convergence problems and unreliable estimates of the between-study variances. Higher values should result in greater accuracy. 32 By increasing the quadrature integration points from 5 (default) to 7 (default in metadta), we obtained results from metandi.
MetaDAS also failed to yield results with the default options. The procedure selects the number of quadrature points adaptively by evaluating the log-likelihood function at the starting values of the parameters until two successive evaluations have a relative difference less than 0.0001 (the default). In our experience, MetaDAS often uses fewer than 7 quadrature points. Even after increasing the quadrature points to 7, MetaDAS could not produce any output with the default optimization algorithm. All other optimization algorithms failed except the Nelder-Mead simplex optimization, a DFO algorithm.
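The role of the quadrature points discussed above can be illustrated with a one-dimensional, non-adaptive Gauss-Hermite sketch of the marginal likelihood contribution of a single study. This is a simplification under stated assumptions: the procedures compared here integrate over two correlated random effects and may use adaptive quadrature, and the function name and input values are hypothetical. The nodes and weights are the standard tabulated Gauss-Hermite values for the weight function exp(−x²).

```python
import math

# Standard Gauss-Hermite nodes and weights for 5 and 7 points.
GAUSS_HERMITE = {
    5: (
        [-2.020182870456086, -0.958572464613819, 0.0,
         0.958572464613819, 2.020182870456086],
        [0.019953242059046, 0.393619323152241, 0.945308720482942,
         0.393619323152241, 0.019953242059046],
    ),
    7: (
        [-2.651961356835233, -1.673551628767471, -0.816287882858965, 0.0,
         0.816287882858965, 1.673551628767471, 2.651961356835233],
        [0.000971781245100, 0.054515582819127, 0.425607252610128,
         0.810264617556807,
         0.425607252610128, 0.054515582819127, 0.000971781245100],
    ),
}


def binomial_pmf(y, n, p):
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)


def marginal_likelihood(y, n, mu, tau, points):
    """Approximate the integral of Binomial(y | n, expit(mu + b)) over b ~ N(0, tau^2)."""
    nodes, weights = GAUSS_HERMITE[points]
    total = 0.0
    for x, w in zip(nodes, weights):
        b = math.sqrt(2.0) * tau * x  # change of variable to the N(0, tau^2) scale
        p = 1.0 / (1.0 + math.exp(-(mu + b)))
        total += w * binomial_pmf(y, n, p)
    return total / math.sqrt(math.pi)


# Hypothetical sparse study (specificity close to 1) with a large between-study variance.
approx_5 = marginal_likelihood(y=98, n=100, mu=2.5, tau=1.8, points=5)
approx_7 = marginal_likelihood(y=98, n=100, mu=2.5, tau=1.8, points=7)
```

Comparing `approx_5` with `approx_7` for sparse data and a large `tau` gives a feel for how sensitive the approximated likelihood, and therefore the optimizer, can be to the number of integration points.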
The summary sensitivity and specificity estimates and the 95% CIs from metadta, metandi, and midas were similar. MetaDTA had similar central estimates but very small standard errors for the logit sensitivity and logit specificity. MetaDAS had comparable summary sensitivity and specificity estimates but huge standard errors for the logit specificity. The parameter estimates from mada differed from the other procedures but were similar to those from the original meta-analysis. The estimates for the between-study variance from metadta, metandi, and MetaDTA were similar. MetaDAS reported the largest estimates for the between-study variance parameters.
The between-study variance estimates are difficult to interpret because they vary from 0 to ∞ and are estimated on the log odds scale. The I² is a more interpretable statistic. It indicates the percentage of the total variability in logit sensitivity and logit specificity which is attributable to heterogeneity between studies.
According to metadta, there was more heterogeneity in logit specificity (τ² = 3.31, Zhou I² = 60.30%) than in logit sensitivity (τ² = 0.18, Zhou I² = 50.62%). Despite the substantial heterogeneity in both dimensions, the bivariate Zhou I² was 0.01%. A low bivariate I² is expected because the generalized between-study variance in logit sensitivity and logit specificity goes to zero with (nearly) perfect correlation. 21
midas computes the I² for the diagnostic odds ratio based on the normal-normal model. This specification tends to underestimate the expected value of the within-study variance in binomial data, resulting in high values of I². This could lead to an incorrect conclusion of very high heterogeneity. 21 This may explain why the Higgins I² from midas (98% [95% CI: 96, 99]) was very high compared to that from metadta.
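The mechanism by which underestimated within-study variances inflate the Higgins I² can be seen from its definition, I² = max(0, (Q − df)/Q), where Q is Cochran's heterogeneity statistic computed with inverse-variance weights. The sketch below implements this univariate definition only; it is not the Zhou I², and it is not the code used by midas. The effects and variances are hypothetical.

```python
def higgins_i2(effects, variances):
    """Higgins I-squared (in percent) from study effects and within-study variances."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical log diagnostic odds ratios from two studies.
effects = [0.0, 2.0]
i2_reference = higgins_i2(effects, variances=[0.10, 0.10])
# Halving the within-study variances, as an underestimate would, raises I-squared.
i2_underestimated = higgins_i2(effects, variances=[0.05, 0.05])
```

Because the within-study variances enter Q through the weights, any systematic underestimation of them pushes Q, and hence I², upward.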
DFO and FDO algorithms are less reliable than SDO algorithms and are likely to terminate at a point too far from the optimal point. 33 This may explain why the estimates for the between-study variances from MetaDAS and the standard errors for the logit specificity and logit sensitivity from MetaDTA differed from metadta, midas, and metandi.
Approximating the within-study variability by a normal distribution leads to a downward bias in the estimates for sensitivity and specificity and their corresponding between-study variances. 34 This may explain why mada reported the lowest estimates for sensitivity, specificity, and the between-study variance of logit specificity.
Figure 1 shows the forest plot of sensitivity and specificity from metadta. The forest plot from midas was similar but is not presented here. MetaDTA and mada produced two separate forest plots, one for sensitivity and the other for specificity, which we combined into one graph (see Figures 2 and 3, respectively).
In the forest plots, metadta and midas presented the 95% Clopper-Pearson confidence intervals. mada presented the 95% Wilson confidence intervals on the "corrected" data. By default, all cells in all the studies had a 0.5 adjustment. This caused the difference in mada's and MetaDTA's forest plots. The Wilson confidence intervals are generally shorter than the Clopper-Pearson confidence intervals, with an observed
TABLE 2 Parameter estimates and the corresponding 95% confidence intervals in brackets: a meta-analysis of diagnostic accuracy of telomerase for primary diagnosis of bladder cancer.
Figure 4 presents the generated SROC plots. The SROC curves from metadta and metandi are very similar. The SROC curve from MetaDTA appears shorter, while the one from midas includes all points in the range (0, 1) even when such points are not present in the data. The curve from mada is different because the differences in the parameter estimates in Table 2 are propagated into the derivation of the SROC curve.
The 95% confidence regions presented by metadta, midas, and metandi appear similar. The prediction regions shown by metadta and metandi also appear similar. The prediction region presented by midas is vast and seems inverted. We suspect an error in midas in computing the prediction region. The 95% confidence and prediction regions shown by MetaDTA are very thin, possibly because the variances of the mean logit sensitivity and mean logit specificity were practically zero (see Table 2). mada did not present the prediction region.
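Returning to the study-specific intervals shown in the forest plots, the two interval types can be contrasted with a small sketch. It is a generic illustration on raw counts: it does not apply the 0.5 adjustment that mada uses, the function names are ours, and the counts are hypothetical. The Clopper-Pearson limits are found by bisection on the binomial tail probabilities rather than through beta quantiles, to keep the example free of external libraries.

```python
import math


def wilson_interval(x, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z is the 97.5% normal quantile)."""
    p = x / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); increasing in p."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def _solve_increasing(f, target, iterations=100):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson_interval(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval obtained by inverting the binomial tails."""
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: upper_tail(x, n, p), alpha / 2.0)          # P(X >= x) = alpha/2
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: upper_tail(x + 1, n, p), 1.0 - alpha / 2.0)  # P(X <= x) = alpha/2
    return lower, upper


# Hypothetical study: 18 true positives among 20 participants with the condition.
wilson = wilson_interval(18, 20)
exact = clopper_pearson_interval(18, 20)
```

For these counts the Wilson interval is the shorter of the two, in line with the general pattern noted in the text.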

| Independent studies-direct comparison of the results
Ritchie et al. 36 performed a Cochrane review on the accuracy of amyloid-beta protein in cerebrospinal fluid to identify patients with mild cognitive impairment who would develop Alzheimer's disease dementia or other forms of dementia. The data set consisted of 14 studies. They fitted an HSROC model using the NLMIXED procedure in SAS.
We chose this data set because Freeman et al. 4 used it to demonstrate MetaDTA, giving us a direct comparison of the results.
The parameter estimates are presented in Table 3. The authors 36 did not estimate the summary sensitivity and specificity due to threshold variations. In this analysis, we present them to replicate the analysis by Freeman et al. 4 without drawing conclusions beyond the original publication.
The estimates for the summary sensitivity and specificity and the between-study heterogeneity from all the procedures were very similar except for mada. As in the previous example, the use of the normal-normal model and the continuity correction caused the difference in the results. The slight differences in estimates from the other procedures are due to differences in the approximation of the BRMA likelihood and the optimization algorithms.
The SROC plots are presented in Figure 5. The shapes of the SROC curves from metadta and MetaDTA appear similar. The SROC curve from metandi extends beyond the 95% prediction region to the left, while the curve from midas extends beyond the 95% prediction region on both ends. metandi and midas compute the SROC curve for all plausible values, that is, [0, 1]. metandi then truncates the curve on the observed sensitivity values, that is, the minimum and the maximum sensitivity. In contrast, the SROC curves from metadta, MetaDTA, and mada are based on the observed specificity values. The shapes of the 95% confidence regions appear similar except for mada. The prediction regions from all the procedures also appear similar except for midas.
The forest plots generated by metadta, mada, midas, and MetaDTA had similar characteristics as in example 1 but are not presented here.

| Comparative meta-analysis-Inclusion of covariate information
Arbyn et al. 37
In this re-analysis, we considered procedures that give a user an option to include information from a covariate. This criterion excluded metandi but included midas and MetaDTA, which have no functions for meta-regression. Table 4 presents the generated parameter estimates from the re-analyses.
The estimates for the absolute and the relative sensitivity and specificity and the between-study variance and correlation from metadta and MetaDAS were very similar despite the difference in the optimization algorithm.
The absolute sensitivity and specificity estimates from mada were surprisingly similar to those from the original analysis despite the difference in the fitted model. mada presented the model coefficients in the logit scale. In an Excel sheet, we transformed them to the probability scale but could not calculate the 95% confidence intervals for RepC because its standard error was not supplied. The transformed model coefficients refer to the median sensitivity and specificity, in contrast to the population-averaged estimates reported by metadta and MetaDAS. The relative sensitivity and specificity estimates are missing from Table 4 because mada did not derive them.
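The distinction between the median and the population-averaged sensitivity can be made concrete with a small numerical sketch. Back-transforming a logit-scale coefficient gives the value for a study whose random effect is zero (the median), whereas the population-averaged value integrates over the random-effects distribution. The code below uses simple trapezoidal integration in one dimension; it is not the marginal standardization routine of metadta, and the function names and input values are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def population_averaged(mu, tau, grid=2001, width=8.0):
    """Approximate E[expit(mu + b)] for b ~ N(0, tau^2) with the trapezoidal rule."""
    if tau == 0.0:
        return expit(mu)  # no between-study variation: mean and median coincide
    lo, hi = -width * tau, width * tau
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for k in range(grid):
        b = lo + k * h
        density = math.exp(-0.5 * (b / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if k in (0, grid - 1) else 1.0  # trapezoid end points
        total += end_weight * expit(mu + b) * density * h
    return total


# Hypothetical mean logit sensitivity and between-study standard deviation.
mu, tau = 2.0, 1.0
median_sensitivity = expit(mu)
mean_sensitivity = population_averaged(mu, tau)
```

For a mean logit above zero, the population-averaged value is pulled toward 0.5 relative to the median, and the gap widens as the between-study variance grows.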
TABLE 4 Parameter estimates and the corresponding 95% confidence intervals in brackets: a meta-analysis of diagnostic accuracy of human papillomavirus testing and repeat cytology to triage women with an equivocal Pap smear to diagnose cervical pre-cancer.
midas performed a subgroup analysis. It fitted two models, one for HC2 and the second for RepC. This explains why the estimates for the absolute sensitivity and specificity for HC2 and RepC differ from those from metadta, even though the two procedures use the same optimization procedure. By letting the model borrow information from all studies through the correlation structure, meta-regression can yield more precise results than sub-group analysis. This may partly explain why metadta's 95% confidence intervals were shorter.
The estimates for the absolute sensitivity and specificity of HC2 and RepC from MetaDTA are missing because it fitted a model without covariates. The covariate information was used in the SROC plot to differentiate between the two groups in the data set (see Figure 6).
midas and MetaDTA treated the observations as independent studies even though there were two dependent observations from each study. Consequently, even though the data came from 10 studies, the procedures treated the data as if from 20 different studies. This explains why their between-study correlation estimate
Mis-specifying the random effects distribution has a potential for a drop in efficiency in the prediction of the random effects and the estimation of other parameters. 38 This may explain why the estimates for the overall sensitivity and specificity from midas and MetaDTA were very similar to those from metadta but with wider 95% confidence intervals.
The smaller estimates for the between-study variance from metadta and MetaDAS compared to MetaDTA are a manifestation of how meta-regression diminishes heterogeneity through model-based adjustments. 39
Figure 6 presents the generated SROC plots. metadta produced an SROC plot for RepC and HC2 in one graph, while midas and MetaDTA produced an "overall" SROC plot. Both metadta and MetaDTA identify the two groupings in the data set in their respective SROC plots. The mada documentation indicates that "once in meta-regression…, one cannot reasonably plot SROC curves, since fixed values for the covariates would have to be supplied". The covariate values are fixed with one categorical variable and need not be provided. Nonetheless, mada does not generate an SROC plot for this simple case.
Since the data in this example were from paired studies, that is, two tests conducted in the same study, a forest plot of the relative sensitivity and specificity may be more valuable and informative than the absolute sensitivity and specificity. The forest plot from metadta with the relative sensitivity and specificity is presented in the left plot of Figure 7. midas generated a forest plot of the summary absolute sensitivity and specificity for each covariate level (in Figure 7, right). MetaDTA generated separate forest plots of the absolute sensitivity and specificity which are not presented here.
is detected in 99% of cervical tumors, testing for high-risk HPV has become the new standard screening strategy.
We re-visit the Cochrane review by Koliopoulos 40 where they assessed the accuracy of HPV testing (HC2) as an alternative screening test to cervical cytology (liquid-based [LBC] and conventional [CC]) in the general population. The review included paired studies where all participants received HPV testing and cervical cytology.
The covariate of interest was the type of testing (HC2, CC, and LBC). The potential sources of heterogeneity considered were: the geographical location where the study was done, the age limits of the study population, and the likelihood of verification bias.
The authors intended to use a mixed model that would use all the available data. Because of the limited functionalities in the existing procedures, they instead performed multiple pairwise comparisons for each potential source of heterogeneity. They used two different statistical software packages and four different procedures to synthesize the data. When there were more than four studies, they used MetaDAS. PROC NLMIXED in SAS was used when there were three studies, with the independence variance structure imposed between logit sensitivity and logit specificity. When there were only two studies, they used metaprop 41 to obtain the absolute sensitivity and specificity separately and metan 42 to obtain the relative sensitivity and specificity.
We reanalyzed the data using metadta only. We limited ourselves to the studies assessing the accuracy of CC versus HC2 (1 pg/mL) and LBC versus HC2 (1 pg/mL) in the detection of high-grade cervical intraepithelial neoplasia grades 2 or worse (CIN2+) in atypical squamous cells of undetermined significance (ASCUS+). There were 17 studies; nine assessed the accuracy of CC versus HC2, while the rest assessed LBC versus HC2. Two studies evaluated the accuracy of both CC versus HC2 and LBC versus HC2. The network plot is presented in Figure 8.
The authors reported that the relative sensitivity and specificity of HC2
To directly compare the results from the original analysis, we first performed a stratified comparative meta-analysis. In one command, metadta fitted two different models: one comparing the accuracy of HC2 to CC and the other HC2 to LBC. Figure 9 presents the generated forest plots of the relative sensitivity and specificity of HC2 versus CC (top) and HC2 versus LBC (bottom). The summary relative sensitivity and specificity were similar to those from the original analysis except for the shorter 95% confidence interval of the relative sensitivity of HC2 versus CC from metadta.
To use all the data in one model, we then performed a CB-NMA with HC2 as the common comparator test and CC and LBC as the index tests. From this analysis, the relative sensitivity and specificity of HC2
Unlike in the previous analysis, one forest plot of the relative sensitivity and specificity was generated (see Figure 10) with grouping by the index test (LBC and CC).
metadta conducted a hypothesis test to examine whether the obtained relative sensitivities and specificities differed. From this test, the two relative (HC2 vs. CC and HC2 vs. LBC) sensitivities (p = 0.10) and specificities (p = 0.09) were not different. The average relative sensitivity and specificity were 1.29 [95% CI: 1. 16
Figure 11 presents the SROC plot (right) and the forest plot of the summary relative sensitivity and specificity (left). Judging by the SROC curves, HC2 had superior discriminatory power, followed by CC and LBC.
Compared to the CB-NMA, the AB-NMA and the stratified comparative meta-analysis did not generate an overall relative sensitivity and specificity of HC2 relative to both CC and LBC. The estimates from the AB-NMA were closer to those from the stratified comparative meta-analysis than to those from the CB-NMA. Compared to the estimates from the stratified comparative meta-analysis, the estimates from the AB-NMA were smoothed by allowing the regression coefficients to be mutually informed by covariates and the correlation structure between logit sensitivity and specificity. In contrast to the AB-NMA, the CB-NMA "shrunk" the estimates to accommodate the assumption of the common "baseline" sensitivity and specificity in all studies.
The AIC values from the AB-NMA and CB-NMA were 813.20 and 829.15, respectively. The lower AIC indicates that the AB-NMA model fitted the data slightly better. That said, the CB-NMA gives a direct answer about the accuracy of HPV testing as an alternative screening test to cervical cytology: HC2 was 29% more sensitive, with a 5% loss in specificity.
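The arithmetic behind this comparison can be sketched as follows. This is a minimal illustration in Python, not output from metadta: the AIC values and the relative sensitivity of 1.29 are taken from the text, the relative specificity of 0.95 is only implied by the stated 5% loss and is an assumption here, and the helper name `percent_change` is hypothetical.

```python
# AIC values as reported in the text; a lower AIC indicates a better fit.
aic = {"AB-NMA": 813.20, "CB-NMA": 829.15}

# Model with the smallest AIC and each model's difference from it.
best = min(aic, key=aic.get)
delta_aic = {model: round(value - aic[best], 2) for model, value in aic.items()}


def percent_change(ratio):
    """Convert a relative accuracy (a ratio) into a percentage change."""
    return round((ratio - 1.0) * 100.0, 1)


rel_sens = 1.29  # reported average relative sensitivity of HC2
rel_spec = 0.95  # ASSUMPTION: implied by the reported 5% loss in specificity

print(best, delta_aic)
print(percent_change(rel_sens), percent_change(rel_spec))
```

A ratio above 1 reads as a gain and a ratio below 1 as a loss, which is how the relative sensitivity and specificity translate into the percentages quoted above.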

| DISCUSSION
In the previous sections, we compared and contrasted the functionality, robustness, and accuracy of metadta, mada, MetaDTA, metandi, midas, and MetaDAS in evidence synthesis of DTA studies. We explored how the likelihood function of the models fitted to the data was approximated, the optimization technique used to obtain the parameter estimates, the complexity of the data sets that can be handled, the processing of the model estimates, and the presentation of the results. Using data sets from four previously published meta-analyses, we showed that metadta's performance was satisfactory. We replicated the analyses performed by the existing procedures and showed how to fit an NMA without imputation of data, for which no alternative procedure exists in Stata.
There is no consensus on the recommended procedures for meta-analysis of DTA studies. 43 To validate metadta, we aimed to include the most popular procedures dedicated to the meta-analysis of DTA studies using the BRMA in commercial (SAS and Stata) and open-source (R) statistical software. Some have recommended 44,45 mada, MetaDAS, midas, and metandi. The Screening and Diagnostic Tests Methods Group 46 recommends MetaDAS, MetaDTA, midas, and metandi. Although mada does not implement exactly the same model as the other approaches, since it uses normal approximations to the binomial likelihoods, we chose to include it in this comparison because it was the only "comparable" procedure available in R. Furthermore, as observed in Section 5.2, when the approximation to the normal distribution performs well, differences in the parameter estimates are barely noticeable. 47 While metatron 48 implements the BRMA in R and is recommended for Cochrane Reviews, 46 we did not include it because all attempts to obtain estimates in Section 5.1 failed. Furthermore, the procedure is no longer available from the CRAN repository.
FIGURE 11 SROC plot and forest plot of the summary relative sensitivity and specificity: arm-based network meta-analysis of the accuracy of HPV testing as an alternative screening test to cervical cytology.
metadta implements both AB-NMA and CB-NMA. Veroniki et al. 9 provide an overview of alternative models for DTA-NMA. We chose the NMA models implemented in metadta for their simplicity of implementation and the direct interpretation of their parameters. It is argued that the assumption of exchangeability of the absolute effects in the AB models cannot be guaranteed, 49,50 and that, by modeling the contrasts, the CB models preserve the force and validity of a randomized trial in meta-analysis. 51 There exists no formal test to validate the assumptions in the two classes of models.
As the debate between the advocates of either class continues, 49,52 the analyst's choice can depend on the target parameter of interest and on how well the model fits the data.
Graphical output eases exploration, interpretation, and communication of the results from the meta-analysis. The SROC plot is highly informative. In Sections 5.3 and 6, metadta revealed the usefulness of the SROC plot in summarizing and communicating the results from multiple tests in a compact way. At a glance, it summarized the location of the individual study results and their covariation alongside the summary meta-analysis results. Opinions are divided on whether to display the SROC curve after the BRMA. Unlike a traditional ROC curve, which describes the impact of the threshold on accuracy, the SROC curve in a meta-analysis of DTA summarizes the central tendency of a set of accuracy reports. 53 In Sections 5.3 and 6, it was easy to identify the test with the best discriminatory power by picking the highest SROC curve in metadta's SROC plots.
A unique feature of metadta is the ability to include multiple covariates and interaction terms. In practice, the number of covariates and interactions that can be incorporated in a given model is limited. First, a large number of studies is required to obtain precise estimates considering that the basic BRMA without covariates already contains five parameters. The general recommendation is at least 10 studies for each covariate. 54 That said, because of the assumed correlation in the BRMA, the effective sample size will be higher than the number of studies.
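For reference, the five parameters of the basic BRMA can be read off the usual formulation of the model. The sketch below uses notation assumed for illustration (it may differ from the notation used elsewhere in the article): in study $i$, $y_{i1}$ is the number of true positives among $n_{i1}$ diseased participants and $y_{i2}$ the number of true negatives among $n_{i2}$ non-diseased participants.

```latex
\begin{align*}
y_{ij} \mid p_{ij} &\sim \mathrm{Binomial}(n_{ij},\, p_{ij}),
  \qquad j = 1 \ (\text{sensitivity}),\ 2 \ (\text{specificity}), \\
\operatorname{logit}(p_{ij}) &= \mu_j + \delta_{ij}, \\
\begin{pmatrix} \delta_{i1} \\ \delta_{i2} \end{pmatrix}
  &\sim N\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},\
    \begin{pmatrix}
      \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\
      \rho\,\sigma_1\sigma_2 & \sigma_2^2
    \end{pmatrix}
  \right).
\end{align*}
```

The five parameters are the two mean logits $\mu_1$ and $\mu_2$, the two between-study variances $\sigma_1^2$ and $\sigma_2^2$, and the correlation $\rho$. In this formulation, each covariate added to both linear predictors contributes at least two further regression coefficients.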
Furthermore, studies rarely report uniformly on study-level covariates, and when they do, study-level covariate adjustment might be prone to ecological bias if any of the covariates is an average estimate of a patient-level characteristic such as mean age. 55,56 Including multiple covariates makes the interpretation of the raw model coefficients (the logits and odds ratios) less transparent. This can lead to incorrect interpretation of the results among meta-analysts unfamiliar with the underlying principles of multiple regression. To make interpretation easier, metadta reports the population-averaged estimates of absolute sensitivity and specificity and/or average marginal relative sensitivity and specificity. These marginal estimates are more relevant to public health policy. Other concerns regarding meta-regression are data dredging and an inflated Type I error rate. 57 Nevertheless, this feature is particularly valuable in obtaining smoothed estimates of absolute or relative sensitivity and specificity in different patient populations, clinical settings, etc. The approach is statistically sound, far more robust than the widely used approach of multiple subgroup analyses, and allows more coherent conclusions to be drawn. To safely exploit this feature, the meta-analysis should be carefully designed, and the review authors should involve or consult a statistician familiar with advanced methods for meta-analysis.
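The distinction between a raw logit and a population-averaged estimate can be illustrated with a small numerical sketch. It is not a description of how metadta computes its marginal estimates: it assumes a single normally distributed random effect on the logit scale, integrates it out with a simple midpoint rule, and uses hypothetical values for the mean logit (`mu`) and the between-study standard deviation (`sigma`).

```python
import math


def expit(x):
    """Inverse logit: map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def population_averaged(mu, sigma, half_width=8.0, steps=4000):
    """Average expit(mu + sigma * z) over z ~ N(0, 1) with a midpoint rule.

    The integration range [-half_width, half_width] covers essentially
    all of the standard normal probability mass.
    """
    if sigma == 0.0:
        return expit(mu)
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        z = -half_width + (k + 0.5) * h
        total += expit(mu + sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


# Hypothetical values chosen only for illustration.
mu, sigma = 2.0, 1.0
conditional = expit(mu)                   # sensitivity in a "typical" study
marginal = population_averaged(mu, sigma)  # average across the population of studies
print(round(conditional, 3), round(marginal, 3))
```

With between-study heterogeneity and a mean logit above zero, the population-averaged probability is pulled toward 0.5 relative to the back-transformed mean logit, which is one reason why raw model coefficients and marginal summaries should not be read interchangeably.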
Future research will focus on sensitivity analysis to identify potential outlying studies and produce more meaningful reports especially from the NMA. A comparison of metadta with comparable procedures for DTA-NMA is also foreseen.
Researchers interested in the meta-analysis of DTA and convenors of the Screening and Diagnostic Tests Methods Group are calling for general and robust methods that fully utilize all the available data. 58,59 The merits of metadta are efficient operation with minimal user intervention, flexibility to deal with complex data sets, and better presentation of the results in rich reports and graphics. We expect its availability to stimulate better practice in the meta-analysis, network meta-analysis, and meta-regression of DTA.