Quantifying how diagnostic test accuracy depends on threshold in a meta‐analysis

Tests for disease often produce a continuous measure, such as the concentration of some biomarker in a blood sample. In clinical practice, a threshold C is selected such that results, say, greater than C are declared positive and those less than C negative. Measures of test accuracy such as sensitivity and specificity depend crucially on C, and the optimal value of this threshold is usually a key question for clinical practice. Standard methods for meta‐analysis of test accuracy (i) do not provide summary estimates of accuracy at each threshold, precluding selection of the optimal threshold, and furthermore, (ii) do not make use of all available data. We describe a multinomial meta‐analysis model that can take any number of pairs of sensitivity and specificity from each study and explicitly quantifies how accuracy depends on C. Our model assumes that some prespecified or Box‐Cox transformation of test results in the diseased and disease‐free populations has a logistic distribution. The Box‐Cox transformation parameter can be estimated from the data, allowing for a flexible range of underlying distributions. We parameterise in terms of the means and scale parameters of the two logistic distributions. In addition to credible intervals for the pooled sensitivity and specificity across all thresholds, we produce prediction intervals, allowing for between‐study heterogeneity in all parameters. We demonstrate the model using two case study meta‐analyses, examining the accuracy of tests for acute heart failure and preeclampsia. We show how the model can be extended to explore reasons for heterogeneity using study‐level covariates.

Standard methods for meta-analysis of test accuracy synthesise only a single estimate of sensitivity and specificity from each study, despite studies very often reporting estimates at multiple thresholds. 4 The presence of these additional data is widely regarded as problematic, due to the additional complexities they introduce into data synthesis. However, within-study information on how test accuracy varies with threshold could clearly be extremely valuable, both for quantifying the average sensitivity and specificity across all thresholds and for disentangling heterogeneity due to varying thresholds from that due to other factors.
A simple approach to addressing both problems is to perform a separate meta-analysis of the data at each threshold or for groups of similar thresholds. The Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy notes that "Each study can contribute to one or more analyses depending on what thresholds it reports". 5 This will produce summary estimates of the sensitivity and specificity of the test at each threshold in the data set. However, case studies have demonstrated bias in these estimates if some studies only report accuracy measures at data-driven "optimal" thresholds. [6][7][8] Confidence or credible intervals will also be very wide for thresholds with limited data. It might be possible to address these problems through imputation of missing data in each study prior to meta-analysis, 7,8 although accounting for uncertainty in these imputations requires extra steps. 8 An additional problem also remains: if higher values increase the likelihood of a disease, by definition, sensitivity must reduce and specificity increase with increasing threshold. However, these relationships will not necessarily hold in the summary estimates.
The alternative is a single unified analysis of all available data. This enables "borrowing of strength" across thresholds and produces estimates that conform to the known relationship between threshold, sensitivity, and specificity. Various models have been proposed for such a unified analysis. Some produce a summary ROC curve but not estimates of sensitivity and specificity at specific thresholds. [9][10][11] Others were devised for the synthesis of ordinal test results with a small number of categories. [12][13][14] Extending these models to truly continuous test results would require very large numbers of parameters to be estimated. For example, Riley et al proposed a 2K-dimensional multivariate normal model, where K is the total number of distinct thresholds included in the meta-analysis, but noted that this will often not be estimable. 15 An alternative approach is to explicitly model sensitivity and specificity as functions of threshold. 4,15 However, there is a lack of clarity on which function of threshold is the most appropriate. Steinhauser et al 4 and Hoyer et al 16 proposed unified models that assume that test results in the diseased and disease-free populations have a prespecified distributional form, for example, log-normal. However, a criticism is that the appropriate choice of distribution might not be known in advance. 8 Hoyer et al note that it would be a "definite advantage" if the distributions could be "estimated simultaneously together with the other model parameters". 16 We present a new model that, compared with previous approaches, has most in common with, but potential advantages over, that suggested by Steinhauser et al. 4 We model the exact multinomial likelihoods of the spread of test results across categories defined by thresholds, rather than requiring the normal approximations used by Steinhauser et al. 4 This approach automatically accounts for within-study correlations resulting from studies reporting at more than one threshold and should perform better with small counts. 2 We also relax the assumption that the appropriate distributional form is known, assuming only that some Box-Cox transformation of test results in the two populations has a logistic distribution. The Box-Cox transformation parameter can be estimated from the data. Our model is parameterised directly in terms of the means and scale parameters of these two logistic distributions and is easily extended to allow for study-level covariates acting on any of these four parameters.

We first describe the model in Section 2, including the extended version to include study-level covariates. In Section 3, we describe two case study data sets, to which we then apply the model in Section 4, before concluding in Section 5.

A GENERAL FLEXIBLE MODEL STRUCTURE
We describe a flexible model that is straightforward to fit in a Bayesian framework using Markov chain Monte Carlo (MCMC) simulation software, such as WinBUGS 17 or OpenBUGS.

Notation and within-study model
We consider the case where each study, i, reports estimates of sensitivity and specificity at T i distinct thresholds; or, equivalently, directly reports count data in a form such as Table 1. T i might be equal to one in some studies, ie, it is not required that all studies contribute more than one pair of data points. We will assume throughout that the true disease state is known for all individuals, through application of some gold standard test. We denote the total number of individuals without and with the disease in study i by N i1 and N i2 , respectively.

TABLE 1
Test accuracy data from a study, indexed i, providing estimates of sensitivity and specificity at T i distinct thresholds, C i1 < ... < C iT i

Population      Total number of patients    Number with test result above C i1 , ... , above C iT i
Disease-free    N i1                        x i11 , ... , x i1T i
Diseased        N i2                        x i21 , ... , x i2T i
We assume that higher values of the continuous test result are associated with increased likelihood of disease, such that a "positive" test result is one that falls above a given threshold. At each threshold C it , t = 1, … ,T i , we denote the number of false positive and true positive individuals by x i1t and x i2t , respectively (Table 1). These counts must, by definition, be monotonically decreasing with t, a property which the model should reflect.
In each study, we can subdivide each patient population (N i1 and N i2 ) into (T i + 1) mutually exclusive groups, with test results falling below C i1 , between C it and C i,t + 1 (t = 1, … ,T i -1), and above C i,T i . The distribution of each of the two sets of results across these groups is multinomial. Conditional on the underlying probability parameters, the two multinomial distributions for each study i are independent of each other.
For model fitting purposes, it is convenient to use the binomial factorisation of these multinomial distributions, 18 in which they are written as a series of conditionally independent binomial distributions, ie, for population j = 1 (disease-free) and j = 2 (diseased):

x ij1 ~ Binomial(N ij , pr ij1 ),
x ijt | x ij,t-1 ~ Binomial(x ij,t-1 , pr ijt / pr ij,t-1 ), t = 2, ... , T i ,

where pr i1t and pr i2t are the false positive rate (fpr) (= 1 - specificity) and true positive rate (tpr) (= sensitivity) at threshold C it in study i. By definition, pr i1t and pr i2t monotonically decrease with increasing t and lie in [0,1], such that each binomial probability parameter is unconstrained within the interval [0,1]. This parameterisation obviates the need to reexpress the Table 1 data as numbers of patients falling between each threshold value and allows the same model code to be applied to studies with binomial (T i = 1) and multinomial (T i > 1) likelihoods.

We wish to specify the tprs and fprs as functions of threshold. The appropriate function depends on the distribution of continuous test results in the disease-free and diseased populations. We will assume that there exists some monotonic transformation, g(), that transforms test results in each of the two populations to either a normal or logistic distribution. This is the most common assumption made in the fitting of a smooth line to an empirical ROC curve. 19 We will work on the basis of logistic distributions throughout, which are similar to normal distributions and more computationally convenient (leading to logit rather than probit link functions). We assume for now that the same g() applies to both distributions and to all studies in the meta-analysis, but will discuss relaxing this assumption later (Section 5).
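To make the factorisation concrete, the following sketch (in Python rather than the WinBUGS code used for our analyses; function names and counts are hypothetical) evaluates the log-likelihood of the decreasing counts above each threshold as a product of conditional binomials:

```python
import math

def binom_logpmf(k, n, p):
    """Binomial log-pmf, written out to keep the sketch dependency-free."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1.0 - p)

def factorised_loglik(N, x, pr):
    """Log-likelihood of the counts above each threshold (x[0] >= x[1] >= ...)
    via the binomial factorisation of the multinomial:
    x_1 ~ Bin(N, pr_1);  x_t | x_{t-1} ~ Bin(x_{t-1}, pr_t / pr_{t-1})."""
    ll = binom_logpmf(x[0], N, pr[0])
    for t in range(1, len(x)):
        ll += binom_logpmf(x[t], x[t - 1], pr[t] / pr[t - 1])
    return ll
```

Because the factorisation is exact, this equals the log-likelihood of the corresponding multinomial over the (T i + 1) between-threshold categories.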
Let us denote the continuous test results of disease-free and diseased individuals in the ith study by y i1k and y i2k respectively, where k is an index for the individual. We assume that

g(y ijk ) ~ Logistic(μ ij , σ ij ),

where μ ij and σ ij denote mean and scale parameters for disease status group j. As g() is monotonic, pr ijt ≡ Pr(y ijk ≥ C it ) = Pr(g(y ijk ) ≥ g(C it )). It follows from the cumulative distribution function of the logistic distribution that the fprs and tprs at a threshold of C it in study i are defined as follows:

pr ijt = 1 / (1 + exp{(g(C it ) - μ ij ) / σ ij }), equivalently logit(pr ijt ) = (μ ij - g(C it )) / σ ij . (1)
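As a numerical illustration of Equation (1), the sketch below evaluates Pr(result ≥ C) under a logistic distribution on the g() scale (all parameter values here are made up for illustration, not estimates from our analyses):

```python
import math

def pr_above(C, mu, sigma, g=math.log):
    """Equation (1): Pr(test result >= C) when g(result) ~ Logistic(mu, sigma).
    With g = log, this is the survival function of a log-logistic distribution."""
    return 1.0 / (1.0 + math.exp((g(C) - mu) / sigma))
```

Note that pr_above is monotonically decreasing in C, as the model requires of the tprs and fprs.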

Choice of transformation function, g()
We see from Equation (1) that to explicitly model the dependence of pr ijt on C it , we need to move beyond the "semiparametric" or "distribution-free" approach often used to fit smooth ROC curves, 19 in which the transformation function g() remains unspecified. A fully parametric model is required. In particular, specifying logit(pr i1t ) and logit(pr i2t ) as linear functions of untransformed C it (eg, the work of Riley et al 15 ) would implicitly assume that g() is the identity function, that is, that the test results in the diseased and disease-free populations have symmetric, logistic distributions.

We might be comfortable prespecifying an appropriate transformation, g(). This could be informed by inspection of the distributions of test results from a laboratory or from one or more study publications. Often in practice, assuming g() is the natural logarithm (subsequently referred to simply as "log()") will be a reasonable approximation for positive-valued test results, which are often right skewed. 20 This corresponds to assuming a log-logistic distribution for each set of test results, one of the distributional forms considered by both Steinhauser et al 4 and Hoyer et al. 16

If, however, the analyst is not confident about the most appropriate transformation, or would like to assess the sensitivity of results to this assumption, we propose a more flexible approach. We assume only that g() is one of the set of Box-Cox transformations, defined by

g(y) = (y^λ - 1) / λ if λ ≠ 0; g(y) = log(y) if λ = 0.

This reduces to the assumption of logistic distributions of underlying test results when λ = 1. As λ decreases from 1, this indicates an increasing degree of right skew of the underlying distributions, with λ = 0 corresponding to log-logistic distributions.
The transformation parameter λ can be estimated with uncertainty from the data. This approach was proposed by Zou and Hall 21 for the estimation of ROC curves in a single study but, to our knowledge, has not previously been applied in a meta-analysis setting.
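The family of transformations can be sketched as follows; the λ = 0 case is the continuous limit of the general formula, which is why the family nests the log transformation:

```python
import math

def box_cox(y, lam):
    """Box-Cox transformation g(): (y**lam - 1)/lam for lam != 0, log(y) at lam = 0.
    lam = 1 gives a shifted identity (symmetric logistic test results);
    lam = 0 gives log (log-logistic test results)."""
    if lam == 0.0:
        return math.log(y)
    return (y ** lam - 1.0) / lam
```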

Between-study model
We assume that, across studies, μ ij is normally distributed with mean m μj and variance τ μj 2 , for each population j = 1, 2. Similarly, log(σ ij ) is assumed to be normally distributed with mean m σj and variance τ σj 2 .
We would generally anticipate some correlations across these four sets of random effects. Any between-study correlation structure might be specified. Here, we describe three, each of which we will apply to our case study data sets in Section 4.

(i) Full correlation matrix
To allow for all possible between-study correlations, we can fit a full quadrivariate normal distribution with six different correlation parameters:

(μ i1 , μ i2 , log(σ i1 ), log(σ i2 ))' ~ MVN( (m μ1 , m μ2 , m σ1 , m σ2 )', Σ ), (2)

where Σ is a 4 × 4 covariance matrix with diagonal elements τ μ1 2 , τ μ2 2 , τ σ1 2 , τ σ2 2 and six off-diagonal correlation parameters, including ρ = corr(μ i1 , μ i2 ). In WinBUGS, this can be fitted using a product normal formulation, which we describe in the Appendix.
(ii) Structured correlation matrix

As correlation parameters in multivariate meta-analysis models can be difficult to estimate, it is desirable to reduce the number to be estimated by prespecifying a realistic correlation structure.
One simplifying set of assumptions might be that all correlations arise through dependencies between the following three pairs of parameters: μ i1 and μ i2 ; μ i1 and log(σ i1 ); μ i2 and log(σ i2 ). In general, we might expect μ i1 and μ i2 to be positively correlated across studies. For example, study-level factors might raise or lower the expected test result in both the diseased and disease-free populations. We will denote this correlation by ρ (as in (2)). Study-specific log-scale parameters might also be expected to be positively correlated with means in the same patient group. We will assume that this correlation is the same in the diseased and disease-free populations and will denote it by ρ σ . We hypothesise that any other correlations between random effects (for example, between μ i1 and log(σ i2 )) are likely to be induced through ρ and ρ σ . The corresponding quadrivariate normal distribution can be written as four conditionally independent univariate distributions as follows:

μ i1 ~ N(m μ1 , τ μ1 2 ),
μ i2 | μ i1 ~ N(m μ2 + ρ (τ μ2 /τ μ1 )(μ i1 - m μ1 ), τ μ2 2 (1 - ρ 2 )),
log(σ i1 ) | μ i1 ~ N(m σ1 + ρ σ (τ σ1 /τ μ1 )(μ i1 - m μ1 ), τ σ1 2 (1 - ρ σ 2 )),
log(σ i2 ) | μ i2 ~ N(m σ2 + ρ σ (τ σ2 /τ μ2 )(μ i2 - m μ2 ), τ σ2 2 (1 - ρ σ 2 )).

(iii) Independence

For completeness, we include a model with four independent sets of random effects, ie, with all six correlation parameters in (2) equal to 0.
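The conditional formulation is straightforward to simulate from, which also provides a check on the correlation structure it induces. The sketch below (Python with illustrative parameter values, not our WinBUGS implementation) draws study-level parameters one conditional at a time:

```python
import numpy as np

def sample_structured(n, m, tau, rho, rho_s, seed=0):
    """Draw (mu_1, mu_2, log sigma_1, log sigma_2) for n hypothetical studies,
    using the structured model written as conditionally independent normals.
    m, tau: length-4 means and between-study SDs; rho: corr(mu_1, mu_2);
    rho_s: corr(mean, log-scale) within each disease status group."""
    rng = np.random.default_rng(seed)
    sd_r = np.sqrt(1.0 - rho ** 2)
    sd_s = np.sqrt(1.0 - rho_s ** 2)
    mu1 = rng.normal(m[0], tau[0], n)
    mu2 = rng.normal(m[1] + rho * (tau[1] / tau[0]) * (mu1 - m[0]), tau[1] * sd_r)
    ls1 = rng.normal(m[2] + rho_s * (tau[2] / tau[0]) * (mu1 - m[0]), tau[2] * sd_s)
    ls2 = rng.normal(m[3] + rho_s * (tau[3] / tau[1]) * (mu2 - m[1]), tau[3] * sd_s)
    return np.column_stack([mu1, mu2, ls1, ls2])
```

Under this structure, the remaining correlations are induced as products, eg, corr(μ i1 , log(σ i2 )) = ρ × ρ σ , which can be verified empirically from the simulated draws.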

Inclusion of study-level covariates
As in any meta-analysis, potential reasons for heterogeneity across studies in diagnostic test accuracy should be explored where possible, rather than simply accommodated using random effects. It is straightforward to extend our model to include study-level covariates, acting on the location (μ ij ) and/or log-scale (log(σ ij )) parameters. Furthermore, while it can be difficult to hypothesise how sensitivity and specificity might vary according to study characteristics, it seems natural to consider how the "average" test result or the spread of test results in either population might be affected. A generalised version of the "full" model (Equation (2)) is as follows:

(μ i1 , μ i2 , log(σ i1 ), log(σ i2 ))' ~ MVN( (m μ1 + z i1 'β 1 , m μ2 + z i2 'β 2 , m σ1 + z i3 'β 3 , m σ2 + z i4 'β 4 )', Σ ), (3)

where z ir are vectors of study-level covariates and β r are vectors of meta-regression coefficients to be estimated, r = 1, ... , 4. Special cases include z ir = z i , r = 1, ... , 4, whereby the same set of study-level covariates is hypothesised to be associated with all four sets of random effects, and z ir = 0 for some r, whereby we hypothesise associations of covariates with only a subset of the random effects.

CASE STUDY DATA SETS
We now describe two case study data sets, before fitting our model to each of these in Section 4.

Example 1: Brain natriuretic peptide (BNP) for diagnosis of acute heart failure
Roberts et al 22 performed a systematic review of the accuracy of brain natriuretic peptide (BNP) in diagnosing acute heart failure in adults presenting in an acute care setting with dyspnoea. The authors extracted measures of the accuracy of BNP, relative to a reference standard of retrospective review or final hospital diagnosis, from 26 studies of consecutive or randomly selected patients. 22 Many of these studies reported sensitivity and specificity at more than one threshold.
By checking each of the original study publications, we found that additional data were often available. In some studies, these were not displayed in tables but were, however, shown on ROC plots, on which the thresholds corresponding to particular points on the curve had been marked. We extracted sensitivity and specificity estimates from these plots using the DigitizeIt software (http://www.digitizeit.de/).
It is recognised that a given BNP measurement on one assay might not translate directly to the same value on other assays. 23 For this reason, we restrict our analyses to the 18 studies that assessed the accuracy of the Triage assay (Biosite Inc, San Diego). In total, the final data set consisted of 66 pairs of sensitivity and specificity from these 18 studies, ranging from a single pair (four studies) to seven (three studies). The data are displayed in two formats in Figure 1: on the left, on the ROC plane; on the right, as the probability of a positive test result in each of the two patient groups as a function of the threshold used. The data are available from figshare.
As noted above, a common approach to synthesising data with multiple thresholds, applied by authors including Roberts et al, 22 is to group the data into categories with similar thresholds and perform a number of stratified analyses. To demonstrate this approach, we rounded all thresholds to the nearest 50 and performed stratified analyses: for each threshold with at least four contributing studies, we fitted the standard bivariate meta-analysis model 1,2 in WinBUGS. 17 Summary results are shown on Figure 1. We see that these stratified analyses produce estimates of the tpr and fpr that do not reduce monotonically with increasing threshold. This problem is masked in the ROC plot (Figure 1, left).

There are several potential factors that might influence the accuracy of BNP as a test for acute heart failure. For example, Rogers et al 24 found age, gender, ethnicity, body mass index, blood urea nitrogen, and creatinine to all be associated with BNP levels independently of heart failure. We extracted the average age of patients in each study to explore whether the accuracy of the test varied by this factor (values available in the figshare file).

Example 2: Spot protein to creatinine ratio (PCR) for diagnosis of preeclampsia
Morris et al 25 systematically reviewed the literature for studies assessing the diagnostic accuracy of spot urinary protein to creatinine ratio (PCR) in detecting significant proteinuria in pregnant women with suspected preeclampsia. Significant proteinuria is defined as ≥0.3 g/24 hours, the "gold standard" test for which is a 24-hour urine collection. Data were extracted from 13 studies, each of which reported sensitivity and specificity estimates at between one (five studies) and nine (one study) thresholds. 25 The data are displayed in Figure 2 and are available in full from Morris et al. 25 Riley et al 15 previously analysed this data set using a multivariate normal meta-regression approach, in which logit (sensitivity) and logit (specificity) were modelled as polynomial functions of threshold, C. They found a cubic relationship with threshold to fit the data best. We show the summary estimates from their analysis on Figure 2 (point estimates and 95% CIs extracted from table 4 of Riley et al 15 ). We see that the implied summary ROC curve is not concave and does not seem to fully capture the relationship between the tprs and fprs and threshold.

APPLICATION TO CASE STUDY DATA SETS
We now fit our proposed model to each of the two case study data sets using WinBUGS. 17 We begin by fitting models with no study-level covariates, then explore whether heterogeneity can be explained by average patient age in the BNP data set. We gave Normal(0, 10 2 ) prior distributions to all means (m μj , m σj ) and meta-regression coefficients, Uniform(0, 5) prior distributions to the between-study standard deviations (τ μj , τ σj ), and Uniform(−1, 1) prior distributions to any between-study correlation parameters. We will compare results from models in which g() is prespecified with models in which the transformation function is estimated from the data (Section 2.2). For the latter, we assigned a Uniform(−3, 3) prior distribution to λ. We also performed sensitivity analyses with a Uniform(1, 10) prior distribution for λ, as used previously in analyses by O'Malley and Zou. 26 WinBUGS code is available from figshare.

Example 1: B-type natriuretic peptide for diagnosis of acute heart failure

As many of the articles in this systematic review noted that BNP values are right skewed and used the natural logarithm of BNP values in their own analyses, eg, Karmpaliotis et al, 27 Dokainish et al, 28 and Davis et al, 29 we first assumed g() = log() and fitted the three between-study models described in Section 2.3. Model fit, penalising for complexity, was compared using the Deviance Information Criterion (DIC). 30 Models with lower values of the DIC are preferred.
As shown in Table 2, differences in DIC across the three correlation structures (Models 1-3) were minimal. Notably, this is despite strong evidence of a positive correlation between μ i1 and μ i2 (Model 2 estimate of ρ = 0.80; 95% credible interval, Cr-I, 0.24 to 0.99). In the absence of any reduction in DIC from modelling between-study correlations, arguably the independence model is preferred for this data set.
We then extended the independence model (Model 3) to estimate the best-fitting Box-Cox transformation parameter λ, rather than assuming g() = log(). λ was estimated to be 0.23 (95% Cr-I 0.10, 0.34), indicating that the underlying distributions of test results are slightly less right-skewed than log-logistic (λ = 0). This estimate was not sensitive to the choice of prior. As shown in Table 2 (Model 4), this model fitted the data marginally better as measured by the mean residual deviance, but with a minimal reduction in DIC (1.7 points).
Summary fprs and tprs for each model were calculated by evaluating Equation (1) at the means of the four sets of random effects, ie, for any threshold C:

fpr(C) = 1 / (1 + exp{(g(C) - m μ1 ) / exp(m σ1 )}), tpr(C) = 1 / (1 + exp{(g(C) - m μ2 ) / exp(m σ2 )}).

As shown on Figure 3, these were generally reassuringly similar across models, particularly across the range of thresholds encompassing most of the data: between thresholds of 100 and 500, the maximum absolute differences in summary tpr and fpr estimates across models were 1% and 2%, respectively. Model 4 provided substantially lower summary estimates of the tpr at very high thresholds (>5% absolute difference for thresholds above 780), where the data are very sparse. Compared with the stratified meta-analyses of data at similar thresholds (Figure 1), the estimates of tpr and fpr are seen to be coherent (reducing as threshold increases) and more precise. By drawing predictions for a "new" study population from each set of random effects, and calculating the tpr and fpr at these predicted values, we also generated 95% prediction intervals. 31 For Model 3, Figure 3 shows prediction intervals in addition to Cr-Is for the summary estimates. The very wide prediction intervals, especially for the fpr at lower thresholds, illustrate that there is a large amount of between-study heterogeneity that is not explained by variation in thresholds.
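These two calculations can be sketched as follows (Python, g() = log(), independence model; all parameter values are illustrative rather than the posterior means from Table 2):

```python
import numpy as np

def summary_curve(C, m_mu, m_ls):
    """Summary fpr (j = 1) and tpr (j = 2) at thresholds C: Equation (1)
    evaluated at the between-study means, with g() = log()."""
    C = np.asarray(C, dtype=float)
    fpr = 1.0 / (1.0 + np.exp((np.log(C) - m_mu[0]) / np.exp(m_ls[0])))
    tpr = 1.0 / (1.0 + np.exp((np.log(C) - m_mu[1]) / np.exp(m_ls[1])))
    return fpr, tpr

def tpr_prediction_interval(C, m_mu2, tau_mu2, m_ls2, tau_ls2, n=100000, seed=1):
    """95% prediction interval for the tpr of a 'new' study at threshold C,
    drawing new-study random effects from their between-study distributions
    (assumed independent, as in Model 3)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(m_mu2, tau_mu2, n)
    sigma = np.exp(rng.normal(m_ls2, tau_ls2, n))
    tpr = 1.0 / (1.0 + np.exp((np.log(C) - mu) / sigma))
    return np.percentile(tpr, [2.5, 97.5])
```

By construction, the summary curves are monotonically decreasing in C, and the prediction interval widens as the between-study standard deviations grow.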
For comparison, we fitted the model proposed by Steinhauser et al 4 to the same data set, using the "diagmeta" R package. 32 A comparison of results with our "Model 3" is provided in the supplementary material. Summary fpr estimates with 95% CIs were very similar to estimates and Cr-Is from our Model 3. Across thresholds, diagmeta estimated a slightly higher (up to 3%) summary tpr than our model. See Discussion for possible explanations.
In an additional analysis (Model 5), we explored whether any of the between-study heterogeneity could be explained by average patient age. Several studies have noted that BNP tends to increase with age. 33,34 We fitted an extended version of Model 3 to assess whether average patient age was associated with the study-level location parameters, μ i1 and μ i2 . Specifically, in Equation (3), we set z i1 = z i2 = (centred) average patient age and z i3 = z i4 = 0. As we built upon Model 3, all correlation parameters in Equation (3) were also set to zero. Among patients without acute heart failure, the model estimated that a 5-year increase in average patient age was associated with a 15% increase in mean BNP, but the statistical evidence for this finding was weak (ratio = 1.15, 95% Cr-I 0.90 to 1.51). As shown on Figure 4, this estimated dependence of μ i1 on age drives higher estimates of the fpr in older populations. There was no evidence that BNP levels varied with average patient age among patients with acute heart failure (ratio of means = 1.05, 95% Cr-I 0.90 to 1.22). Unsurprisingly, given the weak evidence for any association, the model did not lead to any improvement in DIC (Table 2).
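As an aside on interpretation, because the covariate acts additively on μ i1 with g() = log(), the reported ratio of 1.15 per 5 years converts to a per-year coefficient on the log scale. The arithmetic can be sketched as follows (`beta_per_year` and `mean_ratio` are our labels for this illustration, not quantities reported above):

```python
import math

# Model 5 reports a ratio of 1.15 in mean BNP per 5-year increase in average
# patient age; on the log scale this corresponds to a meta-regression
# coefficient of log(1.15)/5 per year acting on mu_i1.
beta_per_year = math.log(1.15) / 5.0

def mean_ratio(age_difference_years):
    """Multiplicative change in mean BNP for a given difference in average age,
    assuming the log-linear covariate effect above (illustrative conversion)."""
    return math.exp(beta_per_year * age_difference_years)
```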

Example 2: Spot PCR for diagnosis of preeclampsia
As papers included in this systematic review indicated right skew in values of PCR, 35,36 we followed the same analysis strategy as for Example 1, ie, first assuming g() = log(). As shown in Table 3, again, the DIC did not provide support for including parameters for between-study correlations. An extension of the independence model to estimate the most appropriate Box-Cox transformation parameter (Model 4) provided an estimate of λ = −0.54 (95% Cr-I −0.99, −0.10), indicating that the underlying distributions of test results are slightly more right-skewed than log-logistic. However, this extension to the model was not supported by the DIC, which increased by 4.1 points relative to Model 3.

Figure 5 shows that the summary estimates from all four models were very similar across the entire range of thresholds. The maximum absolute difference in tpr was 2% (Model 4 versus Model 3 at the highest thresholds). Summary fpr estimates differed by a maximum of 3% across Models 1-3 but by up to 5% for Model 4 versus the others. These discrepancies, as seen in Figure 5, were at the lowest threshold values. In contrast, our summary estimates are markedly different from the best-fitting model of Riley et al 15 (as shown on Figure 2): mean absolute difference in summary tpr = 4% (maximum 8%) and mean absolute difference in summary fpr = 10% (maximum 22%), compared with our Model 3. Our models suggest a much greater dependency of tpr and fpr on threshold: for example, across the full range of thresholds, summary fprs from the model of Riley et al reduced from 0.30 to 0.02, whereas those from our Model 3 reduced from 0.52 to 0.02. This appears to better capture the range in the observed data. Prediction intervals are again seen to be extremely wide, reflecting a large amount of between-study heterogeneity in tpr and fpr even at the same threshold.
See the supplementary material for a comparison of Model 3 results with results from fitting the model proposed by Steinhauser et al. 4 Across thresholds, diagmeta consistently estimated a slightly higher summary tpr than our Model 3 (median difference = 3%). The Steinhauser model estimated a steeper gradient for the dependence of summary fpr on threshold than our model (maximum absolute difference in fpr = 9%). The 95% CIs around summary estimates from diagmeta were quite different from our 95% Cr-Is, in particular being narrower at lower thresholds and much wider at higher thresholds for the summary tpr. See the Discussion for possible explanations.

DISCUSSION
Since the most appropriate threshold at which to operate a test is usually a key clinical question, there is a need to move beyond "standard" meta-analysis methods 1-3 to explicitly quantify how sensitivity and specificity vary across thresholds.
Perhaps the most obvious approach would be simply to regress logit-transformed tprs and fprs (or, equivalently, sensitivity and specificity) on C. However, we see from Equation (1) that this would imply strong assumptions about the underlying test results: (i) that these have symmetric, logistic distributions; and (ii) if assuming constant slope parameters (as in a standard meta-regression 37 ), that the scale parameters of these logistic distributions are constant across studies, which seems unlikely in practice. We have described a model that allows for a range of skewed or symmetric distributions of test results and estimates study-specific location and scale parameters.
Riley et al 15 proposed a multivariate normal meta-regression approach, in which logit (sensitivity) and logit (specificity) are modelled as polynomial functions of threshold. For the PCR data set (our Example 2), they found a cubic relationship with threshold to have the best fit. However, if the common assumption holds that there is a monotonic transformation, g(), that transforms test results in both the diseased and disease-free populations to logistic, then it follows that logit (sensitivity) and logit (specificity) are in fact linear functions of g(C).
Other approaches have been suggested, which are based on specific assumptions about the underlying distributions of test results. 4,16 Of these, our proposed model is the most similar to that proposed by Steinhauser et al, who assume that test results have either logistic, log-logistic, normal, or log-normal distributions (depending on the choice of link function, logit or probit, and the choice of covariate, C or log(C)). 4 Our model allows for a more flexible range of underlying distributions through its ability to estimate a Box-Cox transformation parameter within the model. Alternatively, a value of λ can be prespecified based on knowledge of the specific test. We note that although λ was well estimated in each of our two case studies, computation time was increased (relative to prespecifying λ = 0, ie, g() = log()), due to high autocorrelations in the simulated values of this parameter.
An implicit assumption of our model is that the same transformation function, g(), applies to all studies in the meta-analysis. This seems defensible if all studies assessed the same continuous outcome and measured it in the same way. This might not be the case if values are not directly comparable across assays or machines made by different manufacturers. If this is known to be the case, it might be sensible to perform separate analyses for each assay or machine (see Example 1, where we restricted the meta-analysis to the 18 studies reporting data relating to the same assay).
Our model could also be extended to allow λ to vary randomly across studies (similar to the work of O'Malley and Zou 26 ) or to estimate separate transformation parameters for the diseased and disease-free populations. The latter extension violates the usual assumption made in estimating ROC curves that there is a linear relationship between the tprs and fprs on the logit scale. However, this assumption could be too restrictive, as noted by Putter et al. 14 For our two case study data sets, neither of these extensions materially impacted the estimates (not shown).
In addition to increased flexibility in distributional form, our model differs in other ways from that proposed by Steinhauser et al. 4 It is therefore not surprising that we observed some differences in summary estimates and intervals in both worked examples, even when prespecifying g() = log() (supplementary material). Firstly, Steinhauser et al made normal approximations to the true multinomial likelihoods of the count data in order to apply standard linear mixed modelling techniques. 4 In contrast, we modelled the multinomial likelihoods directly. This automatically accounts for within-study correlations resulting from a study reporting accuracy measures at more than one threshold and should perform better at thresholds where the number of positive test results is equal to or close to zero.
Likely the primary driver of the differences in these summary estimates, however, is that our between-study model is quite different from that of Steinhauser et al. 4 We parameterised our model in terms of the means and scale parameters of transformed test results in the diseased and disease-free populations. We assumed that the means and log-transformed scale parameters (log(σ_ij)) are normally distributed across studies. In contrast, Steinhauser et al specified logit(pr_i1t) and logit(pr_i2t) as linear functions of C_it or log(C_it) and assumed that the intercept and slope parameters were normally distributed across studies. 4 Note that, by definition, these slope parameters are equal to −1/σ_ij (see Equation (1)). As such, the two models make different assumptions about the nature of the between-study variation. As the sets of random effects in the different models are not linear transformations of each other, the summary estimates and amount of uncertainty around these may differ (as in the Supplementary material). We do not feel that general conclusions on the pattern of differences between the two models can be made based on only two case studies. However, an argument in favour of our parameterisation of the between-studies model is that it automatically constrains all scale parameters, σ_ij, to be positive. Steinhauser et al noted in their simulation study that occasionally, their parameterisation leads to estimation of impossible positive slope parameters (equivalently, negative σ_ij). 4 An alternative would be to extend the "hierarchical summary ROC" parameterisation that is often used for meta-analysis of diagnostic test accuracy. 3 An extension of this to model multiple thresholds 9 specifies pr_i1t and pr_i2t as functions of two sets of random effects (study-specific "accuracy" and "shape" parameters) and a number of "threshold" parameters, denoted by θ_it.
The threshold parameters are constrained to be ordered within each study, but are estimated independently of the θ_it's in other studies. To extend this model to the case of explicit numerical thresholds, one could instead specify θ_it as a linear function of g(C_it), with the intercepts and natural logarithms of the slope parameters being two additional sets of random effects.
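The relationship between thresholds and accuracy under the logistic model can be sketched as follows (our own illustration; function and variable names are assumptions): the logit of the true and false positive rates is linear in g(C), with slope −1/σ:

```python
import numpy as np

def expit(x):
    # Inverse logit.
    return 1.0 / (1.0 + np.exp(-x))

def tpr_fpr(C, mu_dis, sigma_dis, mu_non, sigma_non, g=np.log):
    """True and false positive rates at threshold C, assuming transformed
    test results g(X) are logistic in each population. Results > C are
    declared positive, so logit(pr) = (mu - g(C)) / sigma, ie, the slope
    with respect to g(C) is -1/sigma (cf. Equation (1))."""
    tpr = expit((mu_dis - g(C)) / sigma_dis)
    fpr = expit((mu_non - g(C)) / sigma_non)
    return tpr, fpr
```

This makes explicit why normal random effects on intercepts and slopes (Steinhauser et al) and normal random effects on μ_ij and log(σ_ij) (our model) are not linear transformations of one another.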
The parameterisation in this paper may be the most natural for investigating reasons for between-study heterogeneity. For example, given that BNP levels have been found to increase with patient age within studies, 33,34 it was intuitive (in the absence of individual level data) to fit average patient age as a covariate acting directly on the location parameters, μ_ij. In other data sets, we might hypothesise that a covariate is more likely to drive differences in the spread of test scores, represented by the scale parameters. Heterogeneity in fpr or tpr across studies could be driven by differences in either of these sets of parameters. Results of analyses with covariates should be interpreted with the caution advised for any meta-regression, given likely low statistical power, risk of chance findings and, when modelling the effect of average population characteristics (such as in our worked example), potential ecological bias. 37 In our two case studies, we found the majority of summary estimates to be reassuringly similar across variations of our model. The BNP analyses illustrate that we should be cautious, however, in interpreting estimates at extreme threshold values with little data. We found no improvement in model fit and very little impact on estimates of target parameters when estimating between-study correlation parameters relative to estimating each set of random effects separately. This is probably because there is very little information to inform the between-study correlations. This will not necessarily be the case in data sets that include large numbers of studies on large numbers of patients. We also note that our "structured" correlation matrix is one of many possible structures that might be hypothesised and fitted, depending on knowledge of the test and data.
Following estimation of "summary" sensitivity and specificity across all thresholds, any one of a number of criteria might be applied to decide upon the optimal threshold. A very simple approach would be to maximise the Youden index, defined as sensitivity + specificity − 1. 38 However, it will not often be appropriate to weight sensitivity and specificity equally in this way. For example, the potential role of natriuretic peptides in an acute care setting is as a "rule out" test: in this context, high sensitivity will generally be considered more important than high specificity. 22 See Figure 5 of the work of Steinhauser et al 4 for a demonstration of how weighting the Youden index in favour of sensitivity reduces the "optimal threshold" for BNP testing. One alternative simple approach might be to maximise the sensitivity for a prespecified maximum acceptable fpr. More sophisticated approaches explicitly account for the prevalence of the disease in the decision population and the costs and anticipated consequences (good and bad) of all four possible outcomes of a test: true positive, false positive, true negative, and false negative. Given an economic decision model, the optimal threshold can be selected to maximise the expected net benefit. 39 We emphasise the importance of utilising multiple pairs of sensitivity and specificity from studies in a meta-analysis, where available. Even if these are not stated in the text or tables, it will often be possible to extract additional data from ROC curves using digitizing software. These valuable additional data allow for a very flexible modelling approach.
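As a simple illustration of threshold selection (a sketch of the weighted Youden criterion only, not an economic decision model), the weighted index can be computed and maximised over the summary estimates:

```python
def youden(sens, spec, w=0.5):
    """Weighted Youden index J(w) = 2*(w*sens + (1-w)*spec) - 1.

    w = 0.5 gives the usual J = sensitivity + specificity - 1;
    w > 0.5 up-weights sensitivity, as appropriate for a "rule out" test.
    """
    return 2.0 * (w * sens + (1.0 - w) * spec) - 1.0

def optimal_threshold(thresholds, sens, spec, w=0.5):
    # Return the threshold maximising the weighted Youden index.
    scores = [youden(se, sp, w) for se, sp in zip(sens, spec)]
    return thresholds[scores.index(max(scores))]
```

Increasing w shifts the selected threshold towards higher sensitivity, consistent with the weighting demonstration of Steinhauser et al. 4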
Hospitals Bristol NHS Foundation Trust and the University of Bristol. The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research, or the Department of Health.

DATA AVAILABILITY STATEMENT
Data and WinBUGS code for one variation of the model are openly available from figshare. WinBUGS code for other variations of the model will be provided by the first author on request.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.

APPENDIX PRODUCT NORMAL FORMULATION FOR MULTIVARIATE NORMAL RANDOM EFFECTS
We present a general approach to fitting a multivariate normal distribution of dimension N in Bayesian statistical software using MCMC simulation, such as WinBUGS or OpenBUGS. Consider the following multivariate normal distribution for the random effects of study i:

δ_i = (δ_i1, … , δ_iN)^T ~ MVN(0, Σ_N).   (A1)

This can be computed more efficiently by rewriting (A1) in an equivalent product form, as shown below. We could estimate this model by assigning a vague Wishart prior to the precision matrix, but results can be sensitive to the choice of Wishart parameters. 40 It seems preferable instead to give prior distributions directly to interpretable parameters, such as standard deviation and correlation parameters. This also allows direct comparison of results from our "full correlation matrix" and "structured correlation matrix" models in the main manuscript, or any alternative forms of the correlation matrix that might be considered appropriate in a given setting. However, because the multivariate normal distribution is specified in terms of the precision matrix, it is necessary to invert the covariance matrix at each iteration of the MCMC sampler. Furthermore, at the early stages of the sampler, we may generate covariance matrices that are near singular and lead to numerical errors when inverted.
We avoid these problems by writing the multivariate normal distribution as a sequence of conditional univariate normal distributions 40 and deriving the precision matrix iteratively without requiring a numerical inverse procedure.
Let us further denote W_n = Σ_n^{-1}, n = 1, … , N, where Σ_n is the upper-left n × n partition of Σ_N, with elements V_rs. Say that the δ_in are partitioned into two subsets: Δ_Ai of dimension p and Δ_Bi of dimension N − p. Then, the multivariate normal distribution (A1) can be written as a product of two multivariate normal distributions of dimensions p and N − p, respectively: the marginal distribution of Δ_Ai, which has mean 0 and covariance matrix simply equal to the relevant partition of Σ_N, and the conditional distribution of Δ_Bi given Δ_Ai. 41

Consider the special case where Δ_Bi = δ_iN, ie, p = N − 1. Writing

Σ_N = ( Σ_{N−1}  v_{N−1} ; v_{N−1}^T  V_NN ),

(A1) can be written as a product of δ_i(N−1) = (δ_i1, … , δ_i,N−1)^T ~ MVN(0, Σ_{N−1}) and the following univariate normal distribution for δ_iN conditional on δ_i(N−1):

δ_iN | δ_i(N−1) ~ N( v_{N−1}^T Σ_{N−1}^{-1} δ_i(N−1), F_{N−1} ),

where F_{N−1} is the Schur complement of Σ_{N−1} in Σ_N, 42 defined as

F_{N−1} = V_NN − v_{N−1}^T Σ_{N−1}^{-1} v_{N−1}.

As this formula can be applied iteratively, (A1) can be written as a product of conditional univariate normal distributions as follows:

δ_i1 ~ N(0, V_11),   δ_in | δ_i(n−1) ~ N( v_{n−1}^T W_{n−1} δ_i(n−1), F_{n−1} ),   n = 2, … , N.   (A2)

Clearly, W_1 = 1/V_11. For n = 2, … , N, we can find W_n iteratively, as a function of W_{n−1}, as follows 42(p80):

W_n = ( W_{n−1} + W_{n−1} v_{n−1} v_{n−1}^T W_{n−1} / F_{n−1}   −W_{n−1} v_{n−1} / F_{n−1} ; −v_{n−1}^T W_{n−1} / F_{n−1}   1/F_{n−1} ).   (A3)

Hence, the matrix W_n is defined by the elements (W_n)_{nn} = 1/F_{n−1} and (W_n)_{rn} = (W_n)_{nr} = −(W_{n−1} v_{n−1})_r / F_{n−1} for r < n. Substituting these elements back into (A3), we see that its upper-left block is equivalent to

(W_n)_{rs} = (W_n)_{sr} = (W_{n−1})_{rs} + F_{n−1} (W_n)_{rn} (W_n)_{sn}.
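The iterative construction of W_n can be sketched in a few lines (our own illustration, not the WinBUGS implementation); only scalar divisions by the Schur complements are required, never a general matrix inversion:

```python
import numpy as np

def precision_iterative(Sigma):
    """Build W_N = inv(Sigma) iteratively via Schur complements F_{n-1},
    mirroring the product-normal formulation: at each step the new row,
    column, and diagonal element of W_n come from W_{n-1} and a scalar
    division by F."""
    Sigma = np.asarray(Sigma, dtype=float)
    N = Sigma.shape[0]
    W = np.array([[1.0 / Sigma[0, 0]]])          # W_1 = 1 / V_11
    for n in range(1, N):
        v = Sigma[:n, n]                          # off-diagonal block v_{n-1}
        Wv = W @ v
        F = Sigma[n, n] - v @ Wv                  # Schur complement F_{n-1}
        W_new = np.empty((n + 1, n + 1))
        W_new[:n, :n] = W + np.outer(Wv, Wv) / F  # (W_n)_rs update
        W_new[:n, n] = W_new[n, :n] = -Wv / F     # new row and column
        W_new[n, n] = 1.0 / F                     # (W_n)_nn
        W = W_new
    return W
```

For a valid (positive definite) Σ, the result agrees with a direct matrix inverse, confirming the element-wise update formulae above.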