Meta-analysis of full ROC curves with flexible parametric distributions of diagnostic test values

Diagnostic accuracy studies often evaluate diagnostic tests at several threshold values, aiming to make recommendations on optimal thresholds for use in practice. Methods for meta-analysis of full receiver operating characteristic (ROC) curves have been proposed but still have deficiencies. We recently proposed a parametric approach that is based on bivariate time-to-event models for interval-censored data to this task. To increase the flexibility of that approach, to cover a wide range of distributions of diagnostic test values and to address the open point of model selection, we here suggest to use the generalized F family of distributions that includes previously used distributions for the bivariate time-to-event model as special cases. The results of a simulation study are given as well as an illustration by an example of population-based screening for type 2 diabetes mellitus.


| INTRODUCTION
In medical research, diagnostic accuracy studies often evaluate diagnostic tests at several threshold values, aiming to make recommendations on optimal thresholds for use in practice. In the past, methods for the metaanalysis of diagnostic studies frequently focused on summarizing only a selected single pair of sensitivity and specificity from each study that leads to a waste of the majority of observations. Methods for the meta-analysis of full receiver operating characteristic (ROC) curves have been proposed; 1-10 however, they come with several disadvantages. 11 These are, for example, the need for an identical number of thresholds across single studies, ignoring concrete threshold values or applying two-step approaches that partly neglect estimation uncertainty from the first step in the second step. As an alternative and to overcome these disadvantages, we proposed an approach based on bivariate time-to-event models for interval-censored data. 11 Thereby, we assume that the underlying diagnostic test values in the population of diseased and non-diseased follow parametric distributions, such as the Weibull, log-normal, or log-logistic distribution. While a wide variety of distributions is available, we also noted that it is not possible to choose among them using well-defined model selection criteria.
In this article, we extend our previous model 11 by referring to the family of generalized F (GF) distributions, which is governed by four parameters and includes all of our previously used distributions as special cases. Extending this basic set of distributions by one additional parameter yields the generalized log-logistic, Burr III, and Burr XII distributions as further flexible special cases of the GF distribution. As a consequence, we are now able to apply model selection criteria as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) to choose the most appropriate model because all investigated distributions are members of the GF family. In applications, individual participant data (i.e., the concrete test values of each study participant) are rarely available, and it is thus difficult to foresee the true underlying distributions of the diagnostic test values. As such, the flexibility of the extended model is necessary to cover a wide range of possible distributions.
The article is organized as follows: First, we introduce our motivating example data set in Section 2. Afterwards, we describe the used statistical methods in Section 3, which are compared in a simulation study as presented in Section 4. In Section 5, the results for the example data set are presented. Finally, in Section 6, we conclude with a critical discussion of our approach.

| DATA SET
Our work is motivated by two existing systematic reviews on population-based screening for type 2 diabetes mellitus. 12,13 Both of them investigate the diagnostic performance of glycated haemoglobin A1c (HbA 1c , measured in %) as a marker for type 2 diabetes. Contrary to existing tests for diagnosing type 2 diabetes as fasting plasma glucose (FPG) and the oral glucose tolerance test (OGTT), which are based on glucose measurement, HbA 1c is increasingly preferred because of several advantages. For example, there is no need for fasting, less biologic variability, and less preanalytic instability of HbA 1c.
14 The American Diabetes Association (ADA) 15 and the World Health Organization (WHO) 16 recommend 6.5% as the optimal threshold value for declaring a test positive. Both systematic reviews come with two drawbacks: First, they do not report on meta-analytic summary measures as, for example, sensitivity and specificity, and second, the authors selected only one threshold per single study; although, in most cases, at least two are reported. Especially, the second point is of importance because it indicates that many observations are lost. Therefore, we screened all single studies again, leading to in total 38 studies reporting on 124 pairs of sensitivity and specificity at 26 different diagnostic thresholds. A metaanalysis based on only one selected threshold per single study would therefore discard more than 70% of the available observations. Table 1 shows exemplarily for the first two studies how the data set looks like. The complete data set was also used in Hoyer et al 11 where it is given in the Supporting Information. Alternatively, it can be requested from the authors.

| STATISTICAL METHODS
Hereinafter, we briefly introduce the already published bivariate time-to-event model for estimating summary receiver operating characteristic (SROC) curves. 11 Afterwards, an extension of that model using the generalized F family of distributions is presented.

| Bivariate time-to-event model
Our recent approach for the meta-analysis of full ROC curves is based on a bivariate time-to-event model for interval-censored data. 11 T A B L E 1 First two studies reported in the systematic reviews of Bennett et al 12  The central assumption is that the diagnostic test values are interval-censored as we only know if (and how many of) these lie above or below the given thresholds of the respective study. As we are interested in sensitivity and specificity, we have to keep diseased and non-diseased participants apart and finally arrive at the close correspondence between an ROC curve and a bivariate time-to-event model for interval-censored data. For modeling diagnostic test values in the two populations of diseased and non-diseased, we used three different distributions, which are in the following given for the population of diseased (D + ) and are defined analogously for the non-diseased (D − ) 11 : • the Weibull distribution with density • the log-normal distribution with density where φ denotes the probability density function (pdf) of the normal distribution, and • the log-logistic distribution with density where μ D + and ϕ D + represent location and scale parameters, respectively, and y D + the diagnostic test values in the population of diseased. Explicitly naming the analogy of our model to the class of time-to-event models, we consider the diagnostic test values as the "time" scale. Sensitivity is then the event probability in the population of diseased (or 1-specificity in the population of non-diseased), which can be interpreted, with the diagnostic test values on the x-axis, as a cohort life-table estimator where sensitivity changes (in general, decreases) with increasing thresholds. 11 In other words, each individual study participant contributes three pieces of information: its true disease status, the interval in which its diagnostic test value lies (and which is measured on our underlying "time" scale), and finally, the study from which it originates.
As we have to include random intercepts into the linear predictors to account for correlated test values within studies, and as it is not immediately obvious where to add them, we log-transform our outcome, that is, the diagnostic test values. After log-transformation, we are in context of accelerated failure time (AFT) models, which are defined by a unified linear predictor corresponding to our assumed distributions where random effects can unequivocally be added. 17 The model equation then reads as with where y iD + and y iD − denote the diagnostic test values in the population of diseased and non-diseased of the ith study, respectively. The study-specific (indexed by i, i ∈ 1, …, N) random effects u iD + and u iD − with their corresponding variances σ 2 D + and σ 2 D − and correlation parameter ρ are used to model potential heterogeneity and across study correlations. b D + and b D − are location parameters after suitable transformations of the original location and scale parameters μ D + , ϕ D + , μ D − , and ϕ D − of the Weibull, log-normal, and log-logistic distribution. Additionally, we define λ D + and λ D − as the scale parameters of the error terms ϵ D + and ϵ D − , which will be used explicitly for predicting sensitivities and specificities as presented in Section 3.3. As given in Hoyer et al, 11 we get (exemplary for the population of diseased) • for the Weibull distribution: b D + = log μ D + ð Þ and λ D + = 1=ϕ D + , • for the log-normal distribution: b D + = μ D + and λ D + = ϕ D + , and • for the log-logistic distribution: b D + = −ϕ − 1 D + log μ D + ð Þ and λ D + = 1=ϕ D + : Finally, ϵ D + and ϵ D − are the corresponding error terms that define the distributions after the logtransformation of y D + and y D − . These distributions are the Gumbel, normal, and logistic distribution with expected value 0 and scale parameters λ D + and λ D − in case the Weibull, log-normal, and log-logistic distribution are used on the original scale, respectively.
Parameter estimation of the model is based on the maximum likelihood principle while appropriately accounting for interval-censored data as, for example, presented by Kalbfleisch and Prentice. 17 Therefore, it is possible to use every statistical software that allows for maximizing likelihood functions with bivariate random effects.
Borrowing methods from time-to-event analysis leads to various advantages as we now have a model that uses the complete information on all evaluated thresholds given in the single studies. Moreover, the bivariate timeto-event model enables us to predict sensitivities and specificities at various thresholds by simply using survival functions. Additionally, we are not restricted to studies that report on the same numbers and values of thresholds.

| The generalized F distribution
The bivariate time-to-event model from Section 3.1 comes with the need of prespecifying a parametric distribution for the underlying diagnostic test values. This could be considered as a disadvantage because external information on the type of distribution is rarely available. Moreover, Hoyer et al 11 do not report on model selection criteria to distinguish between the Weibull, log-normal, and loglogistic model. To overcome these disadvantages, we propose to use the GF family of distributions, which includes the Weibull, log-normal, and log-logistic distribution as special cases and additionally comes along with two special features. First, we are much more flexible regarding possible distributions of the underlying diagnostic test values because the GF distribution has four parameters as compared with our previously used distributions with only two parameters. Second, as all distributions are members of the same family, classical model selection criteria as AIC or BIC are applicable to compare the fit of the different models. The central idea of our proposed extension of the model presented in Section 3.1 is therefore to use the more flexible GF family of distributions that includes previously used distributions as special cases. That means especially, model Equations (4) and (5) are kept, but we now assume that y D + and y D − follow distributions that are members of the GF family of distributions. Therefore, we still have a bivariate time-to-event model, but we now allow for more flexible distributions of the underlying diagnostic test values and also for comparison of different models via AIC or BIC.
We use the parametrization for the GF family as proposed by Prentice and Cox. 18,19 The probability density function for the population of diseased D + is Equations (7) where B m 1D + , m 2D + ð Þis the beta function evaluated at m 1D + and m 2D + . The parameters m 1D + , m 2D + , and δ D + are defined as follows: and with −∞ < q D + < ∞ and p D + > 0 . Analogously, the pdf can be defined for the population of non-diseased D − with the respective parameters b D − , λ D − , p D − , and q D − . The GF distribution is thus characterized by the four parameters p • , q • , b • , λ • and is therefore very flexible. Thereby, b • and λ • represent the location and scale parameter, respectively, and m 1• and m 2• as transformations of p • and q • denote the, potentially non-integer, degrees of freedom. Moreover, by restricting p • and q • to specific values, the number of parameters can be reduced, resulting in further well-known distributions as, for example, the generalized gamma, the generalized log-logistic, the Burr III, and Burr XII as well as the Weibull, log-normal, and log-logistic distribution. Figure 1 shows the relationships between the members of the GF family. Having experienced serious numerical problems with the generalized gamma distribution, we focus in the following on the GF, the generalized log-logistic, the Burr III and Burr XII, the Weibull, log-normal, and log-logistic distribution.

| Predicting sensitivities and specificities
As we are not explicitly interested in the actual parameters of the different underlying distributions, but in sensitivities and specificities at various thresholds, we use the best linear unbiased prediction (BLUP) principle to predict them at different thresholds using the corresponding survival functions. Focusing on sensitivity and estimating the parameters in the population of diseased (D + ), the survival functions for the Weibull, log-normal, and loglogistic distributions were already presented in Hoyer et al 11 for arbitrary diagnostic test values y: • for the Weibull distribution, • for the log-normal distribution, where Φ is the cdf of the normal distribution; and • for the log-logistic distribution, For the remaining members of the GF family, we obtain 19 for arbitrary diagnostic test values y: • for the generalized log-logistic distribution, where F Beta is the cumulative distribution function (cdf) of the beta distribution with • for the Burr III distribution, where , and • for the Burr XII distribution, where x BXII = m 2D + m 2D + + exp δ D + log y ð Þ− b D + ð Þ =λ D + ð Þ , a BXII = m 2D + , b BXII = 1, the parameters q D + and m 2D + being defined as • for the generalized F distribution, where

| SIMULATION
To compare the model using the generalized log-logistic, Burr III, Burr XII, and GF distribution with the previously used Weibull, log-normal, and log-logistic approach, 11 we conducted a simulation study. The corresponding simulation program was written in SAS 9.4 (SAS Institute Inc., Cary, North Carolina).

| Setting
For our simulation study, simulated data were drawn from the Weibull, log-normal, log-logistic, and GF distribution. That means, these four distributions served as our "true underlying distributions." We used the estimates of b D + , b D − , λ D + and λ D − for the Weibull, log-normal, and log-logistic distribution and, in addition, the estimates of p D + , p D − , q D + and q D − for the GF distribution from the diabetes example in Section 2 as true parameter values for our simulation. That means, the simulation settings mirror the situation from the diabetes data set. In case of the GF distribution, the true parameter values were chosen in a way that only the GF, but not the generalized log-logistic, the Burr III, or the Burr XII model, was the true model. Additionally, we varied the true correlation ρ between the random effects (by setting it to 0, 0.28, or 0.85) in order to mimic zero, moderate, and strong heterogeneity across studies. In Table 2, the true sensitivities and specificities used for our simulation study are depicted.

| Data generation
Combining all parameter settings led to 12 different simulation scenarios. We simulated 1000 meta-analyses for each of them. In line with Hoyer et al, 11 the number of studies for each meta-analysis was generated from a uniform distribution, ranging from 10 to 30. The number of participants per study was also drawn from a uniform distribution and varied between 30 and 300. Numbers of diseased participants were generated based on a uniformly distributed prevalence between 0.3 and 0.5. For the number of thresholds per study, we also used a uniform distribution to generate values between 1 and 4, which were rounded to the nearest integer. Analogous to Hoyer et al, 11 the actual threshold values were simulated from a uniform distribution where the range depended on the number of thresholds. To be concrete, for example, a study with two different thresholds will report on a first threshold between 5.6 and 6.0 and a second threshold between 6.4 and 6.8, whereas the first threshold from a study reporting on three thresholds varies between 5.4 and 5.8, the second one between 6.1 and 6.5, and the third one between 6.7 and 7.1.
The exact diagnostic test values for every study participant were simulated from the true underlying distribution that means from the Weibull, log-normal, log-logistic, or the GF distribution. In case of the Weibull, log-normal, and log-logistic distribution, we proceeded as proposed in Hoyer et al. 11 That means, we first generated a bivariate normally distributed random effect according to Equation (6) with predefined σ 2 D + , σ 2 D − , and ρ. With respect to Equations (4) and (5), this random effect was added to b D− and b D+ . In combination with previously defined λ D− and λ D+ , Gumbel, normal, and logistic distributed diagnostic test values on the log-scale were generated. Backtransforming led to diagnostic test values of each study participant on the original scale for the Weibull, log-normal, and log-logistic distribution. These values were finally compared with the generated thresholds aiming to classify true positives and true negatives as well as the resulting four-fold tables.
For generating random numbers from the GF distribution, we adapted the algorithm implemented in the flexsurv package of the statistical software R 20 to SAS. This algorithm is based on the representation of the GF family as given in Cox 19 and inverse transform sampling.

| Estimation methods and outcomes
For each of the 1000 meta-analyses per simulation scenario, we estimated the Weibull, log-normal, log-logistic, generalized log-logistic, Burr III, Burr XII, and the GF model to predict sensitivities and specificities at different thresholds. SAS PROC NLMIXED with Gaussian quadrature and its default options was used for parameter estimation. Starting values for this procedure were determined using SAS PROC LIFEREG in line with Hoyer et al 11 to get starting values for the fixed as well as for the random effects parameters. We used SAS PROC LIFEREG assuming a log-logistic distribution to obtain starting values for the generalized log-logistic, Burr III, Burr XII, and the GF model because this is a submodel of the other approaches. Moreover, starting values for p and q were set to 1 in the population of diseased, whereas q was set to 0 in the population of non-diseased in case the Weibull, log-normal, or log-logistic distribution served as true model. If the GF distribution was the true underlying distribution, starting values for p were set to 1.8 and 0.5 in the population of diseased and nondiseased, respectively. Starting values for q were assumed to be 0.5 in the population of diseased and −0.6 in the population of non-diseased. These choices were inspired by estimates for the diabetes example data set. Sensitivities and specificities were evaluated at predefined thresholds (5.0, 5.5, 6.0, 6.5, and 7.0), as well as their corresponding t-confidence intervals with degrees of freedom determined by the default option in SAS PROC NLMIXED.
We focused on bias, empirical coverage, number of converged simulation runs, and the chosen model as our outcomes of interest. For model choice, we used AIC and BIC, defined as follows: where f is the negative of the marginal log-likelihood,θ is the vector of parameter estimates, k is the number of parameters, and N the number of studies.

| Results
In the following, we report on the results of our simulation study, which is focused on the predicted sensitivities and specificities (as our main parameters of interest). Results for the scenario with a true underlying correlation of 0.28 and the GF distribution as true model are presented in Tables 3 and 4. All remaining simulation results can be found in the Supporting Information. Tables 5 and 6 show the complete results concerning numerical robustness and model selection.

| Bias
In terms of bias, the Weibull, log-normal, and log-logistic model performed best in case they were also the true underlying model. However, also the more complex models (ie, generalized log-logistic, Burr III, Burr XII, and GF) led to satisfying results that are mostly in magnitude of the log-normal and log-logistic model or even better. Especially in case the GF distribution was the true model, simpler models, as Weibull, log-normal, and loglogistic, performed slightly worse. However, the maximum amount of bias of the GF model was 8.3 when data were generated from the log-normal distribution. This was higher compared with maximum bias of the lognormal or log-logistic model with 5.4 and 6.8, respectively. Remarkably, the Burr III model led to worst results of all models, whereas in particular, the Burr XII and the GF model showed very low biases for all evaluated thresholds. In most cases, the different models overestimated sensitivities for lower thresholds but underestimated them for higher ones (specificity vice versa). No obvious dependence on the true underlying correlation structure was visible.

| Empirical coverage
Concerning empirical coverage (to the 95% level), the results of all models were rather satisfying but in many cases below the expected 95%. In line with bias, the best results were obtained if the estimated model was also the true underlying model. Especially, the Weibull model performed worse in case another model was used for simulating the input data set. The generalized log-logistic, Burr XII, and GF model performed well and without huge outliers regardless of the true underlying distribution. Moreover, these models led to superior results compared with the simpler ones if the GF distribution served as true model. This was observed especially for the lowest and highest evaluated threshold. While the Burr III model performed well in case the Weibull, log-normal, or log-logistic distribution were the true underlying models, results for the GF distribution were rather unsatisfactory, especially for higher thresholds where coverages of 23.2% were observed. Again, the underlying correlation structure had no impact on the results.

| Number of converged runs
In terms of numerical robustness, we report on the total number of converged simulation runs (out of 1000) for each model. These results were extremely satisfying, especially for the log-normal and log-logistic model, where for many settings, the maximum number of converged simulation runs was observed. For the Burr XII and the generalized F model, slightly worse results were obtained. But this was expected as these models go along with more and more complexity resulting in an increased numerical instability. Worst results were observed for the Burr III model in case the GF distribution served as true model.

| Model selection criteria
In Table 6, we report on the number that each model has been selected because of the minimal AIC or BIC. The T A B L E 3 Bias (in percentage points): sensitivity (sens) and specificity (spec); bold entries indicate the smallest bias maximum that can be achieved is therefore 1000. For the Weibull, log-normal, and log-logistic model, it is obvious that they were the models of choice in case they served also as the true distribution of the diagnostic test values. Settings based on the Weibull distribution and a correlation of 0.28 or 0.85 led to the result that in the majority of cases, the Weibull or the log-normal model was selected. For all remaining settings, the Weibull model was rarely chosen. If the log-normal model was the true underlying distribution, also the Burr III and Burr XII distributions were frequently selected. As expected, models with a higher number of parameters, namely, the Burr XII and the GF model, were preferred when data were generated from the GF distribution. Again, the Burr III model showed worst results in case the GF distribution was the true model.

| EXAMPLE
To illustrate the practical application of our model, we use the example of population-based screening for type 2 diabetes presented in Section 2. Focusing on the estimated sensitivities and specificities at various thresholds as the parameters with most practical relevance, the corresponding results are given in the Supporting Information. Additionally, Figure 2 depicts the different estimated SROC curves as well as the ROC curves from the underlying single studies. All SROC curves are generated based on predicted sensitivities and specificities at various thresholds. As the predicted sensitivities and specificities from the log-logistic, generalized log-logistic, Burr III, Burr XII, and GF model are nearly identical, SROC curves from these models are indistinguishable. However, with respect to model selection criteria (shown in Table 7), the model based on the GF distribution, which is actually the most complex one, has the lowest AIC and BIC and is therefore the model of choice for the diabetes example. In Table 7, we also report on the estimated values for m 1D + , m 1D − ,m 2D + , and m 2D − for the models that contain these parameters. It can be seen that all of them differ from 1 (albeit sometimes with large confidence intervals), indicating that the underlying distribution of the diagnostic test values in the diabetes data set differs from the Weibull, log-normal, or log-logistic distribution. Consequently, more flexible models as the GF model reflect the underlying distribution in a better way. The predicted sensitivities and specificities from the Weibull model differ, especially for lowest and highest thresholds, from the results of the other models, which is due to the left-skewness of the distribution. At the currently recommended HbA 1c threshold of 6.5, 15,16 all models lead to high specificities between 98% and 99% and low sensitivities of about 31%. As we aim to evaluate whether HbA 1c is an appropriate primary screening test for type 2 diabetes, we are more interested in higher sensitivities that are associated with lower diagnostic thresholds. Moreover, even in case sensitivity and specificity are weighted equally (which corresponds to a choice based on the Youden index), the optimal threshold would be chosen between 5.5 and 6.0, which is quite lower than in current guidelines.
Finally, we estimated the areas under the curve (AUCs) for each model using the trapezoidal rule and a nonparametric resampling bootstrap for the 95% confidence intervals (Table 7). Reflecting the virtually indistinguishable SROC curves for the models with more than two parameters, also their AUCs are largely identical.
In the Supporting Information, we present an SAS macro that can be used to fit the generalized log-logistic, Burr III, Burr XII, and the GF model.

| DISCUSSION
In this article, we proposed an extension of the already published bivariate time-to-event model for interval-censored data, which can be used to estimate SROC curves. 11 This approach allows us to use information on each and every threshold reported by the single studies included in a meta-analysis. In contrast to existing approaches that can also use this information, 1-10 we suggest to model the distribution of the underlying diagnostic test values by using the flexible GF family of distributions. This family is characterized by four parameters and includes the Weibull, log-normal, log-logistic, generalized log-logistic, Burr III, and Burr XII distribution as special cases.
Therefore, our suggested model can cope with distributions, which are more flexible than the Weibull, log- In a small simulation study, we showed that the generalized log-logistic, Burr III, Burr XII, and GF model led to satisfying results that are comparable with the Weibull, log-normal, and log-logistic approach and in some cases even better.
Of course, our model has some limitations. Because the GF distribution is very flexible, it might be difficult to estimate the four parameters in case only small data sets are available. The still existing necessity of pre-specifying the distribution of the diagnostic test values might be considered another disadvantage of our parametric approach. It would be a definite advantage to have an alternative nonparametric approach available, for example, for more irregular or mixtures of distributions. Moreover, the estimated distributions of the diagnostic test values are only marginal ones, and distributions of the diagnostic test values in the single studies may differ from the overall one. Finally, we proposed to model the diagnostic test values on a log-scale. Depending on the diagnostic test or biomarker under evaluation, it could be meaningful to forgo this transformation and to use other  distributions as the normal or logistic distribution (as for example reported by Steinhauser et al 9 ) instead of members of the GF family of distributions. However, especially, the GF distribution is very flexible (see, for example, Cox 19 ) and can cope with a wide range of skewness, thus neglecting this slight disadvantage. In summary, bivariate parametric models for intervalcensored data based on the GF family of distributions are a very flexible approach for the meta-analysis of full ROC curves.

Highlights
What is already known: • Methods for the meta-analysis of diagnostic studies frequently focused on summarizing only a selected single pair of sensitivity and specificity from each study, which leads to a waste of the majority of observations. Methods for the meta-analysis of ROC have been proposed; however, they come with several disadvantages.

What is new:
• We extend the previously used bivariate time-to-event model for interval-censored data for the meta-analysis of full ROC curves by referring to the flexible family of generalized F (GF) distributions, which is governed by four parameters and includes all of the previously used distributions as special cases. As a consequence, model selection criteria as the AIC or the BIC to choose the most appropriate model are applicable.

Potential impact for RSM readers outside the authors' field
• In this article, we showed how methods from survival analysis can be used for the meta-analysis of full ROC curves based on a flexible set of distributions. The proposed model can be used in various medical applications to evaluate diagnostic tests.