Inference in receiver operating characteristic surface analysis via a trinormal model‐based testing approach

Receiver operating characteristic (ROC) analysis is the methodological framework of choice for the assessment of diagnostic markers and classification procedures in general, in both two‐class and multiple‐class classification problems. We focus on the three‐class problem for which inference usually involves formal hypothesis testing using a proxy metric such as the volume under the ROC surface (VUS). In this article, we develop an existing approach from the two‐class ROC framework. We define a hypothesis‐testing procedure that directly compares two ROC surfaces under the assumption of the trinormal model. In the case of the assessment of a single marker, the corresponding ROC surface is compared with the chance plane, that is, to an uninformative marker. A simulation study investigating the proposed tests with existing ones on the basis of the VUS metric follows. Finally, the proposed methodology is applied to a dataset of a panel of pancreatic cancer diagnostic markers. The described testing procedures along with related graphical tools are supported in the corresponding R‐package trinROC, which we have developed for this purpose.

diagonal of the unit square (also called the chance diagonal). If a marker results in complete separation between the two distributions, we obtain an ROC curve that passes through the point with coordinates (0,1) in the ROC space (i.e., the unit square). The area under the ROC curve (AUC) is an overall index of the diagnostic accuracy of the marker under study and can be obtained by integrating the function of the ROC curve, where F ℓ is the cumulative distribution function (cdf) of class D ℓ , where ℓ=−,+ and G ℓ =1− F ℓ is the corresponding survival function. It can be shown that the AUC is equal to the probability that a randomly chosen diseased individual attains a higher value than a randomly chosen nondiseased individual (Krzanowski & Hand, 2009). Statistical inference on the discriminatory power of a marker can be based on the ROC or the ROC curve.
Parametric and nonparametric approaches have been proposed in the literature for the assessment of the diagnostic accuracy of a marker or for the comparison of the diagnostic accuracy of competing markers in two-class problems. Specifically, widely used nonparametric approaches (DeLong et al., 1988), where two or more markers are compared through their corresponding empirical AUC. A test that involves the empirical ROC curve itself using a permutation testing procedure also exists (Venkatraman, 2000;Venkatraman & Begg, 1996). Notice that the empirical AUC is equivalent to the Mann-Whitney U statistic (Bamber, 1975).
The binormal model is widely used in the parametric setting. It assumes that the i=1, …, n ℓ , are independent and identically distributed measurements from class D ℓ and are either normally distributed, that is, X ℓi ∼ Nðμ ℓ ; σ 2 ℓ Þ for ℓ=−,+, or they can be transformed to normality through a common, latent transformation procedure (Box & Cox, 1964) adapted to the ROC curve context (Molodianovitch et al., 2006). Under the binormal model, the ROC curve has the form ROCðtÞ ¼ Φð , t∈[0,1], with Φ being the cdf of the standard normal distribution. McClish (1989) proposed an estimate for the variance of the AUC on the basis of the binormal model. Wieand et al. (1989) introduced a statistical test for comparing two classifiers under the binormal model on the basis of the respective expression of the AUC. An approach for the comparison of two classifiers using the binormal model assumptions and the corresponding ROC curve directly, without making use of the AUC, exists in the literature (Metz & Kronman, 1980;Metz et al., 1984). The corresponding parameters of the ROC curve are compared, resulting in an approximately chisquared distributed test that makes use of the shape of the ROC curve directly, rather than its corresponding AUC, with the latter being a proxy/summarizing function.
Whereas classification procedures to one of two classes have been investigated for over half a century, multiple-class classification problems have only recently acquired more attention (Nakas & Reiser, 2018;Liu et al., 2018). A detailed overview of the three-class diagnostic setting is provided in Nakas (2014). For the latter, an intermediate class D 0 is considered, which can be a transitional or early-stage disease class in practice.
Clinical examples involve dementia disease states in Alzheimer's disease, Parkinson's disease, or HIV, with patients belonging to classes with normal cognition, mild cognitive impairment, or dementia. The ROC surface was introduced as a generalization of the ROC curve, and the summarizing index of the volume under the ROC surface (VUS) for the assessment of the diagnostic accuracy of a marker was used in a three-class classification task (Scurfield, 1996).
The three-class diagnostic setting is briefly described as follows. A diagnostic marker yields measurements X=x on a continuous scale for all three groups. Without loss of generality, we assume that higher values tend to be associated with increased severity of disease. We assume that the true class membership for each individual is known (i.e., the reference standard is independently available). The i=1, …, n ℓ , independent and identically distributed measurements from class D ℓ are denoted by X ℓi for ℓ=−,0,+. We write f ℓ as the probability density function in class D ℓ and F ℓ as the corresponding cdf in group D ℓ . G ℓ =1− F ℓ is the survival function. An intuitive decision rule regarding the application of the diagnostic marker for assigning subjects into three ordinal diagnostic groups is based on a pair of cut-off points c − and c + , where c − <c + . Then, assign individuals with x ≤ c − to the healthy class D − , individuals with c − <x ≤ c + to D 0 , and x>c + to the diseased group D + . We set t − = F − (c − ) and t + =G + (c + ). The ROC surface is defined as the probability that a randomly selected subject from group D 0 has a test result between c − and c + , which can be written as where G −1 þ and F −1 − are the inverse functions of G + and F − , respectively (Nakas & Yiannoutsos, 2004). The function z=ROCs(t − ,t + ) defines a surface in the unit cube, which means (t − , t + , z) ∈ [0,1]×[0,1]×[0,1]. The assumption c − <c + implies that the ROC surface is only defined on the domain It follows that perfect discrimination is present if there is complete separation between the distributions of the three classes; that is, if ROCs(1,1)=1. On the other hand, if the three distributions are congruent, that is, x, then the ROC surface is equivalent to the chance plane defined by the equation t − +t + +z=1. Such a classifier will have no better discriminatory power than a random allocation function and is deemed to be an uninformative classifier.
The VUS is defined as which summarizes the global diagnostic accuracy for trichotomous tests. It holds that (Mossman, 1999;Xiong et al., 2006) that is, the probability that the results of the diagnostic test from a randomly selected triple with one individual from each diagnostic group will be ranked in the correct order. From the definition of the ROC surface and the convention of the ordered classes according to the disease state, it follows that the boundaries of the VUS are given by 1/6 for an uninformative marker and 1 in the case of perfect separation between the corresponding distributions of the three classes under study.
In what follows, we review the trinormal ROC surface model in Section 2 and existing VUS-based testing approaches relevant to hypothesis testing in the ROC surface analytical framework in Section 3. We then introduce in Section 4 a trinormal model-based ROC testing framework extending the ROC curve setting (Metz & Kronman, 1980;Metz et al., 1984). Section 5 presents a simulation study examining the proposed approaches and comparing them with existing inferential procedures in the ROC surface framework. Section 6 offers an illustration using data from a published study on pancreatic cancer diagnostic markers (Leichtle et al., 2013). We end with a discussion and refer to the use of the accompanying R-package trinROC, which we have specifically built for ROC surface analysis applications.

| THE TRINORMAL ROC MODEL
Considering that the data from the three classes D ℓ , ℓ=−, 0, + are normally distributed, we write the i=1, …, n ℓ , independent and identically distributed measurements from class D ℓ as X ℓi ∼ Nðμ ℓ ; σ 2 ℓ Þ. Otherwise, transformation functions such as the Box-Cox approach may be applied in order to obtain data that can reasonably be considered to be normally distributed. Bantis et al. (2015) have described a detailed implementation of the Box-Cox approach in the ROC surface context.
We reformulate the functional form in Equation (1) as follows: where the parameters a, b, c, and d are given by Estimation of an ROC surface using the trinormal model can be performed by estimating the parameters a, b, c, and d with maximum likelihood estimators of the means b μ ℓ ¼ ∑ nℓ i¼1 X ℓi =n ℓ and variances b σ 2 ℓ ¼ ∑ nℓ i¼1 ðX ℓi −b μ i Þ 2 =n ℓ for ℓ=−,0,+. The corresponding VUS is then obtained by reformulating Equation (2), as follows (Xiong et al., 2006): A shorter expression for the VUS is given by where φ is the density of the standard normal distribution.

| COMMON METHODS: HYPOTHESIS TESTING WITH THE VUS
The existing literature only involves testing procedures that use overall summary indices of the ROC surface for the assessment of the diagnostic accuracy of markers in the three-class setting (Krzanowski & Hand, 2009). As a result, the comparison of Classifier 1 with Classifier 2 will most often involve testing H 0 :VUS 1 =VUS 2 against H 1 :VUS 1 ≠VUS 2 . When diagnostic markers that have been tested/applied to the same set of patients are compared, there is an inherent correlation between marker measurements. This correlation is passed down to the corresponding estimated VUSs and has to be taken into account in the respective hypothesis-testing procedures. In the simple case where we investigate the performance of a single marker, the null hypothesis of interest is H 0 :VUS 1 =1/6, where 1/6 is the VUS of the chance plane. Representative tests involving VUS are presented in the sequel.

| Trinormal VUS testing approaches
In the trinormal ROC model framework, one can use the following testing procedures for the comparison of paired diagnostic markers (statistic Z paired ) and unpaired diagnostic markers (statistic Z unpaired ) and for the assessment of a single marker, respectively (statistic Z single ), assuming that marker values are normally distributed in each class. The unpaired case arises when the two markers being compared have been tested independently on different sets of individuals. The paired case is more frequent in practice when researchers compare a panel of markers whose data arise from the same experiment on the same set of individuals.
These statistics are considered to follow a standard normal distribution under the null hypothesis. Detailed formulae for the computation of the variance and covariance estimates have been proposed (Xiong et al., 2006(Xiong et al., , 2007. We use "VUS test" to denote such an approach in the simulation section.

| An empirical VUS test
Although the trinormal model assumes normality, the empirical nonparametric approach poses no parametric distributional assumptions on the data but is based on the empirical cdf for each class. For each triplet (X −i , X 0j , X +k ) of measurements from the three classes, the function of correct orderings (ratings) is as follows: Then the empirical VUS can be estimated by Nonparametric methods involving g VUS for the comparison of competing diagnostic markers and for the assessment of the diagnostic accuracy of a single marker in discriminating between three diagnostic groups exist in the literature (Dreiseitl et al., 2000;Nakas & Yiannoutsos, 2004).

| Comparing two markers
In the two-class case, that is, under the ROC curve framework, Metz and Kronman (1980) and Metz et al. (1984) obtained testing procedures, for the unpaired and paired cases, to compare two ROC curves in a binormal setting using only the parameters of the binormal model. We adapted this concept in the three-class setting (under the ROC surface framework) and propose a test statistic on the basis of the parameters of the trinormal model-based ROC surface given in Equation (3). Denote the two markers by Classifier k, for k=1,2, with their corresponding parameters indexed (4) are asymptotically multivariate normally distributed (Dorfman & Alf, 1968). This result holds for the binormal model-based ROC curve, but it is trivial to extend for the ROC surface. Consequently, the null hypothesis of interest is H 0 : a 1 =a 2 , b 1 =b 2 , c 1 =c 2 , d 1 =d 2 against H 1 : a 1 ≠a 2 or b 1 ≠b 2 or c 1 ≠c 2 or d 1 ≠d 2 . In order to assess the evidence against the hypothesis of equality of two ROC surfaces, we obtain the following test statistic, which is approximately chi-squared distributed with four degrees of freedom: where for unpaired data, c with entries given by the Delta method as The null hypothesis will be rejected if X 2 > X 2 α , that is, if the test statistic exceeds the chi-squared distribution with four degrees of freedom quantile at a predefined significance level α.
When the marker measurements are unpaired, the estimated parameters b a 1 ; b b 1 and hence all covariances for estimates between the two ROC surfaces are zero. As a consequence, c W can be written as the sum of the covariances of the two sets of parameters. However, when the marker measurements are paired, in order to account for the correlation introduced, we define The trinormal model-based ROC test may then be written as which follows approximately a chi-squared distribution with four degrees of freedom.
The entries of b C are given by and are estimated similarly to the two-class case. Repeated applications of the Delta method are used to show that the entries of the symmetric c W * entries are given by b w where b ρ ℓ are the corresponding pairwise Pearson correlation coefficients. The above elements are sufficient to define this symmetric matrix. We reject H 0 if X 2 > X 2 α just as we did in the unpaired case. We use "ROC test" to denote such an approach in Section 5.

| Assessment of a single marker
In the previous section, we have seen how two ROC surfaces can be compared. It is also possible to asses a single marker using a similar strategy, namely, by comparing an ROC surface with the chance plane (i.e., an uninformative ROC surface) leading to the null hypothesis H 0 : a 1 =0, b 1 =0, c 1 =1, and d 1 =0. The corresponding test statistic is with c W 1 defined as in Equation (10). Under the null hypothesis, X 2 follows approximately a chi-squared distribution with four degrees of freedom.

| SIMULATION STUDY
The simulation study consists of three parts, where in the first two parts we investigate the performance of the proposed testing procedures given in Sections 3 and 4. In the first part, we investigate tests that assess single markers in their deviation from the chance plane in order to evaluate whether a classifier performs significantly better than a random allocation procedure. In the second part, we assess the tests relevant to the comparison of two diagnostic markers. In these first two simulation parts, we sampled the data from underlying normal distributions. The third part of the simulation study evaluates the performance of the proposed procedures after applying the Box-Cox transformation for data normalization.
Log-normal and gamma distributions were considered for the sampling scenarios. The simulation was carried out using R (RStudio Team, 2016) and the R-package trinROC (Noll, 2019) specifically developed for the implementation of the proposed and widely used competing methodologies in a ROC surface analysis.
The factor stated as "crossing" flags the existence/importance of differences between variances of the three classes. Three "crossing" scenarios were considered: "no crossing" (equal variance between the three classes), "slight crossing" (an intermediate situation), and "strong crossing" (significant differences in variances between the three classes). Figure 1 displays the effect on the ROC surface for the three different scenarios of differences in variances between the three groups. When differences between variances of the distributions of the three classes exist, the ROC surface crosses the chance plane.
For the simulation on the comparison between markers, we also distinguish between unpaired and paired data by sampling from bivariate normal distributions using ρ=0 and ρ=0.5 for each of D − , D 0 , and D + . Typically, paired data arise when marker measurements are obtained from markers applied on the same set of individuals, whereas unpaired data are less frequent in practice considering cases where measurements arise from independent studies for the markers under comparison. We considered the trinormal model-based ROC test (ROC test), the trinormal VUS test (VUS test), and the bootstrap-based test (Boot test, with 500 bootstrap replications). Each result is based on 1000 iterations and an α level of 5%.

FIGURE 2
Empirical power per 1,000 iterations and an α significance level of 5%, based on the simulation results of the tests that assess single markers. The columns indicate the sample size, and the rows represent the three different assumptions of variability in D − , D 0 , and D + . Each dot corresponds to a different scenario with increasing VUS. ROC, receiver operating characteristic; VUS, volume under the ROC surface FIGURE 1 ROC surfaces representing the general shape of the surface for each level of "crossing" in the single-marker simulation study (based on samples with a theoretical VUS=0.4 and n ℓ =100). The effect of differences between variances of D − , D 0 , and D + on the ROC surface is apparent. The ROC surface crosses the chance plane in the "slight crossing" and "strong crossing" scenarios. ROC, receiver operating characteristic; VUS, volume under the ROC surface

| Single-marker assessment tests
Means and standard deviations for the distributions of D − , D 0 , and D + were chosen so that the true VUS equals 1/6, 0.2, 0.25, 0.3, 0.35, and 0.4 for each one of the six scenarios studied in this part of the simulation. These VUS scenarios cover cases with markers of increasing accuracy starting from the case of a marker with uninformative corresponding VUS (equal to 1/6). For each one of the six different VUS scenarios, the factor "crossing" was set at the following: "no crossing" implying (σ − , σ 0 , σ + ) = (1, 1, 1), "slight crossing" with (σ − , σ 0 , σ + ) = (1, 1.5, 2), and "strong crossing" with (σ − ,  Note. The shift parameter s ℓ indicates how the samples from the classes were shifted in order to attain the desired VUS. Abbreviation: VUS, volume under the receiver operating characteristic surface.  σ 0 , σ + ) = (1, 2, 3). Equidistant means between the three classes were used for fulfilling the VUS scenario assumptions. Figure 2 summarizes the first part of the simulation.
For the "no crossing" scenarios, we find that the three tests have a similar empirical power, with the trinormal model-based VUS test only very slightly dominating the others. As the sample size n ℓ increases, the empirical power also increases for VUS≠1/6 in all the tests. The cases of different variances between D − , D 0 , and D + are reflected on the second and third rows of Figure 2. The proposed trinormal model-based ROC test has consistently higher power than its competitors. As it is designed to detect differences in the parameters of the ROC surface, it rejects the null hypothesis much faster than the VUS-based tests. It correctly rejects when VUS=1/6 given that the shape of the ROC surface deviates from the chance plane. For a standard deviation ratio of 1:2:3 between D − , D 0 , and D + and sample size of 50 or bigger for each class, the trinormal modelbased ROC test will virtually always reject the null hypothesis, whereas VUS-based tests remain invariant for such differences relative to the VUS=1/6 scenario. This result was expected, given that VUS is just a proxy metric that does not capture the shape of the ROC surface per se.

| Comparison of two classifiers
In the second part of the simulation study, two marker comparisons were performed. We chose to investigate two main scenarios: in the first, a set of increasing VUS 1 = 0.2, 0.25, 0.3, 0.35, 0.4, 0.45 is compared with an uninformative marker with VUS 2 =0.2. In the second, VUS 1 =0.6, 0.65, 0.7, 0.75, 0.8, 0.85, whereas VUS 2 = 0.6. As in the first part of the simulation, samples were drawn from normal distributions with equidistant means and standard deviations depending on the factor "crossing," such that the scenarios' VUS assumptions were fulfilled. For the two marker comparison simulations, we defined "no crossing" as (σ −1 , σ 01 , σ +1 , σ −2 , σ 02 , σ +2 ) = (1, 1, 1, 1, 1, 1), "slight crossing" as (1, 1.25, 1.5, 1, 1.5, 2), and "strong crossing" as (1, 1.5, 2, 1, 2, 3), in analogy to the single-marker assessment simulation. Figure 3 depicts the simulation results for the paired data case where, for each class, measurements were drawn from bivariate normal distributions with ρ=0.5. As the simulation results for the unpaired data scenarios turned out to be very similar to those obtained from the paired setting, they are not presented here but can be found in the Supporting Information. The following two scenarios can be found also in the Supporting

| A simulation study involving the Box-Cox transformation
In the third part of the simulation, we sample from distributions other than normal. We calculated the performance of the VUS test and the ROC test introduced above before and after the application of the Box-Cox transformation (Bantis et al., 2015) with the nonnormally distributed data for single-marker assessment and for the comparison of two markers. As the results of the boot test are invariant under monotone transformations, we excluded this test from this simulation part.
We sampled from log-normal and gamma distributions. We set n ℓ =25, 50, 100. For the control of the "crossing" factor, as there exists no closed formula to compute the theoretical VUS of nonnormal data, we had to numerically calculate suitable parameters of the distributions in order to obtain the desired VUS and variability among the three classes. The parameters considered are given in Table 1. Each result is based on 1,000 iterations and an α significance level of 5%.

| APPLICATION TO A PANCREATIC CANCER DATASET
We investigate a panel of 12 diagnostic markers on the basis of measurements from 106 individuals who underwent a complete assessment for pancreatic cancer. The dataset used was first described in (Leichtle et al., 2013). The three classes under study consisted of pancreatic cancer patients (D + ), pancreatitis patients (D 0 ), and healthy controls (D − ). Table 3 shows the empirical VUS and the corresponding trinormal VUS of the Box-Cox transformed data as well as the p values of the singlemarker assessment of the three tests introduced above.
A single-marker investigation yielded highly significant p values (<.001) for most markers for all three tests. Only marker WBC was not significant for any test, whereas marker TSH was only significant for the proposed test but not for the VUS-based ones. The VUS for the markers deemed significant through the VUS-based tests varies between 0.18 and 0.6. False discovery rate-adjusted p values for pairwise comparisons are shown as heat maps in Figure 4.
Marker CA-199 is consistently the best among the markers under consideration for differentiating between the three classes. However, this result is more pronounced using the trinormal model ROC-based test, because the distributions of marker measurements are highly variable, negatively affecting VUS-based tests only. The utility of marker CA-199 in pancreatic cancer assessment is well documented in the literature (Leichtle et al., 2013).

| DISCUSSION
In this article, we have introduced a trinormal model ROC-based test that can be used to assess diagnostic markers in a three-class setting, both for the assessment of a single marker and for the comparison of two competing markers arising from the same set of data or from independent measurements. Application of the Box-Cox transformation in the ROC surface framework (Bantis et al., 2015), prior to the use of the proposed test, can be considered when significant departures from normality assumptions exist for marker measurements.
Typically, in omics applications, thousands of biomarkers may be assessed simultaneously through computational procedures that involve ROC analysis techniques. A metric such as the VUS may fail to recognize significant biomarkers in cases where the corresponding VUS has a low value; however, the corresponding biomarker is informative. In these cases, the ROC surface crosses the chance plane leading to a low VUS. Modelling the surface per se for hypothesis testing may reveal valuable diagnostic markers in cases where the VUS would fail.  The R-package trinROC that accompanies this article is available on CRAN, involves testing and plotting options, and can be used as a general-purpose package for three-class diagnostic testing in the ROC framework. Description of its use is offered in the relevant vignette and reference manual.