Impact of correlations between prioritized outcomes on the net benefit and its estimate by generalized pairwise comparisons

Benefit-risk balance is gaining interest in clinical trials. For the comprehensive assessment of benefits and risks, generalized pairwise comparisons are increasingly used to estimate the net benefit based on multiple prioritized outcomes.


INTRODUCTION
Randomized clinical trials play a key role in establishing the standard treatment. To better understand the characteristics of the treatment options, assessments in clinical trials are usually based on multiple efficacy and safety endpoints. Although the results for those endpoints are often analyzed and reported individually, the recently proposed regulatory recommendations for the comprehensive evaluation of benefits and risks 1,2 have led to the development of various methods to simultaneously analyze efficacy and safety. [3][4][5] Several authors have argued that the correlation between efficacy and safety should be taken into account in such simultaneous analyses. 4,6,7 Note that in this paper, positive correlations refer to the tendency of two favorable outcomes to coexist unless otherwise specified. If efficacy and safety are positively correlated, patients who experience efficacy, compared to those who do not, are less likely to suffer from toxicity. If the two are negatively correlated, patients who experience efficacy, compared to those who do not, are more likely to suffer from toxicity. When the marginal efficacy and safety remain unchanged, a positive correlation increases the chance of efficacy without toxicity (most favorable), although it also increases the chance of toxicity without efficacy (least favorable). 4,6
The net benefit based on prioritized outcomes is an emerging benefit-risk measure, which naturally incorporates the outcome correlations in its definition. 3,6,8 The net benefit can be described as the probability that a subject in the experimental arm has a more favorable overall outcome than one in the control arm minus the probability of the opposite occurring. Previous studies have provided examples of different correlation settings leading to different values of the net benefit and of its estimates from generalized pairwise comparisons. [6][7][8] However, the direction and magnitude of this impact and their theoretical foundations are not well understood, especially when the correlations differ between treatment arms. In addition, despite the possible bias with censored survival data, the impact on the estimated values has not been fully differentiated from the impact on the true values. To better understand the net benefit, it is crucial to evaluate both its theoretical characteristics at the population level and the statistical performance of its estimation methods.
The current study sought to examine the impact of outcome correlations on the net benefit and its estimate under various correlation settings. After briefly summarizing the definition and the estimation of the net benefit based on two prioritized outcomes, we evaluate the impact of correlations between two binary or two Gaussian variables on the true net benefit values via theoretical and numerical analyses. We then explore the impact of correlations between survival and categorical variables on the net benefit estimates in the presence of right censoring via simulation and application to real data. Finally, we discuss the findings and draw conclusions about the interpretation of the net benefit and its estimate in the presence of outcome correlations.

NET BENEFIT BASED ON TWO PRIORITIZED OUTCOMES: A REVIEW
This section summarizes the definition and estimation of the net benefit based on two prioritized outcomes. 3,[9][10][11]

Definition of the net benefit
We consider a comparison between the experimental arm (labeled as e) and control arm (labeled as c) based on two prioritized outcomes X and Y (X being of higher priority). The outcomes can be either discrete (eg, response to treatment, toxicity grade) or continuous (eg, time-to-event, laboratory data). Let $(X_e, Y_e) \sim P_e$ and $(X_c, Y_c) \sim P_c$ be two mutually independent sets of variables to be observed in the experimental and control arms, where $P_e$ and $P_c$ denote the underlying distributions of $(X_e, Y_e)$ and $(X_c, Y_c)$, respectively. For each endpoint Z (Z = X, Y), sets of pairs classified as win, loss, and draw are defined as
$$W_Z = \{(z_e, z_c) : z_e - z_c > \tau_Z\}, \quad L_Z = \{(z_e, z_c) : z_e - z_c < -\tau_Z\}, \quad D_Z = \{(z_e, z_c) : |z_e - z_c| \le \tau_Z\},$$
respectively, where $\tau_Z$ ($\tau_Z \ge 0$) is the threshold for a meaningful difference. For binary endpoints, in particular, the win/loss/draw sets are
$$W_Z = \{(1, 0)\}, \quad L_Z = \{(0, 1)\}, \quad D_Z = \{(0, 0), (1, 1)\}.$$
The score of the pair for the endpoint Z is defined as
$$s_Z(z_e, z_c) = 1\{(z_e, z_c) \in W_Z\} - 1\{(z_e, z_c) \in L_Z\},$$
where 1{⋅} is an indicator function. The score of the pair based on prioritization is defined as
$$s(x_e, y_e, x_c, y_c) = s_X(x_e, x_c) + 1\{(x_e, x_c) \in D_X\}\, s_Y(y_e, y_c).$$
Because of the outcome prioritization, the score on Y is utilized only when the pair is a draw on X. The net benefit is defined as the expected value of this score:
$$\Delta = E[s(X_e, Y_e, X_c, Y_c)].$$

Estimation of the net benefit with complete data
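We consider a randomized trial of $n_e$ experimental and $n_c$ control subjects. If complete data are provided for all the participants, with $(X_{ei}, Y_{ei}) \overset{\text{i.i.d.}}{\sim} P_e$ for i (i = 1, …, $n_e$) and $(X_{cj}, Y_{cj}) \overset{\text{i.i.d.}}{\sim} P_c$ for j (j = 1, …, $n_c$), the estimator for Δ via generalized pairwise comparisons 3
$$\hat{\Delta} = (n_e n_c)^{-1} \sum_{i=1}^{n_e} \sum_{j=1}^{n_c} s(X_{ei}, Y_{ei}, X_{cj}, Y_{cj})$$
is an unbiased estimator of Δ:
$$E[\hat{\Delta}] = \Delta,$$
irrespective of the correlations between the outcomes.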
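The prioritized scoring and the complete-data estimator described above can be sketched in a few lines of Python. This is a minimal illustration, not the BuyseTest implementation: the function names and the toy data are hypothetical, and higher values are assumed to be more favorable on both endpoints.

```python
from itertools import product


def endpoint_score(z_e, z_c, tau=0.0):
    """Score on one endpoint: +1 for a win, -1 for a loss, 0 for a draw."""
    diff = z_e - z_c
    if diff > tau:
        return 1
    if diff < -tau:
        return -1
    return 0


def prioritized_score(pair_e, pair_c, tau_x=0.0, tau_y=0.0):
    """Score on X first; Y is consulted only when the pair is a draw on X."""
    s_x = endpoint_score(pair_e[0], pair_c[0], tau_x)
    if s_x != 0:
        return s_x
    return endpoint_score(pair_e[1], pair_c[1], tau_y)


def net_benefit_estimate(arm_e, arm_c, tau_x=0.0, tau_y=0.0):
    """Average prioritized score over all n_e * n_c pairs (complete data)."""
    total = sum(
        prioritized_score(pe, pc, tau_x, tau_y)
        for pe, pc in product(arm_e, arm_c)
    )
    return total / (len(arm_e) * len(arm_c))


# Hypothetical toy data: (X, Y) per subject, both binary, 1 = favorable.
experimental = [(1, 1), (1, 0), (0, 1)]
control = [(1, 0), (0, 0)]
estimate = net_benefit_estimate(experimental, control)
```

In this toy example, four pairs are decided on X or Y in favor of the experimental arm, one against it, and one is a draw on both endpoints, so the estimate is (4 − 1)/6 = 0.5.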

Estimation of the net benefit in the presence of right censoring
If the survival times are censored on some study subjects, the scores for some of the pairs may be unknown. The original Gehan's scoring rule treats these uninformative pairs as equivalent to the draw pairs, leading to biased estimates. 9,10 One proposed solution, Péron's scoring rule, estimates the win/loss/draw probabilities conditional on the observed data using the Kaplan-Meier survival function. 9 This method is the default in the R package BuyseTest version 2.3.11. The well-known drawback of Péron's scoring rule is, however, that the survival function may not be fully estimated with limited follow-up, and that the remaining uninformative probabilities may still induce bias. 10 A further correction to these rules has been proposed, which imputes the contribution of the uninformative based on the average informative (win/loss/draw) scores estimated using Gehan's or Péron's scoring rule. 10 So far, the bias of these methods has been investigated separately from the impact of correlations. 9,10 Our simulation study therefore assesses the performance of the four estimation methods (Gehan, Péron, Gehan with correction, and Péron with correction) in relation to outcome correlations.
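To make the distinction between the scoring rules concrete, the following Python sketch classifies pairs of right-censored survival times under Gehan's rule for a single survival endpoint and then applies a simplified version of the correction, in which the uninformative pairs are assumed to have the same average score as the informative ones. The function names and data are hypothetical, the hierarchical handling of a second endpoint is omitted, and Péron's rule (which requires Kaplan-Meier estimates) is not shown.

```python
from collections import Counter
from itertools import product


def gehan_pair(t_e, event_e, t_c, event_c, tau=0.0):
    """Classify one pair of (time, event indicator) observations.

    A censored time is only a lower bound on the true survival time,
    so a win requires an observed event in the control subject and a
    loss requires an observed event in the experimental subject.
    """
    if t_e - t_c > tau and event_c:
        return "win"
    if t_c - t_e > tau and event_e:
        return "loss"
    if event_e and event_c:
        return "draw"  # both events observed and |t_e - t_c| <= tau
    return "uninformative"


def gehan_net_benefit(arm_e, arm_c, tau=0.0, correct=False):
    """Net benefit on one survival endpoint under Gehan's scoring rule.

    Without correction, uninformative pairs contribute 0 (as draws).
    With correction (simplified, single-endpoint assumption), the score
    is rescaled by the fraction of informative pairs.
    """
    counts = Counter(
        gehan_pair(te, de, tc, dc, tau)
        for (te, de), (tc, dc) in product(arm_e, arm_c)
    )
    n_pairs = len(arm_e) * len(arm_c)
    informative = n_pairs - counts["uninformative"]
    net = counts["win"] - counts["loss"]
    if correct:
        return net / informative if informative else 0.0
    return net / n_pairs


# Hypothetical data: (observed time in years, True if the event was observed).
arm_e = [(5.0, True), (8.0, False)]
arm_c = [(3.0, True), (6.0, False)]
uncorrected = gehan_net_benefit(arm_e, arm_c)
corrected = gehan_net_benefit(arm_e, arm_c, correct=True)
```

Here two pairs are wins, one is a loss, and the pair of two censored subjects is uninformative, giving 1/4 without the correction and 1/3 with it.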

IMPACT OF CORRELATIONS BETWEEN OUTCOMES ON THE NET BENEFIT
In this section, we examine the impact of outcome correlations on the true values of the net benefit. We focused on simple settings with two binary or two Gaussian variables for theoretical considerations and numerical computations. The numerical computations were carried out using mvtnorm package version 1.1-3 under R version 4.2.1.

Theoretical consideration
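When X and Y are both binary, the net benefit can be decomposed into win/loss contributions of each endpoint such that
$$\Delta = (p_{\text{win},X} - p_{\text{loss},X}) + (p_{\text{win},Y} - p_{\text{loss},Y}).$$
Here, $p_{\text{win},X}$ and $p_{\text{loss},X}$ are the probabilities for win/loss at X, and $p_{\text{win},Y}$ and $p_{\text{loss},Y}$ are the probabilities for win/loss at Y after a draw at X. The difference between the win and loss is the contribution of each endpoint to the overall net benefit: $\Delta_X = p_{\text{win},X} - p_{\text{loss},X}$ and $\Delta_Y = p_{\text{win},Y} - p_{\text{loss},Y}$. Now, in each arm a (a = e, c), we use $\pi_{X,a} = P[X_a = 1]$ (eg, the probability of response to treatment), $\pi_{Y,a} = P[Y_a = 1]$ (eg, the probability of absence of adverse events), and $\pi_{XY,a} = P[X_a = 1, Y_a = 1]$. If $\pi_{XY,a} > \pi_{X,a}\pi_{Y,a}$, X and Y are positively correlated in arm a. If $\pi_{XY,a} < \pi_{X,a}\pi_{Y,a}$, X and Y are negatively correlated in arm a. Note that the range of $\pi_{XY,a}$ is determined by $\pi_{X,a}$ and $\pi_{Y,a}$:
$$\max(0, \pi_{X,a} + \pi_{Y,a} - 1) \le \pi_{XY,a} \le \min(\pi_{X,a}, \pi_{Y,a}).$$
The net benefit can then be expressed as
$$\Delta = \pi_{X,e}(1 - \pi_{X,c}) - (1 - \pi_{X,e})\pi_{X,c} + \left[\pi_{XY,e}(\pi_{X,c} - \pi_{XY,c}) + (\pi_{Y,e} - \pi_{XY,e})(1 - \pi_{X,c} - \pi_{Y,c} + \pi_{XY,c})\right] - \left[(\pi_{X,e} - \pi_{XY,e})\pi_{XY,c} + (1 - \pi_{X,e} - \pi_{Y,e} + \pi_{XY,e})(\pi_{Y,c} - \pi_{XY,c})\right].$$
By expanding this equation in terms of $\pi_{XY,e}$ and $\pi_{XY,c}$, we obtain
$$\Delta = (2\pi_{X,c} - 1)\pi_{XY,e} - (2\pi_{X,e} - 1)\pi_{XY,c} + (\pi_{X,e} - \pi_{X,c}) + \pi_{Y,e}(1 - \pi_{X,c}) - \pi_{Y,c}(1 - \pi_{X,e}).$$
When $\pi_{X,c}$ < 50%, the net benefit is higher with a negative correlation in the experimental arm (smaller $\pi_{XY,e}$), whereas when $\pi_{X,c}$ > 50%, the net benefit is higher with a positive correlation in the experimental arm (larger $\pi_{XY,e}$). When $\pi_{X,e}$ < 50%, the net benefit is higher with a positive correlation in the control arm (larger $\pi_{XY,c}$), whereas when $\pi_{X,e}$ > 50%, the net benefit is higher with a negative correlation in the control arm (smaller $\pi_{XY,c}$).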
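The role of the 50% boundary can be checked numerically. The sketch below computes the contributions of X and Y directly from the four cell probabilities in each arm and evaluates them at the extreme admissible values of the joint probability P[X = 1, Y = 1]. The function names are hypothetical and the marginal probabilities are chosen purely for illustration.

```python
def cell_probabilities(pi_x, pi_y, pi_xy):
    """Return P[X=1,Y=1], P[X=1,Y=0], P[X=0,Y=1], P[X=0,Y=0]."""
    return pi_xy, pi_x - pi_xy, pi_y - pi_xy, 1.0 - pi_x - pi_y + pi_xy


def binary_net_benefit(pi_x_e, pi_y_e, pi_xy_e, pi_x_c, pi_y_c, pi_xy_c):
    """Return (contribution of X, contribution of Y, net benefit)."""
    e11, e10, e01, e00 = cell_probabilities(pi_x_e, pi_y_e, pi_xy_e)
    c11, c10, c01, c00 = cell_probabilities(pi_x_c, pi_y_c, pi_xy_c)
    # X: win if (1, 0), loss if (0, 1).
    delta_x = pi_x_e * (1.0 - pi_x_c) - (1.0 - pi_x_e) * pi_x_c
    # Y is scored only when X is a draw, ie, both 1 or both 0.
    p_win_y = e11 * c10 + e01 * c00
    p_loss_y = e10 * c11 + e00 * c01
    delta_y = p_win_y - p_loss_y
    return delta_x, delta_y, delta_x + delta_y


def joint_bounds(pi_x, pi_y):
    """Admissible range of P[X=1, Y=1] given the two marginals."""
    return max(0.0, pi_x + pi_y - 1.0), min(pi_x, pi_y)


# Identical marginals in both arms, P[X=1] = 30% (below 50%), P[Y=1] = 40%.
low, high = joint_bounds(0.30, 0.40)
_, _, nb_below = binary_net_benefit(0.30, 0.40, low, 0.30, 0.40, high)

# Same setting with P[X=1] = 80% (above 50%): the sign reverses.
low2, high2 = joint_bounds(0.80, 0.40)
_, _, nb_above = binary_net_benefit(0.80, 0.40, low2, 0.80, 0.40, high2)
```

With the most negative correlation in the experimental arm and the most positive in the control arm, the net benefit is 0.12 when the favorable-outcome probability on X is 30% and −0.12 when it is 80%, although the marginal distributions are identical between arms in both cases.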

Numerical examples
Here, we present numerical examples for the net benefit based on two binary outcomes with various correlations and marginal probabilities in each arm. We assume that in each arm a (a = e, c), the binary variables $(X_a, Y_a)$ are determined by latent bivariate Gaussian variables $(X_a^*, Y_a^*)$ with standard Gaussian margins and correlation $\rho_a$:
$$X_a = 1\{X_a^* \le \Phi^{-1}(\pi_{X,a})\}, \quad Y_a = 1\{Y_a^* \le \Phi^{-1}(\pi_{Y,a})\},$$
where Φ(⋅) is the cumulative distribution function of the standard Gaussian distribution. This Gaussian copula-based model achieves the values of $\pi_{X,a}$ and $\pi_{Y,a}$ regardless of the correlation $\rho_a$, and the value of $\pi_{XY,a}$ is larger with a stronger positive correlation. The theoretical value of $\pi_{XY,a}$ can be obtained through numerical integration 8,12:
$$\pi_{XY,a} = \int_{-\infty}^{\Phi^{-1}(\pi_{X,a})} \int_{-\infty}^{\Phi^{-1}(\pi_{Y,a})} \phi(x, y; \rho_a)\, dy\, dx,$$
where $\phi(x, y; \rho_a)$ is the joint density function of $(X_a^*, Y_a^*)$. The win/loss contributions of each endpoint and the net benefit can be calculated by simple algebra using these values.
We considered three types of experimental treatment: (A) $\pi_{X,e} = \pi_{X,c} - 10\%$ and $\pi_{Y,e} = \pi_{Y,c} + 20\%$ (eg, less effective and less toxic); (B) $\pi_{X,e} = \pi_{X,c}$ and $\pi_{Y,e} = \pi_{Y,c}$ (eg, equally effective and equally toxic); (C) $\pi_{X,e} = \pi_{X,c} - 3\%$ and $\pi_{Y,e} = \pi_{Y,c} - 3\%$ (eg, less effective and more toxic). For each combination of risk differences, we computed the values of $\pi_{XY,e}$ and $\pi_{XY,c}$ with systematically varied correlations ($\rho_e \in \{-0.9, -0.45, 0, 0.45, 0.9\}$ and $\rho_c \in \{-0.9, -0.45, 0, 0.45, 0.9\}$), probabilities of a favorable outcome at X ($\pi_{X,c} \in \{30\%, 50\%, 55\%, 60\%, 80\%\}$), and probabilities of a favorable outcome at Y ($\pi_{Y,c} \in \{5\%, 40\%, 75\%\}$) (excerpts presented in Appendix S1: Table S1).
The values of the net benefit and their breakdowns are presented for scenarios with $\pi_{X,c} = 30\%$ and $\pi_{Y,c} = 40\%$ in Table S2A to C. Within each scenario group A to C, the contribution of Y and the net benefit were highest when the correlation was negative in the experimental arm and positive in the control arm. In scenario group B (risk differences: 0% on X and Y), the contribution of Y and the net benefit exceeded 0% when $\rho_e < \rho_c$ (eg, net benefit 10.8%, the contribution of X 0.0%, the contribution of Y 10.8% with $\rho_e = -0.9$ and $\rho_c = 0.9$). Even in scenario group C (risk differences: −3% on X and Y), the contribution of Y and the net benefit exceeded 0% mainly when $\rho_e < 0$ and $\rho_c > 0$ (eg, net benefit 6.2%, the contribution of X −3.0%, the contribution of Y 9.2% with $\rho_e = -0.9$ and $\rho_c = 0.9$).
Figure 1 shows the net benefit values for each scenario. The net benefit values ranged from −7.8% to 16.0% in scenario group A (risk differences: −10% on X and 20% on Y), from −10.8% to 10.8% in B, and from −17.1% to 6.2% in C, depending on the correlations and marginal probabilities. In scenarios with $\pi_{X,c} < 50\%$, the net benefit was higher when the correlation was negative in the experimental arm (column 1), whereas in scenarios with $\pi_{X,c} > 50\%$, the net benefit was higher when the correlation was positive in the experimental arm (columns 3-5), irrespective of the risk differences and probabilities on Y. Outcome correlations in the control arm also impacted the net benefit in a direction consistent with our theoretical results.
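The quoted values for scenario groups B and C can be approximated with a short, standard-library-only Python sketch. It assumes the Gaussian copula model described above with latent standard Gaussian variables thresholded at the quantiles of the marginal probabilities, and it replaces the mvtnorm routine used in the study with a plain Simpson's rule over one dimension. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian


def joint_favorable_prob(pi_x, pi_y, rho, n_steps=2000):
    """P[X=1, Y=1] under a Gaussian copula with latent correlation rho.

    Computes the bivariate Gaussian CDF at the two marginal quantiles
    as the integral of pdf(x) * P[Y* <= k | X* = x] over x <= h,
    using Simpson's rule (n_steps must be even, |rho| < 1).
    """
    h, k = STD.inv_cdf(pi_x), STD.inv_cdf(pi_y)
    cond_sd = math.sqrt(1.0 - rho * rho)
    lower = -8.0  # effectively minus infinity for a standard Gaussian

    def integrand(x):
        return STD.pdf(x) * STD.cdf((k - rho * x) / cond_sd)

    step = (h - lower) / n_steps
    total = integrand(lower) + integrand(h)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * step)
    return total * step / 3.0


def net_benefit_components(pi_x_e, pi_y_e, rho_e, pi_x_c, pi_y_c, rho_c):
    """Return (contribution of X, contribution of Y) for two binary outcomes."""
    pxy_e = joint_favorable_prob(pi_x_e, pi_y_e, rho_e)
    pxy_c = joint_favorable_prob(pi_x_c, pi_y_c, rho_c)
    delta_x = pi_x_e - pi_x_c
    delta_y = ((2 * pi_x_c - 1) * pxy_e - (2 * pi_x_e - 1) * pxy_c
               + pi_y_e * (1 - pi_x_c) - pi_y_c * (1 - pi_x_e))
    return delta_x, delta_y


# Scenario group B: identical marginals (30% on X, 40% on Y), rho_e = -0.9, rho_c = 0.9.
dx_b, dy_b = net_benefit_components(0.30, 0.40, -0.9, 0.30, 0.40, 0.9)

# Scenario group C: both marginals 3 points lower in the experimental arm.
dx_c, dy_c = net_benefit_components(0.27, 0.37, -0.9, 0.30, 0.40, 0.9)
```

Under these assumptions, the contribution of Y should come out close to the 10.8% and 9.2% reported for groups B and C, while the contribution of X is exactly the risk difference (0% and −3%).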

Theoretical consideration
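When X and Y are both continuous, the net benefit can be decomposed into win/loss contributions of each endpoint such that
$$\Delta = (p_{\text{win},X} - p_{\text{loss},X}) + (p_{\text{win},Y} - p_{\text{loss},Y}).$$
Assuming that X and Y follow bivariate Gaussian distributions with means $\mu_{X,a}$ and $\mu_{Y,a}$, unit variances, and correlation $\rho_a$ for a = e, c, the distributions of the difference and conditional distributions are also Gaussian: $X_e - X_c \sim N(\mu_{X,e} - \mu_{X,c}, 2)$ and $Y_a \mid X_a = x_a \sim N(\mu_{Y|X,a}, \sigma_{Y|X,a}^2)$, with $\mu_{Y|X,a} = \mu_{Y,a} + \rho_a(x_a - \mu_{X,a})$, $\sigma_{Y|X,e} = \sqrt{1 - \rho_e^2}$, and $\sigma_{Y|X,c} = \sqrt{1 - \rho_c^2}$. The win/loss contributions of each endpoint can then be expressed as
$$p_{\text{win},X} = 1 - \Phi\left(\frac{\tau_X - (\mu_{X,e} - \mu_{X,c})}{\sqrt{2}}\right), \quad p_{\text{loss},X} = \Phi\left(\frac{-\tau_X - (\mu_{X,e} - \mu_{X,c})}{\sqrt{2}}\right),$$
$$p_{\text{win},Y} = \iint_{|x_e - x_c| \le \tau_X} \left[1 - \Phi\left(\frac{\tau_Y - (\mu_{Y|X,e} - \mu_{Y|X,c})}{\sqrt{\sigma_{Y|X,e}^2 + \sigma_{Y|X,c}^2}}\right)\right] f(x_e) f(x_c)\, dx_e\, dx_c,$$
$$p_{\text{loss},Y} = \iint_{|x_e - x_c| \le \tau_X} \Phi\left(\frac{-\tau_Y - (\mu_{Y|X,e} - \mu_{Y|X,c})}{\sqrt{\sigma_{Y|X,e}^2 + \sigma_{Y|X,c}^2}}\right) f(x_e) f(x_c)\, dx_e\, dx_c,$$
where $f(x_e)$ and $f(x_c)$ are the marginal density functions of $X_e$ and $X_c$. Compared to the case of binary variables, the impact of correlations on the net benefit could be more complex in the Gaussian-Gaussian case. The impact of correlation within each arm may vary according to the threshold values, mean differences, and the correlation in the other arm.
FIGURE 1 Net benefit in the binary-binary case according to the risk differences on X and Y (A −10% and 20%; B 0% and 0%; C −3% and −3%), the probability of a favorable outcome at X in the control arm (columns), and the probability of a favorable outcome at Y in the control arm (rows). The lines link the results with the same correlation values for both arms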

Numerical examples
Here, we present numerical examples for the net benefit based on two Gaussian outcomes with various correlations and treatment effects (mean differences). The win/loss contributions defined above can be obtained for given thresholds X and Y using the Gaussian density and distribution functions and numerical integration, and the corresponding net benefit values can be calculated using simple algebra.
We considered scenarios with systematically varied correlations ($\rho_e \in \{-0.9, -0.45, 0, 0.45, 0.9\}$ and $\rho_c \in \{-0.9, -0.45, 0, 0.45, 0.9\}$) and treatment effects ($\mu_{X,e} \in \{-0.6, -0.3, 0, 0.3, 0.6\}$ and $\mu_{Y,e} \in \{-0.1, 0, 0.3, 0.6\}$), using $\mu_{X,c} = 0$.
The values of the net benefit and their breakdowns are presented for scenarios with $\mu_{X,e} = 0.6$ and $\rho_e = \rho_c$ in Table S3. For each mean difference, the contributions of Y and the net benefit values greatly varied according to correlation. Even when the experimental treatment was inferior to the control in terms of Y, the contributions of Y were positive in scenarios wherein the correlations were negative in both arms (eg, net benefit 40.3%, the contribution of X 31.0%, the contribution of Y 9.3% with $\mu_{Y,e} = -0.1$ and $\rho_e = \rho_c = -0.9$). Figure 2 shows the net benefit values for each scenario. The impact of correlation in the experimental arm varied depending on the treatment effects and the correlation in the control arm. For example, the net benefit was higher with positive correlations in the experimental arm when $\mu_{X,e} = -0.6$ (column 1), whereas this impact became unclear when the control correlation was negative with $\mu_{X,e} = -0.3$ (column 2). When $\mu_{X,e} = 0$ and $\mu_{Y,e} = 0.3, 0.6$, the direction of the impact from the correlation in the experimental arm was reversed depending on the sign of the control correlation (column 3 row 3-4).
FIGURE 2 Net benefit in the Gaussian-Gaussian case, columns: mean difference at X; rows: mean difference at Y. The lines link the results with the same correlation values for both arms
Additionally, we considered four different threshold values for X ($\tau_X \in \{0.2, 0.7, 1.0, 1.5\}$) using the scenarios with $\mu_{X,e} = 0.6$, $\mu_{Y,e} = -0.1$, $\rho_e = \rho_c$, and $\tau_Y = 0.5$ (Appendix: Table A1). As the threshold value increased, both the win and loss contributions of X decreased, bringing the overall contribution of X closer to 0%. This decrease in the contribution of X nevertheless did not necessarily make the overall net benefit more conservative with $\rho_e = \rho_c = -0.9$. When $\tau_X = 1.5$, the sums of the win and loss contributions were higher with Y than with X, regardless of the correlation. With $\rho_e = \rho_c = 0.9$, most strikingly, this contribution of Y led to an overall net benefit below 0%, despite the superiority of the experimental treatment in terms of the first-priority endpoint.
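The effect of the threshold on the first-priority contribution does not depend on the correlations and can be checked in closed form. The sketch below assumes unit variances in both arms and, for the baseline, a threshold of 0.5 on X; both are assumptions made here for illustration, chosen because they reproduce the 31.0% contribution of X quoted for a mean difference of 0.6. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def gaussian_first_priority(mean_diff, tau, sd_e=1.0, sd_c=1.0):
    """Win/loss probabilities and contribution of a Gaussian endpoint.

    The difference X_e - X_c is Gaussian with mean `mean_diff` and
    variance sd_e**2 + sd_c**2; a win needs a difference above tau,
    a loss a difference below -tau.
    """
    diff = NormalDist(mean_diff, math.sqrt(sd_e ** 2 + sd_c ** 2))
    p_win = 1.0 - diff.cdf(tau)
    p_loss = diff.cdf(-tau)
    return p_win, p_loss, p_win - p_loss


# Baseline (assumed threshold 0.5) and the four alternative thresholds.
_, _, baseline = gaussian_first_priority(0.6, 0.5)
by_threshold = {
    tau: gaussian_first_priority(0.6, tau) for tau in (0.2, 0.7, 1.0, 1.5)
}
```

Both the win and the loss probability shrink as the threshold grows, so a larger share of pairs is passed on to the second-priority endpoint, which is where the correlations enter.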

IMPACT OF CORRELATIONS BETWEEN SURVIVAL AND CATEGORICAL VARIABLES ON THE NET BENEFIT ESTIMATES IN THE PRESENCE OF RIGHT CENSORING
Survival data with right censoring are frequently encountered in real clinical trials. In this section, we investigate the impact of outcome correlations on the net benefit estimates in the presence of right censoring via simulation and application to the data of the JCOG0212 trial. [13][14][15] We used survival endpoint for the first priority and categorical endpoint for the second priority, as has been commonly done in previous applications. [16][17][18] The U-statistics-based inferential method was used for Gehan's and Péron's scoring rule without the correction. 11 The bootstrap-based inferential method was used for Gehan's and Péron's scoring rule with correction. 10 The simulation and application were performed using BuyseTest package version 2.3.11 under R version 4.2.1.

Overview
In Japan, mesorectal excision with lateral lymph node dissection (ME with LLND) has been the standard surgical procedure for lower rectal cancer. Nonetheless, total mesorectal excision (TME) and mesorectal excision (ME alone) have been established as the international standard. The JCOG0212 trial (ClinicalTrials.gov: NCT00190541, UMIN-CTR: C000000034) compared ME alone vs ME with LLND in patients with clinical stage II to III rectal cancer without lateral pelvic lymph node enlargement with a primary endpoint of relapse-free survival (RFS). [13][14][15] Since ME alone was viewed as a less toxic option, the trial primarily sought to confirm the noninferiority of this procedure with a noninferiority margin of hazard ratio 1.34 and 5% one-sided significance level. A total of 701 patients from 33 institutions in Japan were assigned to ME alone or ME with LLND between June 2003 and August 2010.
In the primary analysis at 5 years after enrollment completion, 5-year RFS was 73.3% for ME alone and 73.4% for ME with LLND (hazard ratio 1.07, 90.9% confidence interval [CI] 0.84-1.36, noninferiority P = .0547). 14 In the final analysis at 7 years after enrollment completion, 7-year RFS was 70.7% for ME alone and 71.1% for ME with LLND (hazard ratio 1.09, 95% CI 0.84-1.42, noninferiority P = .0643). 15 Neither the primary analysis nor the final analysis supported the noninferiority of ME alone. Analysis on secondary endpoints, however, suggested the safety benefits of ME alone (eg, grade 3-4 complications: 16% for ME alone and 22% for ME with LLND, Fisher's exact test 2-sided P = .07). 13

Correlations between efficacy and safety
We explored the correlation between efficacy and safety in each arm of JCOG0212 using data on RFS in the final analysis and the worst postoperative complication grade (0, 1, 2, 3, or 4; 4 being the worst). The RFS was defined as the time from randomization to any relapse or death from any cause or to the latest date at which the relapse-free status was confirmed. 14,15 Information on the worst grade was collected about predefined postoperative complications that occurred during the hospital stay. 13 The Kendall's tau between the length of RFS (observed time to event/censoring) and 4 minus the worst complication grade (reversed so higher values indicated more favorable outcomes) was estimated as 0.100 in ME alone and −0.068 in ME with LLND. The frequencies and means of the worst grade by arm and RFS event occurrence are presented in Table 1. The mean of the worst grade was lower in ME alone (1.209) than in ME with LLND (1.413). In ME alone, the mean of   318). These values suggest a positive correlation between long RFS and low complication grade in ME alone and a negative correlation between long RFS and low complication grade in ME with LLND.

Simulation methods
We examined the impact of correlations between survival and categorical variables on the net benefit estimates by simulation mimicking the JCOG0212 trial. We assumed that 350 subjects per arm were enrolled during the 7-year accrual period. The experimental and control treatments in this simulation corresponded to ME alone and ME with LLND in JCOG0212, respectively. Therefore, the experimental treatment was assumed to be less effective and less toxic than the control treatment. We set 7-year survival in the control arm at 71.1% and the hazard ratio for the experimental compared to the control at 1.09. The distributions of the toxicity grade in the experimental and control arms were determined using the observed marginal frequencies in ME alone and ME with LLND (Grade 0:28. . In addition to the analysis using complete data with infinite follow-up, we explored the influence of censoring, assuming that the subjects were censored at 7 or 14 years after the completion of enrollment. Correlations were systematically varied between the survival and toxicity grades.
To create correlated outcomes, bivariate Gaussian variables with correlation $\rho_e$ or $\rho_c$ were generated for each subject in the experimental or control arm. One component was transformed into an exponentially distributed survival time with hazard $\lambda_e$ or $\lambda_c$, where $\lambda_e = 1.09\lambda_c$ and $\lambda_c = -\log(0.711)/7$. Censoring times were generated assuming that the subjects were uniformly enrolled during the 7-year accrual period and that all of them were censored at 7 or 14 years after the completion of enrollment. The theoretical censoring rates at 7 and 14 years were 57.6% and 39.7% in the experimental arm and 60.2% and 42.8% in the control arm. 19 Categorical toxicity grades were derived by discretizing the other component, $G^*_{ei}$ and $G^*_{cj}$, into 4, 3, 2, 1, and 0 with thresholds $\Phi^{-1}(0.003)$, $\Phi^{-1}(0.146)$, $\Phi^{-1}(0.349)$, and $\Phi^{-1}(0.712)$ for the experimental arm and $\Phi^{-1}(0.006)$, $\Phi^{-1}(0.200)$, $\Phi^{-1}(0.433)$, and $\Phi^{-1}(0.775)$ for the control arm. As a result of this data generation process, a positive correlation between the bivariate Gaussian variables corresponded to a positive correlation between long survival and lower toxicity grade (Table S4). The empirical Kendall's tau between the time to death/censoring and 4 minus toxicity grade was −0.50, −0.23, 0.00, 0.24, and 0.52 in the experimental arm and −0.49, −0.23, 0.00, 0.23, and 0.50 in the control arm with correlation −0.9, −0.45, 0, 0.45, and 0.9 and 7-year follow-up. Therefore, the estimates of Kendall's tau in JCOG0212 (0.100 and −0.068) may be found in the range of $0 < \rho_e < 0.45$ and $-0.45 < \rho_c < 0$ in the simulation, although other factors such as survival and censoring distributions may have affected this measure differently in the real and simulated datasets.
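The theoretical censoring rates and the discretization step can be reproduced with a few lines of Python. The sketch assumes exponential survival, uniform enrollment over the 7-year accrual period, and administrative censoring only, so that the time at risk at the analysis is uniform between the follow-up duration and the follow-up duration plus 7 years. The function names are hypothetical, and the handling of values that fall exactly on a cut point is an arbitrary choice.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def censoring_probability(hazard, follow_up, accrual=7.0):
    """P(still event-free at the analysis) for exponential survival.

    Averages exp(-hazard * t) over a time at risk t that is uniform on
    [follow_up, follow_up + accrual].
    """
    start, end = follow_up, follow_up + accrual
    return (math.exp(-hazard * start) - math.exp(-hazard * end)) / (hazard * accrual)


def toxicity_grade(latent, cumulative_probs):
    """Map a latent standard Gaussian value to a grade from 4 (worst) to 0.

    `cumulative_probs` are the increasing probabilities whose Gaussian
    quantiles serve as cut points; low latent values give high grades.
    """
    cuts = [NormalDist().inv_cdf(p) for p in cumulative_probs]
    return 4 - bisect_right(cuts, latent)


lambda_c = -math.log(0.711) / 7.0  # 7-year survival of 71.1% in the control arm
lambda_e = 1.09 * lambda_c         # hazard ratio 1.09

rates = {
    ("experimental", 7): censoring_probability(lambda_e, 7.0),
    ("experimental", 14): censoring_probability(lambda_e, 14.0),
    ("control", 7): censoring_probability(lambda_c, 7.0),
    ("control", 14): censoring_probability(lambda_c, 14.0),
}

experimental_cuts = (0.003, 0.146, 0.349, 0.712)
grade_at_median = toxicity_grade(0.0, experimental_cuts)
```

The four rates agree with the 57.6%, 39.7%, 60.2%, and 42.8% stated in the text, and a latent value of 0 (the median) falls in grade 1 in the experimental arm.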
On each simulated dataset, we estimated the net benefit, win/loss/draw/uninformative contributions of each endpoint, SE, and 95% CI based on Péron's scoring rule. 9,11 We used survival with a threshold of 0, 1, or 2 years for the first priority and the toxicity grade with a threshold of 0 for the second priority. We also obtained the net benefit estimates, SEs, and 95% CIs based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction using the 7-year follow-up data. 10 These processes were repeated 10 000 times. This number of repetitions ensured that the Monte Carlo SE of the coverage probability was ≤ 0.5%. 20 Simulation results were summarized in terms of mean estimates, mean contributions of each endpoint, coverage, square root of the mean estimated variance, and empirical SE for each correlation, follow-up duration, threshold for survival, and scoring rule. The mean estimated variance was computed as the average of the squared SE estimates. The empirical SE was computed as the SD of the point estimates. For each correlation and threshold for survival, we regarded the mean estimates obtained using the complete data (same regardless of the scoring rule) as the true values of the net benefit, as they are theoretically unbiased. The 95% CI coverage was calculated as the proportion of the estimated CIs covering the mean estimate from the complete data.
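The performance measures listed above, and the bound on the Monte Carlo SE of the coverage probability, can be sketched as follows. The function names and the three-replicate toy input are hypothetical; the confidence limits are taken as given rather than recomputed, since the inferential method differs by scoring rule.

```python
import math
from statistics import mean, stdev


def coverage_mc_se(coverage, n_rep):
    """Monte Carlo SE of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / n_rep)


def summarize(estimates, std_errors, ci_lower, ci_upper, truth):
    """Summaries over simulation replicates for one scenario."""
    covered = [lo <= truth <= hi for lo, hi in zip(ci_lower, ci_upper)]
    return {
        "mean_estimate": mean(estimates),
        "bias": mean(estimates) - truth,
        # Square root of the average of the squared SE estimates.
        "root_mean_variance": math.sqrt(mean(se ** 2 for se in std_errors)),
        # SD of the point estimates across replicates.
        "empirical_se": stdev(estimates),
        "coverage": sum(covered) / len(covered),
    }


# The SE is largest at a coverage of 50%, which gives the 0.5% bound.
worst_case = coverage_mc_se(0.5, 10_000)
at_nominal = coverage_mc_se(0.95, 10_000)

# Hypothetical toy input with three replicates.
summary = summarize(
    estimates=[0.1, 0.2, 0.3],
    std_errors=[0.1, 0.1, 0.1],
    ci_lower=[-0.1, 0.0, 0.1],
    ci_upper=[0.3, 0.4, 0.5],
    truth=0.05,
)
```

With 10 000 repetitions the worst-case Monte Carlo SE is exactly 0.5%, and it is roughly 0.2% when the coverage is near the nominal 95%.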

Simulation results
Figure 3 presents the results of mean net benefit estimates based on Péron's scoring rule. The theoretically unbiased estimates from the complete data were not impacted by the correlations with a threshold of 0 (column 3 row 1), whereas the unbiased estimates with a 1- or 2-year threshold were higher when the correlation was negative in the experimental arm and positive in the control arm (column 3 row 2-3). In the presence of right censoring, positive correlations in the experimental arm and negative correlations in the control arm resulted in higher estimates regardless of the threshold value, and these estimates were substantially biased relative to their complete data counterparts (columns 1-2). Although these estimates were already biased upward when the correlations were 0 in both arms, this bias increased further when the correlation was positive in the experimental arm and negative in the control arm (Table S5). The mean estimated uninformative probabilities at the survival endpoint and win/loss/draw probabilities at the toxicity grade were higher with shorter follow-up durations (excerpts presented in Table 2).
Figure 4 presents the results of 95% CI coverage based on Péron's scoring rule. These results are presented in tabular form in Table S6. Undercoverage mainly occurred when the correlation signs were opposite between the arms in the presence of right censoring. This undercoverage was most prominent when the correlation was positive in the experimental arm and negative in the control arm (as was the case with JCOG0212) and the follow-up duration was 7 years.
Figure 5 presents the results of mean estimates based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction under 7-year follow-up. Compared to the impact of correlation with Péron's scoring rule, the impact with Gehan's scoring rule was the same in direction but stronger in magnitude.
The estimates based on Gehan's scoring rule with correction and Péron's scoring rule with correction were close to the theoretically unbiased estimates. The results of bias compared to their complete data counterparts are presented in Table S7. Figure 6 presents the results of 95% CI coverage based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction under 7-year follow-up. These results are presented in tabular form in Table S8. In most scenarios, Gehan's scoring rule had lower coverage than Péron's scoring rule. Gehan's scoring rule with correction and Péron's scoring rule with correction had coverage close to the nominal level.
Mean estimates, square roots of the mean estimated variances, and empirical SEs using 7-year follow-up data (Gehan, Péron, Gehan with correction, and Péron with correction) and complete data are presented for scenarios with a 1-year threshold in Table S9. No major difference was observed between the square roots of the mean estimated variances and empirical SEs.

Application methods
Using data from the final analysis of JCOG0212, we estimated the net benefit, its 95% CI, and the contribution of each endpoint based on Péron's scoring rule. 9,11 We used RFS with a threshold of 0, 0.5, 1, or 2 years for the first priority and the worst postoperative complication grade with a threshold of 0 for the second priority. We also obtained the win/loss/draw probabilities for the worst grade without the condition of being a draw or uninformative at RFS. In addition, we estimated the net benefit and its 95% CI based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction. 10 These analyses were conducted to illustrate the benefit-risk assessment based on generalized pairwise comparisons in the presence of outcome correlations.

Application results
Depending on the threshold value, 38.3% to 48.2% of the pairs were evaluated based on their complication grades (draw or uninformative at RFS), among which the contribution to the overall net benefit was 6.7% to 7.5%. This contribution is relatively high compared to the result solely based on the worst grade (10.8% out of 100%). Table 4 presents the application results based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction. The estimates varied greatly depending on the scoring rule. For example, with a threshold of 1 year, Gehan's scoring rule yielded the highest estimate of 7.73%, and Péron's scoring rule with correction yielded the lowest estimate of −10.28%.

DISCUSSION
This study investigated the impact of outcome correlations on the net benefit and its estimate under various correlation settings. Our theoretical and numerical analyses revealed that the direction and magnitude of this impact were associated with the outcome distributions, defined by, for example, the probabilities of a favorable outcome and mean differences. Our simulation demonstrated that the impact of correlations on the net benefit estimates varied greatly according to the threshold values, follow-up durations, and scoring rules for censored observations. We also analyzed the JCOG0212 data using generalized pairwise comparisons.
FIGURE 6 Coverage of the 95% CI based on Gehan's scoring rule, Gehan's scoring rule with correction, and Péron's scoring rule with correction in the simulation with 7-year follow-up, columns: scoring rule; rows: threshold for survival. The lines link the results with the same correlation values for both arms
According to our theoretical and numerical results, the true net benefit values were impacted by the correlations in various directions depending on the outcome distributions. When two binary endpoints were considered, the direction of the impact of the correlation within each arm was determined by the marginal probabilities of a favorable outcome at the first-priority endpoint in the opposite arm. For example, if the probabilities in both arms are fixed at values below 50%, the net benefit is highest when the correlation is negative in the experimental arm and positive in the control arm (or, more intuitively, shifts to the advantage of treatment with the negative correlation). In such scenarios, positive values were obtained for the overall net benefit even when the experimental treatment worsened both the first- and second-priority outcomes.
This result implies that in actual data analysis, a seemingly innocuous change (such as adding an endpoint that was not improved by the experimental treatment but was heavily correlated with other endpoints) could lead to a higher net benefit value and potentially to a different conclusion. Therefore, a detailed a priori specification of the analysis plan is particularly important in regard to generalized pairwise comparisons. In addition, the contribution of each endpoint should always be presented and interpreted alongside the overall net benefit based on the multiple prioritized outcomes. These individual contributions will also be useful for checking if the thresholds for meaningful differences in higher-priority endpoints were sufficiently low to allocate greater weight to more important outcomes.
TABLE 3 Application results based on (A) RFS (Péron's scoring rule; threshold: 0, 0.5, 1, and 2 years) and the worst complication grade
In our simulation, the impact of correlations between survival and categorical variables on the theoretically unbiased estimates with infinite follow-up showed different trends according to the threshold values for survival. Correlations had no impact with a threshold of 0, which is theoretically expected because all pairs were either win or loss if a threshold of 0 was set for a continuous outcome. The unbiased estimates with a 1- or 2-year threshold were higher when the correlation was negative in the experimental arm and positive in the control arm. The negative correlation in the simulation meant that subjects with short survival had lower toxicity grades. Because the exponentially distributed time to death was denser among subjects with short survival, the toxicity results of subjects with short survival may have had a greater influence on these values. This attribute can vary depending on the shape of the survival distributions and treatment effects, as suggested by our results in the Gaussian-Gaussian case.
In simulation scenarios with limited follow-up, the estimates based on Péron's scoring rule were higher when the correlation was positive in the experimental arm and negative in the control arm, regardless of the threshold value. The positive correlation in the simulation meant that subjects with long survival had lower toxicity grades. Because the subjects with long survival were more likely to be censored and thus more likely to be evaluated based on the toxicity grade, the toxicity results of subjects with long survival may have had a greater influence on those estimates. Consequently, those estimates were biased in favor of positive correlations relative to their complete data counterparts.
The undercoverage was most prominent with simulation scenarios wherein correlation was positive in the experimental arm and negative in the control arm (as was the case with JCOG0212) and the follow-up was for 7 years. In such scenarios, the net benefit estimates based on Péron's scoring rule were upward biased for two reasons. First, as discussed above, the estimates were biased in favor of positive correlations in the presence of right censoring. Second, the portion of the score determined based on the toxicity grade increased with shorter follow-up, giving an advantage to the experimental arm, which on average had lower toxicity grades. The bias from the true net benefit values was considered to be the main cause of this undercoverage.
The choice of scoring rule for censored observations altered the impact of correlations in our simulation. Compared to Péron's scoring rule, Gehan's scoring rule exhibited greater bias and lower coverage, probably because it performs no compensation for the censored observations. The recently proposed correction remarkably decreased the bias and improved the coverage. The correction imputes the contribution of the uninformative pairs based on the average score of the informative pairs, erasing the tendency where censored subjects were more likely to be scored based on the toxicity grade. The original article stated that its underlying assumption (ie, that the true score for the uninformative pairs would be, on average, equal to the score for the informative pairs) could be "open to criticism" in the presence of heavy censoring. 10 However, our simulation results suggest that this correction may, in fact, be more meaningful in the presence of heavy censoring. Therefore, the use of this correction method would be preferable when the first-priority survival outcome is censored for some of the study subjects.
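As a rough illustration of the idea behind the correction, the sketch below applies Gehan's scoring rule to a single right-censored endpoint and then rescales the result by the proportion of informative pairs. This is one reading of the assumption stated above (uninformative pairs score, on average, like informative ones) and not the authors' implementation; the function name `gehan_net_benefit`, the strict-inequality convention, and the toy data are assumptions. With several prioritized outcomes, the correction also changes how many pairs are passed on to the lower-priority endpoint, which this single-endpoint sketch does not cover.

```python
import numpy as np

def gehan_net_benefit(t_exp, d_exp, t_ctl, d_ctl, tau=0.0):
    """Gehan scoring for one right-censored endpoint (illustrative).

    t_* : observed times; d_* : 1 if the event was observed, 0 if censored.
    Returns (uncorrected, corrected, p_uninformative).
    """
    t_exp = np.asarray(t_exp, dtype=float)[:, None]
    d_exp = np.asarray(d_exp, dtype=bool)[:, None]
    t_ctl = np.asarray(t_ctl, dtype=float)[None, :]
    d_ctl = np.asarray(d_ctl, dtype=bool)[None, :]
    diff = t_exp - t_ctl
    # Win: the control event is observed and precedes the experimental
    # observed time by more than tau (experimental time may be censored).
    win = (diff > tau) & d_ctl
    # Loss: the mirror image.
    loss = (diff < -tau) & d_exp
    # Neutral: both events observed and within tau of each other.
    neutral = (np.abs(diff) <= tau) & d_exp & d_ctl
    n_pairs = diff.size
    n_informative = int(win.sum() + loss.sum() + neutral.sum())
    p_uninf = (n_pairs - n_informative) / n_pairs
    uncorrected = float(win.sum() - loss.sum()) / n_pairs
    # Assumed correction: uninformative pairs are given the average score
    # of the informative pairs, i.e. divide by the informative count.
    if n_informative > 0:
        corrected = float(win.sum() - loss.sum()) / n_informative
    else:
        corrected = float("nan")
    return uncorrected, corrected, p_uninf

# Toy data: two subjects per arm, one censored observation in each arm.
raw, corr, p_u = gehan_net_benefit(
    t_exp=[5.0, 3.0], d_exp=[0, 1], t_ctl=[2.0, 4.0], d_ctl=[1, 0])
```

Here two pairs are wins, one is a loss, and one is uninformative, so the uncorrected value is 1/4 and the rescaled value is 1/3.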
In the application to JCOG0212 data, the net benefit estimates for ME alone based on Péron's scoring rule exceeded 0% with 1- and 2-year thresholds. However, roughly 68% of the subjects were censored in the study, and these estimates and their CIs could be unreliable. These estimates were considered upward biased based on our simulation and detailed analysis of the data. First, the correlation was positive in ME alone and negative in ME with LLND, and according to our simulation results, the estimates with limited follow-up may be biased in favor of positive correlations. The relative increase in the contribution of the worst grade among pairs whose scores were based on the worst grade during the net benefit estimation also suggests that the correlations gave an edge to ME alone. Second, the uninformative probabilities for RFS were approximately 40%, and that portion of the score was determined based on the worst grades, where ME alone was superior. Furthermore, Gehan's scoring with correction and Péron's scoring with correction yielded lower estimates, which may be closer to the true values considering our simulation results. Our application results support the conclusion of the trial that ME with LLND should remain the standard treatment in Japan. 14,15

Other authors have explored the impact of correlations on the net benefit and its estimate. [6][7][8] We confirmed the previous result that the net benefit was higher with a negative correlation between the binary efficacy (first priority) and binary safety (second priority) outcomes (ie, positive correlation between efficacy and toxicity) in the experimental arm when less than 50% of the control subjects experienced efficacy. 6 However, our theoretical analysis revealed that the impact of correlation within the experimental arm was reversed with a threshold for control efficacy of 50%.
In another study, the net benefit estimates were higher with negative correlations than with positive correlations when the correlation settings were common in both arms. 8 Based on our numerical examples and simulation results, there seems to be no simple rule to determine the direction of the impact if the correlations in both arms move simultaneously. Overall, the impact of correlations on the net benefit and its estimate may be more complex and require more careful attention than has been previously recognized. Further theoretical and simulation research using various clinical settings will help expand the knowledge about the theoretical and operational characteristics of the net benefit and the impact of correlations.
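The binary case discussed above can be checked numerically. The sketch below computes the true net benefit for binary efficacy (first priority) and binary safety (second priority) by enumerating all outcome pairs, and varies the within-arm association in the experimental arm while holding the marginals fixed. The function names and the covariance-style parameter `delta` are hypothetical choices made for this illustration and are not the parameterization used in the article; the control arm is taken to have independent outcomes for simplicity.

```python
def joint_from_marginals(p_eff, p_safe, delta=0.0):
    """2x2 joint table p[e][s] (1 = favorable) with fixed marginals.

    delta is the covariance between efficacy and safety indicators:
    delta > 0 means favorable outcomes tend to coexist.
    """
    p11 = p_eff * p_safe + delta
    p10 = p_eff - p11
    p01 = p_safe - p11
    p00 = 1.0 - p11 - p10 - p01
    if min(p00, p01, p10, p11) < 0:
        raise ValueError("delta is incompatible with the marginals")
    return [[p00, p01], [p10, p11]]

def net_benefit_binary(p_exp, p_ctl):
    """True net benefit: efficacy is compared first, safety breaks ties."""
    nb = 0.0
    for e_exp in (0, 1):
        for s_exp in (0, 1):
            for e_ctl in (0, 1):
                for s_ctl in (0, 1):
                    weight = p_exp[e_exp][s_exp] * p_ctl[e_ctl][s_ctl]
                    if e_exp != e_ctl:
                        score = 1 if e_exp > e_ctl else -1
                    elif s_exp != s_ctl:
                        score = 1 if s_exp > s_ctl else -1
                    else:
                        score = 0
                    nb += weight * score
    return nb

def correlation_effect(p_ctl_eff, delta=0.1):
    """NB with negative minus NB with positive association (exp. arm)."""
    ctl = joint_from_marginals(p_ctl_eff, 0.5)
    nb_neg = net_benefit_binary(joint_from_marginals(0.5, 0.5, -delta), ctl)
    nb_pos = net_benefit_binary(joint_from_marginals(0.5, 0.5, +delta), ctl)
    return nb_neg - nb_pos

effect_low = correlation_effect(0.3)   # control efficacy below 50%
effect_mid = correlation_effect(0.5)   # control efficacy at 50%
effect_high = correlation_effect(0.7)  # control efficacy above 50%
```

Under these assumptions, the negative association in the experimental arm gives the higher net benefit when control efficacy is 30%, makes no difference at 50%, and gives the lower net benefit at 70%, in line with the reversal at 50% described above.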
The results of generalized pairwise comparisons may be summarized as the win ratio or win odds. 21,22 As long as they share the win/loss/draw probabilities, the correlations may have a similar impact on any measure based on generalized pairwise comparisons. Caution is also required when interpreting the win ratio or win odds. The pairwise-comparison methods using non-prioritized outcomes are not impacted by the correlations. 5,23,24 If the purpose of the analysis is a comprehensive evaluation of benefits and risks, and consideration of the correlations is not essential, the methods without prioritization may be preferred. To consider the importance of endpoints, the methods of weight allocation could be explored. 5

Some limitations of our simulation settings should be considered when interpreting the findings. First, we assumed that the survival times were exponentially distributed and that the hazards were constant over time, which may not be the case with real data. As discussed above, the impact of correlations on the true net benefit values depends on various factors such as the shape of the distributions and treatment effects. Nonetheless, the cause of the bias with limited follow-up was considered independent from the outcome distributions. Second, we assumed uniform recruitment and no loss to follow-up when generating the censoring times. With random censoring under infinite follow-up, Péron's scoring rules will work well and their estimates may be less biased. 9,10 However, limited follow-up is typical in clinical trials, and the potential bias caused by this type of censoring should be carefully evaluated.
In conclusion, the net benefit was impacted by the correlations in various directions depending on the outcome distributions, and its estimate based on Gehan's or Péron's scoring rule was biased in favor of positive correlations in the presence of right censoring. The impact of correlations should be carefully considered when interpreting the net benefit and its estimate. With binary endpoints, we presented a general rule for the direction of the impact, which was related to the probabilities of a favorable outcome at the first priority endpoint. We also showed the usefulness of the recently proposed correction method for the censored survival data, 10 which greatly reduced the estimation bias even in the presence of strong outcome correlations.

ACKNOWLEDGMENTS
The analysis of JCOG0212 data was approved by the ethics committee of the Interfaculty Initiative in Information Studies, the University of Tokyo.

DATA AVAILABILITY STATEMENT
The codes used to generate the main results of numerical computation and simulation are available in Appendices S2 and S3. The JCOG0212 data belong to the Japan Clinical Oncology Group and are not publicly available. The codes used to analyze the JCOG0212 data are available from the corresponding author on reasonable request.

SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.