Covariate-adjusted measures of discrimination for survival data

Motivation Discrimination statistics describe the ability of a survival model to assign higher risks to individuals who experience earlier events: examples are Harrell’s C-index and Royston and Sauerbrei’s D, which we call the D-index. Prognostic covariates whose distributions are controlled by the study design (e.g. age and sex) influence discrimination and can make it difficult to compare model discrimination between studies. Although covariate adjustment is a standard procedure for quantifying disease-risk factor associations, there are no covariate adjustment methods for discrimination statistics in censored survival data. Objective To develop extensions of the C-index and D-index that describe the prognostic ability of a model adjusted for one or more covariate(s). Method We define a covariate-adjusted C-index and D-index for censored survival data, propose several estimators, and investigate their performance in simulation studies and in data from a large individual participant data meta-analysis, the Emerging Risk Factors Collaboration. Results The proposed methods perform well in simulations. In the Emerging Risk Factors Collaboration data, the age-adjusted C-index and D-index were substantially smaller than unadjusted values. The study-specific standard deviation of baseline age was strongly associated with the unadjusted C-index and D-index but not significantly associated with the age-adjusted indices. Conclusions The proposed estimators improve meta-analysis comparisons, are easy to implement and give a more meaningful clinical interpretation.


Introduction
A fundamental property of a prognostic marker is its ability to discriminate high from low risk patients (Hlatky et al., 2009). Markers that do not improve discrimination are also unlikely to improve other measures of clinical performance (Mihaescu et al., 2010). The discrimination performance of a marker and its incremental value can vary significantly across different studies. Variation beyond chance can be attributed to differences between studies either in the strength of association between the marker and the outcome or in the marker's distribution (Pepe et al., 2004), or to a range of other possible biases that relate to the conduct and recording in a study (Lijmer et al., 2002). In practice, markers are often used in combination, so interest lies in evaluating the discrimination of a prognostic model including one or more markers together with demographic variables. Other important aspects of prognostic ability include calibration.
For a binary outcome, the standard description of discrimination is the receiver operating characteristic (ROC) curve, which displays the trade-off between specificity (the probability that a control marker value is below the cut-off) and sensitivity (the probability that a case marker value is above the cut-off) for different marker cut-offs. The ROC curve is often summarized by the area under the curve (AUC), also called the C-statistic, which can be interpreted as the probability that the marker will correctly classify a randomly chosen pair of patients as case and noncase (Hanley and McNeil, 1982). The AUC ranges from 0 (when all predictions are wrong) and 1 (perfect predictions) with 0.5 representing the average discriminative ability of random predictions. Values below 0.5 are rarely seen other than due to small sample variation.
For a survival outcome, many measures of prognostic ability have been proposed (Choodari-Oskooei et al., 2012a, 2012b. The C-statistic can be used to measure discrimination in this setting, taking the binary outcome to be survival to a fixed follow-up time (Chambless and Diao, 2006). The C-index (Harrell et al., 1982) extends the C-statistic and avoids specifying a fixed follow-up time: it estimates the probability that given two randomly drawn patients, the patient who has an event first is predicted a higher risk. Royston and Sauerbrei (2004) proposed an alternative measure, D, which we call the D-index: it is based on a proportional hazards model and has the interpretation of an average log hazard ratio between an individual in the upper half of the risk distribution and an individual in the lower half (Pennells et al., 2014). We use the C-index because it is the most widely used measure in practice (Mallett et al., 2010), and the D-index because it adapts well to the purposes of this paper; the two measures give similar conclusions when used to evaluate the discrimination added by a new marker (Fibrinogen Studies Collaboration, 2009).
Covariate adjustment is necessary for correct assessment of disease-risk factor associations in observational studies, but its importance is rarely acknowledged in assessing discrimination. A particular issue is that covariates that form part of the study design such as age and sex can impact substantially on the prognostic ability of a model (Janes and Pepe, 2008;Kerr and Pepe, 2011): for example, the prognostic ability of a cardiovascular risk model (which includes age, sex, clinical covariates, and biomarkers) is likely to be substantially larger in a study that recruited men and women aged 40-80 than in a similar never), systolic blood pressure (SBP), total cholesterol (TC), high-density lipoprotein (HDL) cholesterol and history of diabetes at the baseline survey. Individual participant data were further restricted to subjects aged at least 40 years at baseline with the above risk factors recorded, no known history of CVD at the baseline survey, no recorded history of diabetes, and not known to be under statin treatment. Thus, our analysis data comprised 349,137 individuals from 82 studies (of which eight were clinical trials), of whom 24,369 experienced a CVD event.
The model fitted to these data was the Cox proportional hazards model stratified by study and sex. Studies that were randomized trials were additionally stratified by trial arm. Thus, for individual i in stratum s, the hazard at time t is (1) where x i is the vector of covariates (age, smoking status, SBP, TC, and HDL cholesterol) for individual i, β is a vector of corresponding regression coefficients (assumed constant across strata), and h 0s (t) is the baseline hazard at time t for individuals of stratum s. Table 1 summarizes the data from these 82 studies and the fitted Cox model.
The within-study distribution of baseline age and sex is determined by a study design and hence differs between studies. This may affect measures of discrimination. Model (1) is stratified by sex, so standard calculations automatically stratify measures of discrimination for sex. We therefore focus on the effect of baseline age. The left-hand panel in Fig. 1 plots the C-index for each study (computed as described in Section 6) against the within-study standard deviation (SD) of baseline age. Studies with more variation in baseline age tend to have substantially larger C-indices. The age-adjusted C-index, introduced in Section 4 below, is plotted in the right-hand panel, and shows no association with variation in baseline age.

Notation
We initially work in a single dataset of n individuals. For each individual i = 1,…, n, we assume that the model covariates are x i (scalar or vector), the true event time (in the absence of censoring) is , and censoring time is c i , so the event indicator is and the observed time is . The observed data are (x i , d i , t i ). Censoring is assumed to be noninformative.
We also assume that the risk score is (scalar) r(x i ), which is typically (but not necessarily) the linear predictor βxx from fitting a survival model such as h(t) = h 0 (t) exp (β x x) where β x are coefficients and h 0 (t) is the baseline hazard. Our aim is to evaluate the discrimination of r(x i ). Harrell et al. (1982) defined the C-index as a statistic measuring the degree to which sample pairs are concordant, where concordance occurs if the individual of higher predicted risk has the first event in the pair. This statistic is affected by censoring (see Section 3.2.1). The underlying estimand was defined by Heagerty and Zheng (2005) and Uno et al. (2011) as . Gonen and Heller (2005) instead stated the estimand . These estimands are equivalent in the absence of ties in r(x i ) (i.e., if all individuals have different values of r(x i )). We believe that ties in r(x i ) are important, since poorly discriminating models may have many ties, so it is important to account for them. Heagerty and Zheng's C counts pairs tied on r(x i ) as discordant, while K double-counts them (because they satisfy both r(x i ) ≥ r(x j ) and r(x j ) ≥ r(x i )). Instead, we make the natural definition of the C-index as (2) for a random pair (i, j), where 1(a) = 1 if a is true and 0 if a is false.

C-index
Pairs with tied event times are excluded from all calculations based on C ij , so that the estimand becomes . For simplicity, however, we ignore tied event times in the notation throughout this article.
We now consider various estimators of C in Eq. (2).

Harrell's estimator-Estimation
of C is complicated by the presence of censoring, because we do not know whether for pairs where the first event time is censored. Harrell et al. (1996) proposed estimating C as the mean of C ij over informative pairs, where pair (i, j) is informative if and d i = 1 or and d j = 1: that is, if the first event in the pair is observed. Harrell's estimator is often written as where #concordant counts pairs with and r(x i ) > r(x j ), or and r(x i ) < r(x j ); #tied counts pairs with r(x i ) = r(x j ); and #discordant counts pairs with and r(x i ) < r(x j ), or and r(x i ) > r(x j ). However, the informative pairs are not representative of all pairsfor example, a pair of low-risk individuals is likely to have no event and hence be noninformative-and this can cause bias in Ĉ Har (Gonen and Heller, 2005). Gonen and Heller (2005) proposed an alternative estimator to avoid bias due to censoring. To present the idea in greater generality, suppose r * (x i ) is a linear predictor from a correctly specified proportional hazards model. Then r * (x i ) − r * (x j ) represents the log hazard ratio between individuals i and j, and the probability that individuals i and j are concordant is expit {r * (

Gonen and Heller's estimator-
, and 0.5 if r(x i ) = r(x j )). Then the estimator is the average of this concordance probability, which can be written as (3) Gonen and Heller (2005) considered the special case r * (x i ) = r(x i ) (that is, they assumed that r(x i ) is a linear predictor from a correctly specified proportional hazards model) giving the simpler expression (4) Ĉ ind is an indirect measure, since it does not use the event times and relies on correct model specification.

Restricted C-index-Let
The study only gives information about discrimination at time t ≤ τ, and C can only be estimated by (implicitly) extrapolating to times t > τ: for example, Ĉ ind assumes that the proportional hazards model continues to hold at times beyond τ. To avoid extrapolation, Heagerty and Zheng (2005) proposed the restricted C-index which is estimable without extrapolation in a study with follow-up at least up to time τ.
They and Uno et al. (2011) proposed estimators of C τ to account for censoring before time τ : that of Uno et al. (2011) involves a weighted mean of C ij over informative pairs, where the weight for pair Royston and Sauerbrei (2004) proposed a measure, D, which we also call the D-index, with the interpretation of the log hazard ratio between two equal-sized prognostic groups. It is estimated in a two-stage procedure. In stage 1, the values of r(x i ) are ranked, converted to normal scores, and multiplied by . In stage 2, a proportional hazards regression is performed on the scaled normal scores, and D is the regression coefficient.

D-index
As in the work of Harrell et al. (1982), the estimand is not immediately clear. A possible estimand is based on pairs: still assuming that r * (x i ) − r * (x j ) is the true log hazard ratio between individuals i and j, the estimand D can be defined as the average of this log hazard ratio when i is drawn randomly from the upper half of the risk distribution and j is drawn randomly from the lower half (Pennells et al., 2014): that is, (5) where r̄ is the mean of the r(x i ). The algorithm above clearly estimates this estimand consistently when the proportional hazards model is correctly specified and r(x i ) is normally distributed, but it may be biased when r(x i ) is skewed (Choodari-Oskooei et al., 2012a).

Measures of discrimination with covariate adjustment
Let z i be covariates, which may or may not form part of x i . We aim to evaluate the risk score r(x i ) while adjusting for the covariates z i . Conceptually, we want to estimate C and D if we had a sample with a common value of z, or by restricting attention to pairs with equal values of z. In the ERFC data of Section 2, r(x i ) is a cardiovascular risk prediction while z i is age.

Estimand-Covariate adjustment can be defined by considering pairs that match exactly on Z, so that
is a z-specific C. For situations where C(z) is roughly constant over z, or where a summary measure of discrimination is required, the z-adjusted C is the natural measure when a risk model is stratified by z, and can be applied more widely.
Note that for continuous z with density f(z), pairs matching on z have density proportional to f(z) 2 , so that (6) It is natural to consider weighting by f(z) rather than f(z) 2 in (6), so we also define although other choices of weights are also possible. Of course, C adj,w = C adj if C(z) is constant.

Direct estimation for categorical Z-We describe an estimator as direct (like Ĉ Har ) if it uses actual event times, and indirect (like Ĉ ind ) if instead it uses risks predicted
under a model. Direct estimation is tricky with continuous z, as there may be few or no matching pairs (Section 4.1.3). We therefore first consider the case with categorical z.
In the presence of censoring, these sums are restricted to informative pairs, and the weighting scheme of Uno et al. (2011) may be used to handle random censoring. (7) where m(x i , z i ) is uncorrelated with z i . This is easily done by fitting a suitable regression for r(x i ) on z i , and defining r(z i ) = E[r(x i )|z i ] as the fitted value and m(x i , z i ) as the residual. Conceptually, we want to estimate the discrimination that is due to m(x i , z i ). The methods proposed in this section do not assume that m(x i , z i ) is independent of z i , unlike the methods proposed in Section 4.1.4. We propose direct estimation by plotting C ij against r(z j ) − r(z i ) or |r(z j ) − r(z i )|, fitting a suitable model (parametric or nonparametric), and taking as the fitted value at r(z j ) − r(z i ) = 0. In order to automate the procedure, we use a logistic regression of C ij on (r(z j ) − r(z i )) 2 with weights (8) where λ controls the amount of smoothing.

Direct estimation for continuous or multivariate Z-For some methods, it is helpful to decompose
is then the inverse logit of the estimated intercept. The procedure is illustrated in Supporting Information Fig. S1.
Again, in the presence of censoring, the sums are restricted to informative pairs, and the weighting scheme of Uno et al. (2011) may be used to handle random censoring.
We estimate Ĉ adj,w using the same logistic regression but with weights w 1 (z i , z j )w 2 (z i , z j ) where (9) since this weight approximates 1/f (z) when z i ≈ z j . Here, f(z) might be a kernel estimate of the density of z.
The above method is based on C ij which represents whether two events occur in the order predicted by r(x). An alternative is to explore whether events occur in the order predicted by m(x, z). We define "m-concordance" by replacing conditions r(x i ) > r(x j ) etc. in Eq. (2) with m(x i , z i ) > m(x j , z j ) etc. Again, fitted values of the mean of at r(z j ) − r(z i ) = 0 give an estimator of Ĉ adj , which we denote by .
Comparing the two estimators and may help to detect an unsuitable value of λ. Too small a value causes bias by giving too much weight to mismatched pairs, while too large a value causes large variance by reducing the effective number of pairs used. We used the ERFC data to compare with (Supporting Information Fig. S2) and to compare their standard errors (Supporting Information Fig. S3), for 0 ≤ λ ≤ 10. Values λ < 1 tended to give large differences between the two estimators, but values in the range 1-10 seemed broadly reasonable: later work uses λ = 3.

Indirect estimation of an approximate estimand-Now
we use the correctly specified linear predictor r * (x), which we decompose as r * (x i ) = m * (x i , z i ) + r̂*(z i ) as in (7).
This suggests defining a new estimand (10) C adj* = C adj if m(x i , z i ) is independent of z i . Appendix B and the simulation study demonstrate that C adj* ≠ C adj in general, but differences are not large.
Analogous to (3), we propose the indirect estimator in the correctly specified case m * (x, z) = m(x, z): (11) Like Ĉ ind , this estimator is unaffected by censoring, but requires correct model specification.

4.1.5
Recalibrating-To be useful in practice, a risk score must be well calibrated. Ideally, this is ensured by recalibrating the model in an external validation set. However, sometimes miscalibrated risk scores are evaluated, and in this case we want to be sure that the miscalibration does not distort the C-index.
The advantage of a direct method is that it should give correct results if the risk score is miscalibrated. The indirect methods above are very susceptible to miscalibration. However, even the direct methods of Section 4.1.3 are slightly affected by miscalibration, because the weights in (8) are affected. We therefore propose preceding all the above methods, except for Harrell's method (which is unaffected by miscalibration), by a recalibration step.
For the unadjusted indirect method, we assume r * (x) = γ r r(x) and estimate γ r by fitting the

Cox model
If r(x) is well calibrated, then γr ≈ 1. The recalibrated estimate is (12) For the adjusted methods, we assume m * (x, z) = γ m m(x, z) and r̂*(x) = γ r̂r̂( x) and estimate γ m and γ z by fitting the Cox model with fixed m(x i , z i ) and r(z i ). The recalibrated estimate is (13) Definitions (12) and (13) allow for negative values of γr and γm, which could arise with a very poorly calibrated model, and would correctly give estimates less than 0.5.
If r(x) is the linear predictor from fitting a Cox model to the data, then recalibration as proposed above is pointless: if done, it yields γm = γẑ = 1. However, the values of γ r and γ m in (12) and (13) could instead be estimated by shrinkage methods (Copas, 1983;van Houwelingen and Le Cessie, 1990).

Adjusted D
We define covariate-adjusted D by extending estimand (5) proposed in Section 3. First, zspecific D is recalling that r(z) is the z-specific mean of the r(x i ). We can also write D(z) as and so it is natural to define adjusted D as White and Rapsomaniki Page 10 That is, D adj is the average log hazard ratio between individuals matched on z who are above-average and below-average for their value of z.
We propose the following modification to the estimation algorithm for D adj given in Section 3.3. In stage 1, instead of ranking the r(x i ), we rank the m(x i , z i ) across the whole sample, form normal scores, and scale by . In stage 2, the proportional hazards regression on the scaled normal scores is adjusted for z to avoid bias from omitting a prognostic covariate (Ford et al., 1995). D̂a dj is the coefficient of the scaled normal scores in the stage 2 model.

Stratification
A stratified version of C adj may be computed by restricting attention to pairs within strata. For stratified versions of C adj* and D adj we replace m(x i , z i ) and r(z i ) above with m( This decomposition is performed by regressing r(x i , s i ) on z i within strata. In estimating D adj , the stage 2 proportional hazards regression is stratified by s.

Simulation study
We next explore the performance (bias and precision) of the proposed estimators as we vary the strengths of association of the outcome with x and z. We first consider an ideal setting where r(x) is the linear predictor from a correctly specified Cox model, and m(x i , z i ) is independent of z i so that C adj = C adj* . We then consider a nonideal setting where var(m(x i , z i )) depends on z i , so that estimands C adj , C adj,w , and C adj* potentially differ.

Data generating model
Data sets of size n = 1000 were generated with covariates x = (v, z). This relatively large sample size was designed to make optimism negligible without excessively increasing computing time for C (which is roughly proportional to n 2 ).
Covariate z represents age at baseline. A fraction 1 − ϕ of individuals belong to age group 1 and have z ~ U(40, 50), the uniform distribution from 40 to 50. The remaining fraction ϕ of individuals belong to age group 2 and have z ~ U(50, 60). Covariate v represents the biomarker of interest and was drawn as v = α(z − 50) + u with for an individual whose value of z places them in age group g. Settings for σ g are given below. We chose α to make corr(v, z) = 0.25: changing corr(v, z) to 0 or 0.5 affected the unadjusted results for both C and D, but had negligible effect on the adjusted results (results not shown).
Survival times were drawn from the Gompertz distribution where t is time in years from baseline. We took β v = 0, 0.5, 1, and β z = 0, 0.1, 0.2. Follow-up was for 15 years, and h 0 was chosen to give 50-70% censoring. With this data generating model, In simulation 1, we take ϕ = 0.5, so that z ~ U(40, 60) and weighting does not affect the estimands. We also take σ 1 = σ 2 = 1, so that r(z) and m(x, z) are independent, and the estimands C adj and C adj* are equal. In simulation 2, we take ϕ = 2/3, in order to explore weighting, and σ 1 = 1 and σ 2 = 2, so that var(m(x, z)|z) depends on z and the estimands differ.
For D, we used unadjusted D̂ and adjusted D̂a dj . The methods are summarized in Table 2 and the top half of Table 3.

Simulation scheme
A total of 1000 datasets were drawn for each combination of parameters. For each combination of parameters and each method, we computed C̄ and s C , the mean and standard deviation of Ĉ, and we present a forest-type plot showing C̄ with an interval constructed as C̄ ± 1.96s C . We computed the true values of C and D using the exact methods described in appendices C and D. Source code to reproduce the results is available as Supporting Information on the journal's web page (http://onlinelibrary.wiley.com/doi/10.1002/bimj. 201400061/suppinfo). Figure 2 shows results with corr(v, z) = 0.25: the five panels show different combinations of β v and β z .

Results for simulation 1
Considering the unadjusted results, Ĉ Har has a slightly larger mean than Ĉ ind in all panels, indicating small bias due to censoring.
Comparing the unadjusted and adjusted estimates, we see that they are similar in the second panel where β z = 0 (i.e., where there is no covariate effect to adjust for), but markedly different in the other panels where β z > 0.
We now compare the different adjustment methods in the first panel where β v = 0 so that the true value is C adj = 0.5. The "smooth 1" methods (unweighted and weighted) show substantial bias. This is likely to have arisen because concordance C ij is strongly related to r( z j ) − r(z i ) and the smoothing method inadequately allows for this association. The other adjusted methods show small positive bias. This is attributable to optimism, since the models are fitted and evaluated in the same data; however, optimism is small because of our large sample size. We therefore suggest that "smooth 2" may be preferable to "smooth 1".
In the other panels where C adj > 0.5, the indirect estimator appeared unbiased (suggesting the bias from optimism is negligible), while the smoothing estimators had small positive bias, presumably due to censoring.
Results for adjusted D similarly suggest small optimism when β v = 0 and little or no bias elsewhere (Fig. 3).

Results for simulation 2
Results for C are shown in Fig. 4. When β v = 0 (top panel), the results are very similar to simulation 1. In the other panels, the estimands C adj (estimated by the unweighted methods), C adj,w (estimated by the weighted methods), and C adj* (estimated by the indirect method) are unequal and are shown by three vertical lines. These estimands differ by up to 0.014, with C adj,w < C adj* < C adj .
remains unbiased for C adj* . The smoothing estimators are all positively biased, with weighted estimators on average slightly smaller than unweighted estimators. In the second panel, where unadjusted and adjusted C-indices are equal, the bias in the smoothing estimators is slightly smaller than that in Harrell's estimator: this suggests that all the bias observed is attributable to censoring.
Corresponding results for D are shown in Fig. 5. Small positive bias is found for all parameter values. Because no bias was seen in simulation 1, this is likely to arise from model mis-specification.

ERFC results
To illustrate the differences between methods and the effects of recalibration, we used the single Cox proportional hazards model fitted to all the ERFC studies, stratifying by study, sex, and trial arm, as displayed in Table 1. The linear predictor from the resulting model was evaluated using unweighted methods in each study separately, stratifying by sex and trial arm.
We first explore the effect of recalibration. The model is guaranteed to be well calibrated in the whole ERFC data, but it is likely to be miscalibrated in individual studies. Figure 6 plots, for each method considered, the difference between the C-indices after recalibration and before recalibration against their mean, as proposed by Bland and Altman (1986). Harrell's method is unaffected by recalibration and so is not shown in Fig. 6. We see that recalibration has a large impact for the indirect methods and very little impact for the smoothing methods.
We next compare the different methods after recalibration. Figure 7 shows Bland-Altman plots comparing the two unadjusted methods and the three adjusted methods. The top panel shows that the indirect method tends to give lower results than Harrell's method, probably due to censoring (Gonen and Heller, 2005). The four panels in the lower left-hand corner compare unadjusted and adjusted methods and show large differences. The three panels in the lower right-hand corner compare the adjusted methods. Again, the indirect method tends to give lower estimates than the other methods, while the two smoothing methods give very similar results.
Finally, we revisit Fig. 1, which used the indirect method with recalibration. The strong association of the unadjusted C-index with SD of baseline age (left-hand panel) is removed when we use the age-adjusted C-index (right-hand panel). Covariate adjustment reduces C by up to 0.21 in 77 of the 82 studies; increases C by up to 0.03 in four studies (all of which are small); and leaves C unchanged for one study where all participants have the same baseline age. Figure 8 shows the corresponding results for the D-index.

Discussion
We have proposed a number of methods for estimating an adjusted C-index. The lower part of Table 3 lists a number of desirable properties of an adjusted measure of discrimination, and evaluates the proposed adjusted C-indices and the adjusted D-index against these measures. Overall, the best adjusted C-index appears to be the indirect estimator, although we caution that it is sensitive to model mis-specification. The adjusted D-index is an excellent alternative which is easy to compute.
We have not discussed computation of standard errors in this paper: for the C-index, bootstrapping could be used, while a standard error arises naturally in the calculation of the D-index.
Optimism (overfitting) is an issue whenever models are estimated and evaluated on the same dataset (Harrell et al., 1996). It is not the focus of this paper, because the ERFC data set is large and optimism is likely to be negligible. However, there were signs of optimism in the simulation studies with β v = 0. In general, our methods should be applied after recalibrating the risk score r(x) to allow for optimism, ideally in an external validation set, or otherwise by using internal corrections such as the bootstrap (Harrell et al., 1996).
Censoring is another potential problem for our methods. It causes bias in direct methods if noninformative pairs are simply excluded. Our simulation results show that moderate biases due to censoring do occur, especially in larger C-indices (e.g., in the bottom two panels of Fig. 4). Typically, studies have both a limit τ to the length of follow-up and random censoring before that time. The method of Uno et al. (2011) can be used to correct for the random censoring, but it estimates C τ not C. Currently the only way to estimate C with data censored by end of follow-up is the indirect method.
Model mis-specification is a further potential problem, especially with the indirect methods which assume a correctly specified proportional hazards model. We have proposed a recalibration step which should remove bias due to miscalibration, but not necessarily other forms of model mis-specification. The direct methods such as Smooth 1 and Smooth 2 should be much less sensitive to model mis-specification, since they depend on observed concordance, not model-predicted concordance.
To reduce sensitivity to model mis-specification, we tried using the difference between the direct and indirect estimators of C (which may reflect the impact of model mis-specification) to correct the indirect estimator of C adj* , defining (or the equivalent on the logit scale). However, because the impact of censoring is greater in unadjusted than adjusted estimators (Fig. 2), the corrected estimator did not perform well in the simulation study and was not included in the results.
Covariate adjustment could be considered for other measures of discrimination. The net reclassification index (NRI) is a popular measure of the difference in discrimination between two models (Pencina et al., 2008). Because the NRI is based on within-individual comparisons, it is neither necessary nor possible to adjust it for covariates, although a covariate-specific NRI could be a useful quantity. However, correct calibration is required to avoid misleading NRI results (Hilden and Gerds, 2014). Another way to evaluate the value of adding a new biomarker to a risk prediction model could be to evaluate the discrimination of the new model, adjusting for the covariates in the original model.
Further extensions include ways to account for competing risks and to obtain timedependent measures of discrimination (Wolbers et al., 2009). A common feature of these approaches is that the C-index needs to be computed for different time points which limit the comparability of different studies that have different length of follow-up. An open question is which metric is more appropriate for which data and whether these different approaches can produce different conclusions in some scenarios. The methodology presented here could be extended to incorporate these extensions.
Our approach should not be confused with ROC regression (Tosteson and Begg, 1988). ROC regression methods model the accuracy of a diagnostic test as a function of covariates, not how the disease is associated with covariates. A possible use of such an approach might be to find subgroups where the marker should not be used or to find optimal cutoffs. Here we assume predictions that have been optimized with respect to disease risk and our aim is not (primarily) to explain which covariates affect accuracy but to adjust discrimination statistics for the confounding effect of the covariates that do.
In summary, we have proposed covariate-adjusted measures of concordance. There are many benefits to such measures . In the meta-analysis setting, they facilitate comparisons between studies with different covariate distributions ( Figs. 1 and 8). They also enable matched case-control studies nested within cohort studies to be compared with standard cohort studies, since the former can only yield measures of discrimination adjusted for the matching variables. We advocate adjustment, at least for study design variables such as age, sex, and study centre, whenever measures of discrimination are to be compared between studies with different distributions of the design variables. Suppose X = (V, Z) where V and Z are binary with p(Z = 1) = 0.5, p(V = 1|Z = 0) = π 0 = 0.2 and p(V = 1|Z = 1) = π 1 = 0.8. Suppose h i (t) = h 0 (t) exp(V + Z) so r(X) = V + Z. We want to adjust for Z so m(X, Z) = V − E [V|Z]. When Z i = Z j = z, |m(X i , Z i ) − m(X j , Z j )| is 1 with probability 2π z (1 − π z ) and 0 otherwise, and this distribution is independent of z since π 1 = 1 − π 0 . So C(z = 0) = C(z = 1) = C adj = 0.574. But for pairs with Z i ≠ Z j , |m(X i , Z i ) − m(X j , Z j )| has a different distribution (but the same SD). As a result, C adj* = 0.598 ≠ C adj .

C True values of C(z), Cadj, and Cadj* in the simulation study
For simplicity we assume β v ≥ 0. We also assume that the model is correctly specified. We first compute C(z), for which the distribution of z is not needed. Define I(σ 2 ) = E[expit {|A|}] when A ~ N(0, σ 2 ). Then where ϕ(u) is the standard normal density function.

D True value of Dadj
We have D = E[r(x i ) − r(x j )|z i = z j , m(x i , z i ) < 0 < m(x j , z j )]. We again assume that the model is correctly specified. Since m(  ERFC data: unadjusted and age-adjusted C-index for a model including baseline age, smoking, SBP, TC, and HDL, plotted for each study against the SD of baseline age in that study. Analyses are stratified by sex and trial arm. Each point represents one study.   ERFC data: unadjusted and age-adjusted D-index for a model including baseline age, smoking, SBP, TC, and HDL, plotted for each study against the SD of baseline age in that study. Analyses are stratified by sex and trial arm. Each point represents one study.  Summary of methods for unadjusted measures of discrimination.

Harrell's C Indirect
Notation Ĉ Har Ĉ ind DD escription Mean concordance among informative pairs Mean expected concordance a) Cox model on scaled rankit of risk score Quantity estimated C C D

Recalibration
No impact Needed Implicit in method a) Expected concordance is computed assuming that r(x) is the linear predictor in a correctly specified Cox model.