Statistical inference for Cox proportional hazards models with a diverging number of covariates

For statistical inference on regression models with a diverging number of covariates, the existing literature typically makes sparsity assumptions on the inverse of the Fisher information matrix. Such assumptions, however, are often violated under Cox proportional hazards models, leading to biased estimates and confidence intervals with under-coverage. We propose a modified debiased lasso method, which solves a series of quadratic programming problems to approximate the inverse information matrix without imposing sparse matrix assumptions. We establish asymptotic results for the estimated regression coefficients when the dimension of covariates diverges with the sample size. As demonstrated by extensive simulations, our proposed method provides consistent estimates and confidence intervals with nominal coverage probabilities. The utility of the method is further demonstrated by assessing the effects of genetic markers on patients' overall survival in the Boston Lung Cancer Survival Cohort, a large-scale epidemiology study investigating mechanisms underlying lung cancer.


Introduction
The Cox proportional hazards model (Cox, 1972), a semiparametric model with an unspecified baseline hazard function, has been widely used for the analysis of censored time-to-event data. With a fixed dimension of covariates, Cox (1972) proposed maximum partial likelihood estimation (MPLE) to infer the regression coefficients, and Andersen and Gill (1982) proved the asymptotic distributional results for MPLE using martingale theory.
Technological advances have made it possible to collect a large amount of information in biomedical studies. For example, the Boston Lung Cancer Survival Cohort (BLCSC), the motivating study for this work, has acquired abundant clinical, genetic, epigenetic and genomic data, which enable comprehensive investigations of the molecular mechanisms underlying lung cancer survival (McKay et al., 2017). The high dimensionality of the collected covariates challenges traditional parameter estimation and uncertainty quantification based on Cox models. In high-dimensional settings, where the number of covariates p increases with the sample size n or is even greater than n, conventional maximum partial likelihood estimation is usually ill-conditioned. Penalized estimators have emerged as a powerful tool for simultaneous variable selection and estimation (Tibshirani, 1997; Fan and Li, 2002; Gui and Li, 2005; Antoniadis et al., 2010). Recently, Huang et al. (2013) and Kong and Nan (2014) derived non-asymptotic oracle inequalities for the lasso estimator in the Cox model. However, none of these works dealt with statistical inference for Cox models with high-dimensional covariates.
Existing literature on inference for high-dimensional models mainly concerns linear regression. Zhang and Zhang (2014), van de Geer et al. (2014) and Javanmard and Montanari (2014) developed inference procedures for linear models based on debiasing the lasso estimator, via low-dimensional projection or by inverting the Karush-Kuhn-Tucker condition. van de Geer et al. (2014) extended the debiased lasso idea to generalized linear models using nodewise lasso regression. Ning and Liu (2017) focused on hypothesis testing and devised decorrelated score, Wald and likelihood ratio tests for inference on a low-dimensional parameter in generalized linear models based on projection theory.
There has been limited progress in inference for the Cox model with high-dimensional covariates. Fang et al. (2017) developed decorrelated tests for low-dimensional components under high-dimensional Cox models, using ideas similar to Ning and Liu (2017). Kong et al. (2018) extended the debiased lasso approach of van de Geer et al. (2014) to potentially misspecified Cox models, and used nodewise lasso regression to estimate the inverse information matrix. Yu et al. (2018) proposed a debiased lasso approach, estimating the inverse information matrix with a CLIME estimator adapted from Cai et al. (2011). Most of these works restrict the number of non-zero elements in each row of the inverse information matrix to be small, i.e. $\ell_0$ sparsity. However, as found in Xia et al. (2020), the sparse inverse information matrix assumption has no practical interpretation beyond linear regression models and often fails to hold in the Cox model, so these methods do not perform satisfactorily in high-dimensional Cox model settings. For example, as evidenced by our extensive simulations, these methods cannot correct the biases of lasso estimators or construct confidence intervals with the desired coverage probabilities, even when the number of regression coefficients is moderate relative to the sample size.
Our work pertains to the "large n, diverging p" framework, where p < n and p is allowed to increase with n to infinity, reflecting the setting of the motivating BLCSC data with n = 561 and p = 231. Under this framework, we draw inference based on Cox models without imposing sparsity on the inverse information matrix. Specifically, we propose a debiased lasso approach that solves a series of quadratic programming problems to estimate the inverse information matrix. We use quadratic programming as a means of balancing the bias-variance trade-off while avoiding the unrealistic $\ell_0$ sparsity assumption on the large inverse information matrix in the Cox model. Our work adds to the literature in the following aspects. First, unlike Javanmard and Montanari (2014), our work entails careful treatment of a sum of terms that are neither independent nor identically distributed in the empirical loss function, and we consider random designs instead of deterministic designs. Second, we find that tuning parameter selection for the inverse information matrix estimation is crucial for bias correction. For example, a related work (Yu et al., 2018) proposed to select tuning parameters by minimizing the cross-validated difference between the product of the information matrix with its estimated inverse and the identity matrix, but this was found to perform poorly. In contrast, we propose a cross-validation procedure that tunes parameters by hard thresholding the debiased estimates when solving the quadratic programming problems, which yields satisfactory numerical performance.
The article is organized as follows. Section 2 introduces the proposed debiased lasso approach, where the inverse information matrix is estimated via quadratic programming with a novel cross-validation procedure for selecting the tuning parameter. Section 3 lays the theoretical foundation for reliable inference on linear combinations of the Cox regression parameters using the debiased lasso estimator. We examine the finite sample performance of our proposed method with simulation studies in Section 4, apply it to analyze the BLCSC data in Section 5, and conclude the paper with a few remarks in Section 6. We state several useful technical lemmas and provide proofs of the main theorems in the Appendix, and defer the proofs of all the lemmas to the online supplementary materials.

Background and set-up
We introduce notation that will be used throughout this article. For a vector $x = (x_1, \ldots, x_r)^T \in \mathbb{R}^r$, let $x^{\otimes 0} = 1$, $x^{\otimes 1} = x$ and $x^{\otimes 2} = xx^T$. The $\ell_q$-norm of $x$ is $\|x\|_q = (\sum_{j=1}^r |x_j|^q)^{1/q}$ for $q \ge 1$, and the $\ell_0$-norm is $\|x\|_0 = \sum_{j=1}^r 1(x_j \neq 0)$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{m \times r}$, $\|A\|$ denotes the induced matrix norm. The Cox model stipulates that the hazard function for the underlying failure time $T$, conditional on a $p$-dimensional vector of covariates $X = (X^{(1)}, \ldots, X^{(p)})^T \in \mathbb{R}^p$, is $h(t|X) = h_0(t)\exp\{X^T\beta^0\}$, where $h_0(t)$ is an unknown baseline hazard function and $\beta^0 = (\beta^0_1, \ldots, \beta^0_p)^T \in \mathbb{R}^p$ is an unknown vector of regression coefficients. With $T$ subject to right censoring, the observed survival time is $Y = \min(T, C)$, where the censoring time $C$ is assumed to be independent of $T$ given $X$. Let $\delta = 1(T \le C)$ denote the event indicator. Based on $n$ independent and identically distributed observations $\{Y_i, X_i, \delta_i\}_{i=1}^n$, the goal of this paper is to estimate and draw inference on the regression coefficients $\beta^0$ when $p < n$ but $p \to \infty$ as $n \to \infty$.

Debiasing the lasso estimator
When $p$ is fixed, a natural approach to inferring $\beta^0$ is maximum partial likelihood estimation (MPLE), which maximizes the log partial likelihood function
$$\frac{1}{n}\sum_{i=1}^n \delta_i \left[ X_i^T\beta - \log\Big\{ \sum_{j=1}^n 1(Y_j \ge Y_i)\exp(X_j^T\beta) \Big\} \right]. \tag{1}$$
However, with a diverging $p$ of our interest, MPLE may suffer from numerical instability and yield unreliable inference; see Section 4.
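The log partial likelihood can be evaluated directly from the data; below is a minimal numpy sketch of the (averaged) negative log partial likelihood, assuming no tied event times. The function and variable names are ours, not from the paper's R implementation.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, Y, delta):
    """Negative log partial likelihood for the Cox model, averaged over n.
    X: (n, p) covariates; Y: (n,) observed times; delta: (n,) event indicators."""
    eta = X @ beta                           # linear predictors X_i^T beta
    # risk_sum[i] = sum over the risk set {j : Y_j >= Y_i} of exp(X_j^T beta)
    at_risk = Y[None, :] >= Y[:, None]       # at_risk[i, j] = 1(Y_j >= Y_i)
    risk_sum = (at_risk * np.exp(eta)[None, :]).sum(axis=1)
    return -np.mean(delta * (eta - np.log(risk_sum)))
```

At $\beta = 0$ the risk-set sums reduce to the risk-set sizes, which gives a quick sanity check of the implementation.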
A more commonly used approach, when $p$ diverges to $\infty$ as $n \to \infty$, is the lasso estimator, defined as the minimizer of the penalized negative log partial likelihood
$$\hat\beta = \arg\min_{\beta}\,\{\ell_n(\beta) + \lambda_n\|\beta\|_1\},$$
where $\ell_n(\beta)$ is the negative log partial likelihood function, i.e. the negative of (1), and $\lambda_n > 0$ is a tuning parameter to be decided. The first and second order derivatives of $\ell_n(\beta)$ with respect to $\beta$, that is, the score function and the information matrix, are denoted respectively by $\dot\ell_n(\beta)$ and $\ddot\ell_n(\beta)$, which involve
$$\mu_r(t; \beta) = n^{-1}\sum_{j=1}^n 1(Y_j \ge t)\, X_j^{\otimes r}\exp\{X_j^T\beta\}, \quad r = 0, 1, 2.$$
We also define the weighted average covariate vector $\eta_n(t; \beta) = \mu_1(t; \beta)/\mu_0(t; \beta)$. The lasso estimates tend to be more stable because of the penalization. However, as the lasso estimator $\hat\beta$ incurs biases (Javanmard and Montanari, 2014), we consider a debiased lasso approach to remove its bias and draw inference. Analogous to van de Geer et al. (2014) for generalized linear models, we define a debiased lasso estimator
$$b = \hat\beta - \hat\Theta\,\dot\ell_n(\hat\beta), \tag{2}$$
with $-\hat\Theta\,\dot\ell_n(\hat\beta)$ serving as the bias correction term, where $\hat\Theta$ is an estimate of the inverse information matrix. A reliable estimator $\hat\Theta$ is important to ensure the validity of the method. However, existing methods, most of which rely on $\ell_0$ sparsity assumptions on the true inverse information matrix and use the nodewise lasso or CLIME to estimate a sparse $\hat\Theta$, are found to perform poorly for Cox models. Without imposing any sparsity conditions on the inverse information matrix, we propose to estimate each row of $\hat\Theta$ by solving the following quadratic programming problem for $m$ ($j = 1, \ldots, p$):
$$\min_{m \in \mathbb{R}^p}\; m^T\hat\Sigma m \quad \text{subject to} \quad \|\hat\Sigma m - e_j\|_\infty \le \gamma_n, \tag{3}$$
where $\gamma_n \ge 0$ is a tuning parameter, $e_j$ is the vector with one at the $j$th element and zero elsewhere, and the $p \times p$ matrix $\hat\Sigma$ is given in (4). In the end, we obtain $\hat\Theta$ as a $p \times p$ matrix consisting of all $p$ solutions to (3) as its corresponding row vectors. Of note, we use $\hat\Sigma$ in (3) in lieu of $\ddot\ell_n(\hat\beta)$ for theoretical convenience that becomes evident in Section 3.
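The row-wise quadratic program (3) can be sketched numerically as follows. The paper's implementation uses solve.QP from R's quadprog; this is an illustrative Python/scipy version for a single row $j$ on a toy positive definite matrix, with the function name and the value of $\gamma_n$ chosen by us.

```python
import numpy as np
from scipy.optimize import minimize

def qp_row(Sigma, j, gamma):
    """Solve min_m m^T Sigma m subject to ||Sigma m - e_j||_inf <= gamma,
    which yields the j-th row of the inverse-information estimate."""
    p = Sigma.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    # Start from the exact inverse row, which is feasible for any gamma >= 0.
    m0 = np.linalg.solve(Sigma, e_j)
    cons = [
        {"type": "ineq", "fun": lambda m: gamma - (Sigma @ m - e_j)},  # upper bound
        {"type": "ineq", "fun": lambda m: gamma + (Sigma @ m - e_j)},  # lower bound
    ]
    res = minimize(lambda m: m @ Sigma @ m, m0, constraints=cons, method="SLSQP")
    return res.x
```

With $\gamma_n = 0$ the constraint forces $m$ to equal the $j$th row of $\Sigma^{-1}$ exactly; a small positive $\gamma_n$ relaxes this, trading a little bias for lower variance.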
In fact, under the assumptions in Section 3, we do have $\|\hat\Sigma - \ddot\ell_n(\hat\beta)\|_\infty = o_P(1)$ with a desirable convergence rate (see the proof of Theorem 1 in the Appendix), and the numerical difference in the resulting debiased lasso estimators is negligible.
Our approach extends Javanmard and Montanari (2014) from the linear regression setting to survival models. However, as $\eta_n(Y_i; \hat\beta)$ involves all subjects, $\hat\Sigma$ given in (4) is no longer a sum of independent and identically distributed terms, posing additional theoretical difficulties. We address these challenges in our proofs.
Computationally, the proposed (3) can be solved fairly quickly for moderate dimensions, and parallelized for high dimensions, using the R function solve.QP. Our simulations demonstrate its computational efficiency.

Selection of the tuning parameter
Selecting a proper tuning parameter $\gamma_n$ is critical for bias correction in $b$, which can be illustrated by a simulation study. We simulate n = 500 independent subjects, each with p = 100 independent covariates generated from N(0, 1). Only two coefficients in $\beta^0$ in the Cox model are non-zero, taking values of 1 and 0.3. The underlying survival time $T$ follows an exponential distribution with a rate of $\exp(X^T\beta^0)$, and the censoring time is simulated from an exponential distribution with a rate of $0.2\exp(X^T\beta^0)$, resulting in a censoring rate of about 20%. Figure 1 depicts how the estimation bias and the empirical coverage probability from the debiased lasso approach change as $\gamma_n$ ranges from 0 to 1, revealing that $\gamma_n$ within the shaded range would yield desirable inference results.
We have found that, when evaluating cross-validation criteria for choosing $\gamma_n$, directly plugging in the debiased estimates produces highly unstable values, because of the accumulated errors from including the estimates for a large number of noise covariates. Instead, we propose a cross-validation procedure with hard-thresholded debiased estimates: splitting the data randomly into K folds (K = 5 or 10), we leave out the kth fold, obtain a debiased lasso estimate $b^{(k)}$ from the remaining data, hard-threshold it, and plug the thresholded values into the cross-validation criterion. Hard thresholding is based on multiple testing with, for example, the Bonferroni correction. That is, we take the hard-thresholded values to be $b_j^{(k),HT} = b_j^{(k)}$ if $|b_j^{(k)}|$ exceeds $z_{\alpha/(2p)}$ times its estimated standard error, and 0 otherwise, where $z_{\alpha/(2p)}$ is the upper $(\alpha/(2p))$th percentile of N(0, 1), as determined by the asymptotic result given in Theorem 1. Then, letting $\ell^{(k)}$ be the negative log partial likelihood [defined as in (1) but applied to the kth testing set] evaluated at $b^{(k),HT}$, we choose the $\gamma_n$ that gives the smallest cross-validated negative partial likelihood, $\sum_{k=1}^K n^{(k)}\ell^{(k)}$, where $n^{(k)}$ is the number of observations in the kth testing set. Use of an alternative cross-validated partial likelihood (Verweij and van Houwelingen, 1993) gives similar results.

Figure 1: Estimation bias and 95% confidence interval coverage probability for $\beta^0_1 = 1$ with the tuning parameter $\gamma_n \in [0, 1]$ in a simulated example with n = 500 observations and p = 100 independent covariates. The methods in comparison include the proposed debiased lasso with quadratic programming (QP), the maximum partial likelihood estimation (MPLE) and the oracle estimator (Oracle) obtained from fitting the true model.
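The Bonferroni-style hard-thresholding rule can be sketched with the standard library alone; the function name and argument layout are ours, and the standard errors are assumed to come from the diagonal of the estimated inverse information matrix as in Theorem 1.

```python
from statistics import NormalDist

def hard_threshold(b, se, p, alpha=0.1):
    """Hard-threshold a debiased estimate b componentwise:
    keep b_j only if |b_j| > z_{alpha/(2p)} * se_j, else set it to 0."""
    # upper alpha/(2p) standard-normal quantile (Bonferroni-corrected)
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    return [bj if abs(bj) > z * sj else 0.0 for bj, sj in zip(b, se)]
```

For $\alpha = 0.1$ and $p = 100$ the cutoff is $z_{0.0005} \approx 3.29$ standard errors, so only clearly non-null coordinates survive before the cross-validated partial likelihood is evaluated.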

Theoretical results
We infer $c^T\beta^0$ for a loading vector $c \in \mathbb{R}^p$, or $A\beta^0$ for a loading matrix $A \in \mathbb{R}^{l \times p}$, by studying the asymptotic properties of linear combinations of $b$. Denote the expectation of $\mu_r(t; \beta)$ by $\bar\mu_r(t; \beta)$, and define the population-level counterpart of $\eta_n(t; \beta)$ as $\eta_0(t; \beta) = \bar\mu_1(t; \beta)/\bar\mu_0(t; \beta)$, and that of $\hat\Sigma$ in (4) as $\Sigma_{\beta^0}$. We enumerate the sufficient conditions needed for establishing the theoretical properties of the debiased lasso estimator.
Assumption 3. The follow-up time stops at a finite time point $\tau > 0$, where the probability $P(Y \ge \tau)$ is bounded away from zero.

Assumption 4. For any $t \in [0, \tau]$, we assume $c^T\Theta_{\beta^0}\Sigma_{\beta^0}(t)\Theta_{\beta^0}c\,/\,c^T\Theta_{\beta^0}c \to v(t; c)$ for some fixed function $v(\cdot; c) > 0$.
Assumption 5. The matrix $\Sigma_{\beta^0}$ has bounded eigenvalues, i.e. there exist two constants $\zeta_{\min}$ and $\zeta_{\max}$ such that $0 < \zeta_{\min} \le \lambda_{\min}(\Sigma_{\beta^0}) \le \lambda_{\max}(\Sigma_{\beta^0}) \le \zeta_{\max} < \infty$, where $\lambda_{\min}(\Sigma_{\beta^0})$ and $\lambda_{\max}(\Sigma_{\beta^0})$ represent the smallest and the largest eigenvalues of $\Sigma_{\beta^0}$.
It is common in the literature on high-dimensional inference to assume bounded covariates, as in Assumption 1. Fang et al. (2017) and Kong et al. (2018) also posed Assumption 2 for Cox model inference, i.e. uniform boundedness of the multiplicative hazard. Under Assumption 1, Assumption 2 is implied by a bounded overall signal $\|\beta^0\|_1$. Assumption 3 is routinely used for survival models with censored data (Andersen and Gill, 1982). Assumption 4 ensures the convergence of a predictable variation process in the martingale central limit theorem, and thus the asymptotic normality of the debiased lasso estimator. $\Sigma_{\beta^0}(t)$ can be viewed as the information matrix up to time point $t$. It is easy to see that $\Sigma_{\beta^0}(\tau) = \Sigma_{\beta^0}$ and $v(\tau; c) = 1$. This assumption states that the limiting function $v(t; c)$ also depends on $c \in \mathbb{R}^p$, the loading vector of interest, which is reasonable. The bounded eigenvalue condition on $\Sigma_{\beta^0}$ in Assumption 5 is standard in inference for high-dimensional models.
Theorem 1. Under Assumptions 1-5, and assuming further that $\|\Theta_{\beta^0}\|_{1,1}^2\, p\, s_0 \log(p)/\sqrt{n} \to 0$ as $n \to \infty$, for any $c \in \mathbb{R}^p$ such that $\|c\|_2 = 1$ and $\|c\|_1 \le a_*$ with some absolute constant $a_* < \infty$, we have
$$\sqrt{n}\, c^T(b - \beta^0)/(c^T\hat\Theta c)^{1/2} \xrightarrow{D} N(0, 1).$$
Theorem 1 provides the foundation for drawing inference on the regression coefficients. In the following, Corollary 2(i) discusses the type I error and the power of testing $H_0: c^T\beta^0 = a_0$ based on Theorem 1, and Corollary 2(ii) ensures that the corresponding confidence interval achieves the nominal coverage probability asymptotically.
Corollary 2. Suppose that the assumptions in Theorem 1 hold.
(i) To test the null hypothesis $H_0: c^T\beta^0 = a_0$ versus the alternative $H_1: c^T\beta^0 = a_1$, where $a_1 \neq a_0$, with a known $c \in \mathbb{R}^p$ and constant $a_0 \in \mathbb{R}$, let the test statistic be
$$T = \sqrt{n}\,(c^T b - a_0)/(c^T\hat\Theta c)^{1/2}.$$
We construct the test function $\phi(T) = 1(|T| > z_{\alpha/2})$, where $z_{\alpha/2}$ is the upper $(\alpha/2)$th quantile of N(0, 1). Then the type I error rate of the test satisfies $P(\phi(T) = 1 \mid H_0) \to \alpha$, and the power under the alternative tends to one. (ii) The two-sided level $\alpha$ confidence interval for $c^T\beta^0$ can be constructed as
$$\left[\, c^T b - z_{\alpha/2}(c^T\hat\Theta c/n)^{1/2},\; c^T b + z_{\alpha/2}(c^T\hat\Theta c/n)^{1/2} \,\right].$$
With Theorem 1 and the Cramér-Wold device, we can also conduct simultaneous inference on multiple linear combinations, i.e. $A\beta^0$ for some $l \times p$ matrix $A$, as summarized in Theorem 3 below, with Assumption 4 replaced by its multivariate version, Assumption 6. Similarly, Corollary 4 provides the asymptotic results for hypothesis testing and confidence regions in this setting.

Assumption 6. Let $\Sigma_{\beta^0}(t)$ be the same as in Assumption 4. For a fixed combination matrix of interest $A$, the convergence in Assumption 4 holds with $c$ replaced by $A^T\omega$, for any vector $\omega \in \mathbb{R}^l$ and any $t \in [0, \tau]$, where $v(\cdot; A^T\omega) > 0$ is some fixed function depending on $A^T\omega$.
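The Wald test and confidence interval of Corollary 2 are simple arithmetic once $b$ and $\hat\Theta$ are available; here is a minimal stdlib sketch (function name ours) assuming $c$, $b$ and $\Theta$ are given as plain lists.

```python
import math
from statistics import NormalDist

def wald_inference(c, b, Theta, n, a0=0.0, alpha=0.05):
    """Wald statistic T = sqrt(n)(c^T b - a0)/(c^T Theta c)^(1/2) and the
    two-sided level-alpha confidence interval for c^T beta^0."""
    p = len(c)
    est = sum(ci * bi for ci, bi in zip(c, b))                     # c^T b
    var = sum(c[i] * Theta[i][j] * c[j]
              for i in range(p) for j in range(p)) / n             # c^T Theta c / n
    se = math.sqrt(var)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    T = (est - a0) / se
    return T, (est - z * se, est + z * se)
```

Rejecting when $|T| > z_{\alpha/2}$ is equivalent to the interval excluding $a_0$, so the test and the interval are dual to each other.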
Theorem 3. Let $A$ be an $l \times p$ matrix of full row rank such that the number of rows $l$ is fixed, $\|A\|_{\infty,\infty} = O(1)$ and $A\Theta_{\beta^0}A^T \to F$ for some fixed $l \times l$ matrix $F$. Assume that the two tuning parameters $\lambda_n$ and $\gamma_n$ satisfy the same conditions as in Theorem 1. Then, under Assumptions 1-3, 5 and 6, we have
$$\sqrt{n}\,(Ab - A\beta^0) \xrightarrow{D} N(0, F).$$
Corollary 4. Suppose the assumptions in Theorem 3 hold.
(i) For the $l \times p$ matrix $A$ in Theorem 3, under the null hypothesis $H_0: A\beta^0 = a_0$, the Wald-type statistic $n\,(Ab - a_0)^T(A\hat\Theta A^T)^{-1}(Ab - a_0)$ converges in distribution to $\chi^2_l$, yielding an asymptotically valid level $\alpha$ test and confidence region. Proofs of Theorems 1 and 3 are provided in the Appendix. Corollaries 2 and 4 follow directly from Theorems 1 and 3, and their proofs are omitted.

Numerical experiments
For a total of n = 500 subjects, we simulate p = 20, 100, 200 covariates, respectively, generating them from $N(0, \Sigma)$ under two setups: $\Sigma = I_p$, and $\Sigma$ with an AR(1) structure with correlation parameter 0.5. Each covariate is truncated at ±2.5. Concerning the specification of the true regression coefficients $\beta^0$, the first element $\beta^0_1$ varies from 0 to 2 with an equal step size of 0.2, four of the other elements are arbitrarily chosen to take values of 1, 1, 0.5 and 0.5, and the rest are set to zero. The underlying survival times T and the censoring times C are independently generated from an exponential distribution with hazard $h(t|X) = \exp\{X^T\beta^0\}$ and from Uniform(1, 20), respectively. Under each simulation configuration, 200 datasets are generated.
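The data-generating mechanism above can be sketched as follows; this is our own illustrative implementation (function name, seed handling and the default coefficient vector are ours), using inverse-transform sampling for the exponential event times.

```python
import numpy as np

def simulate_cox(n=500, p=100, beta0=None, seed=0):
    """Simulate (Y, X, delta) as in Section 4: N(0, I_p) covariates truncated
    at +/-2.5, exponential event times with hazard exp(X^T beta0), and
    Uniform(1, 20) censoring times."""
    rng = np.random.default_rng(seed)
    if beta0 is None:
        beta0 = np.zeros(p)
        beta0[:5] = [1.0, 1.0, 1.0, 0.5, 0.5]   # one example configuration
    X = np.clip(rng.standard_normal((n, p)), -2.5, 2.5)
    # Exponential with rate exp(X^T beta0) has scale exp(-X^T beta0).
    T = rng.exponential(scale=np.exp(-X @ beta0))
    C = rng.uniform(1.0, 20.0, size=n)
    Y = np.minimum(T, C)                        # observed time
    delta = (T <= C).astype(int)                # event indicator
    return Y, X, delta
```

The same generator, with an AR(1) covariance matrix passed to a multivariate normal draw, would cover the correlated design in the paper.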
For the lasso estimator, we use 10-fold cross-validation to select the tuning parameter $\lambda_n$. Five-fold cross-validation is used for tuning parameter selection in CLIME, the proposed quadratic programming method (QP) and the nodewise lasso (NW). For the hard-thresholding step used to select $\gamma_n$, as described in Section 2.3, we adopt the Bonferroni correction with the adjusted p-value threshold 0.1/p, where p is the number of covariates.
We compare these methods with respect to the bias of the estimated $\beta^0_1$ (the parameter of main interest), its model-based standard error, the coverage probability of 95% confidence intervals (α = 0.05) and the mean squared error. Figures 2 and 3 show the results for the independent and the AR(1) covariance structures, respectively. When p = 20, our proposed method (QP) and the decorrelated Wald test (Decor) perform nearly as well as the oracle estimator (Oracle) and MPLE. When the dimension is relatively large compared to the sample size, i.e. p = 100, 200, the proposed estimator (QP) displays, next to Oracle, the smallest biases and the confidence intervals with coverage probabilities closest to the nominal level of 95% for both covariance structures. On the other hand, NW, CLIME, Decor and MPLE incur substantial biases as the true value of $\beta^0_1$ increases. In addition, owing to the estimation of $\Theta_{\beta^0}$ with penalized approaches, the model-based standard error estimates from NW and CLIME are shrunk towards zero, underestimating the true variation. As such, the four competing methods all present improper confidence interval coverage probabilities, whereas our proposed method retains nearly unbiased estimates with coverage probabilities close to the nominal level.
We next compare the time spent on computing $\hat\Theta$ alone (Table 1) among solve.QP in the R package quadprog for the proposed quadratic programming procedure and two commonly used R functions for CLIME, namely clime in the package clime and sugm in the package flare. Three candidate values of $\gamma_n$, namely 0.3, 1 and 2 times log(p)/n, are used for demonstration. We fix $\beta^0_1 = 1$ and simulate n = 500 observations, with covariates having an AR(1) covariance structure and the rest of the setting identical to that described in the first paragraph of this section. The time columns in Table 1 report the average computing time over 10 replications on a MacBook with a 2.7 GHz Intel Core i5 processor and 8 GB memory, and the ratio columns compare the average computing time of each procedure to that of solve.QP for each simulation setting. Under all of the scenarios examined, our proposed implementation with solve.QP is the most computationally efficient; for large dimensions, e.g. p = 200, clime takes the longest time per dataset on average.

Boston lung cancer data analysis
Lung cancer is the leading cause of cancer death in the United States, and non-small cell lung cancer (NSCLC), accounting for approximately 80% to 85% of all lung cancer cases, is the most common histological type of lung cancer (Houston et al., 2018). Identification of genetic variants associated with lung cancer patient survival has sparked modern translational cancer research, and has the potential to refine prognosis and promote individualized treatment and clinical care. Despite numerous studies investigating genes potentially predisposing to lung cancer risk, studies of patient survival usually have small sample sizes, and the reported genetic markers associated with lung cancer survival have been poorly replicated (Bossé and Amos, 2018). The Boston Lung Cancer Survival Cohort (BLCSC) is a large epidemiology cohort for investigating the molecular causes underlying lung cancer, in which lung cancer cases have been enrolled at Massachusetts General Hospital and the Dana-Farber Cancer Institute from 1992 to the present. We apply the proposed debiased lasso method (QP) to a BLCSC cohort with genetic data and simultaneously investigate the joint effects of genotyped SNPs on NSCLC patient overall survival.
Included in the analysis are n = 561 NSCLC patients with available diagnosis dates, follow-up times and genotypes on Axiom arrays. Among these patients, 437 (77.9%) died and 124 (22.1%) were censored. The observed survival times range from 6 days to 8,584 days, and the restricted mean survival and censoring times at τ = 8,584 days are 2,124 (SE: 105) and 4,397 (SE: 187) days, respectively. Patient characteristics, including age at diagnosis, race, education level, gender, smoking status, histological type, cancer stage and treatment received, are provided in the online supplementary materials.
A conventional marginal association analysis (Tang et al., 2020) found two potentially functional SNPs, in HDAC2 and PPARGC1A, associated with NSCLC survival. Using the target gene approach, we focus on 32 genes in the CARM ER pathway, the largest pathway that Tang et al. (2020) considered and described in their supplementary document, which contains the two reported genes HDAC2 and PPARGC1A, plus 9 genes that Xia et al. (2020) studied to investigate whether susceptibility loci are also associated with patient survival. We extract 312 genotyped SNPs in the 32 genes in the CARM ER pathway and the nine target genes described in Xia et al. (2020) from the BLCSC data (minor allele frequency > 0.01, genotype call rate > 95%). After a pruning step using PLINK (Purcell et al., 2007) to avoid multicollinearity caused by SNPs in high linkage disequilibrium, the number of SNPs is reduced to 217. SNPs are coded by the number of copies of the minor allele, i.e. 0, 1 or 2, and are assumed to have additive effects on the log hazard ratio. Therefore, the subset of the BLCSC data we analyze includes n = 561 NSCLC patients and p = 231 covariates. Table 2 summarizes the coefficient estimates in the Cox proportional hazards model for all patient characteristics and the top ten SNPs ranked by the p-values from the proposed method (QP). Results of the two methods, QP and MPLE, are listed side by side. In general, QP yields point estimates of smaller magnitude and smaller standard errors than MPLE, consistent with our observations in the simulated examples. MPLE is numerically very unstable when the dimension p is large compared to the sample size n; the instability arises primarily from inverting the Hessian matrix, which may be near-singular. In contrast, the lasso provides a more stable initial estimator, and as a result the debiased lasso estimator is numerically more stable than MPLE, with narrower confidence intervals. When the dimension p is very small, the difference between the two methods becomes
negligible. Among the various patient characteristics, QP found that the adenocarcinoma subtype is significantly associated with better patient survival than large cell carcinoma, consistent with the results of Janssen-Heijnen and Coebergh (2001); this association was not detected by MPLE. QP further identified AX-11672686 in CHRNA2, AX-11673610 in GRIP2 and AX-11264571 in BRCA2 as the three most significant SNPs associated with NSCLC patient survival, after adjusting for all the other demographic and genetic risk factors. Interestingly, AX-11672686 was found to be associated with nicotine dependence by Wang et al. (2014). AX-11264571 has been found to be associated with breast cancer (Qiu et al., 2010) and may also be associated with lung cancer susceptibility, although it did not achieve genome-wide significance in Yu et al. (2011). AX-11673610 in GRIP2 appears to be a new finding as, to our knowledge, it has not yet been reported in the lung cancer literature (Bossé and Amos, 2018). To understand the impact of socioeconomic status on cancer survival, we test for the association between education level (no high school, high school, or at least 1-2 years of college) and lung cancer patient survival. With a loading matrix $A_{2 \times p} = (e_2, e_3)^T$ corresponding to the contrast of the effects of high school graduation and of at least 1-2 years of college against the reference level of no high school, the test statistic is 0.259 with a p-value of 0.879, suggesting no statistical evidence for an association between education level and NSCLC patient survival after adjusting for all other demographic characteristics and genetic markers. The results agree with a large-scale clinical trial on lung cancer patients which reported that "education level was not predictive of survival" (Herndon et al., 2008).
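The p-value of the education-level test can be checked numerically: under Corollary 4 the Wald-type statistic is asymptotically $\chi^2$ with l = 2 degrees of freedom, and the $\chi^2$ survival function has a closed form for even degrees of freedom. A minimal stdlib sketch (function name ours):

```python
import math

def chi2_sf(x, df):
    """Survival function P(chi^2_df > x) for even df, using the closed form
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    k = df // 2
    assert df == 2 * k, "closed form requires even df"
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= (x / 2.0) / (i + 1)
    return total * math.exp(-x / 2.0)
```

Evaluating `chi2_sf(0.259, 2)` gives approximately 0.879, matching the p-value reported above for the 2-degree-of-freedom contrast.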
In summary, these results illustrate the utility of our method in providing reliable inference for scientific discovery and interpretation, while more in-depth biological investigations are warranted to validate our findings.

Concluding remarks
We have proposed a debiased lasso approach for reliable estimation and inference in the Cox proportional hazards model when p < n but p is allowed to diverge to ∞ with n. Unlike existing methods (Fang et al., 2017; Yu et al., 2018; Kong et al., 2018), we resort to a quadratic programming procedure to estimate the inverse information matrix, without imposing an unrealistic sparsity assumption on it. The proposed debiased lasso estimator is asymptotically unbiased and normally distributed under mild regularity conditions. Our simulations demonstrate that, when p is very small, the proposed method behaves similarly to the conventional MPLE; when p is relatively large, it outperforms the competitors in bias correction and confidence interval coverage.
Lastly, we touch upon the important issue of drawing inference when p > n, though this is not a main focus of this paper. First, several methods (Fang et al., 2017; Yu et al., 2018; Kong et al., 2018) have been developed for handling "p > n" inference problems; however, our analytical and simulation studies have pinpointed their possible limitations in providing sufficient bias correction and reliable confidence intervals even within the "large n, diverging p" framework, likely because the sparsity assumptions on the inverse information matrix may not hold in survival settings. One possible solution, going beyond the debiased lasso framework, is to perform repeated data splitting, with model selection and estimation on two separate parts of the data, and to smooth the resulting estimates across multiple splits; see Fei and Li (2021) for inference on high-dimensional generalized linear models. The validity of that method hinges on the sure screening property of the initial model selection, and we will explore its use in the survival setting in future work.
Here $M_i(t; \beta^0)$ denotes the usual counting-process martingale with respect to the filtration $\mathcal{F}_i(t) = \sigma\{N_i(s), 1(Y_i \ge s), X_i : s \in (0, t]\}$. It follows that $\eta_n(t; \beta)$, and in particular $\eta_n(t; \beta^0)$, is predictable with respect to the filtration $\mathcal{F}(t) = \sigma\{N_i(s), 1(Y_i \ge s), X_i : s \in (0, t], i = 1, \ldots, n\}$, an observation useful for the derivations. Notation-wise, we do not distinguish between the usual expectation and the outer expectation. Lemma A1 below characterizes the difference between $\eta_n(t; \beta^0)$ and $\eta_0(t; \beta^0)$, which facilitates the proof of the asymptotic distribution of the leading term $\sqrt{n}\, c^T\Theta_{\beta^0}\dot\ell_n(\beta^0)$ as well as the establishment of the convergence rate for $\hat\Sigma - \Sigma_{\beta^0}$.
Lemma A1. Under Assumptions 1-3, we have $\sup_{t \in [0,\tau]} \|\eta_n(t; \beta^0) - \eta_0(t; \beta^0)\|_\infty = O_P(\sqrt{\log(p)/n})$.

Lemma A2 establishes the asymptotic distribution of the leading term $\sqrt{n}\, c^T\Theta_{\beta^0}\dot\ell_n(\beta^0)$.

Lemma A2. Assume $p^2\log(p)/n \to 0$. Under Assumptions 1-5, for any $c \in \mathbb{R}^p$ such that $\|c\|_2 = 1$ and $\|c\|_1 \le a_*$, the leading term $\sqrt{n}\, c^T\Theta_{\beta^0}\dot\ell_n(\beta^0)$ is asymptotically normal.

Lemma A3 provides theoretical properties of the lasso estimator in the Cox model. This is a direct result from Theorem 1 in Kong and Nan (2014), and thus the proof is omitted.
Lemma A4. Under Assumptions 1-5, if $\lambda_n \asymp \sqrt{\log(p)/n}$, then with probability going to 1, $\Theta_{\beta^0}$ is feasible for (3) with $\gamma_n \asymp \|\Theta_{\beta^0}\|_{1,1}\sqrt{\log(p)/n}$. Lemma A4 shows that, unlike in a linear regression model, where the tuning parameter in the constraint takes the order of $\sqrt{\log(p)/n}$, the Cox model requires a potentially larger $\gamma_n$ for the feasibility of $\Theta_{\beta^0}$, depending on $\|\Theta_{\beta^0}\|_{1,1}$, because the information matrix involves the regression coefficients.
Proof of Theorem 1. The first-order Taylor expansion of $\dot\ell_{nj}(\hat\beta)$, the $j$th component of $\dot\ell_n(\hat\beta)$, at $\beta^0$ is
$$\dot\ell_{nj}(\hat\beta) = \dot\ell_{nj}(\beta^0) + \ddot\ell_{nj}(\tilde\beta^{(j)})^T(\hat\beta - \beta^0), \tag{A1}$$
where $\tilde\beta^{(j)}$ lies between $\hat\beta$ and $\beta^0$, and $\ddot\ell_{nj}(\beta)$ denotes the $j$th column of the Hessian matrix $\ddot\ell_n(\beta)$. Let the $p \times p$ matrix $B_n = (\ddot\ell_{n1}(\tilde\beta^{(1)}), \ldots, \ddot\ell_{np}(\tilde\beta^{(p)}))^T$. Suppose $c \in \mathbb{R}^p$ is a $p$-dimensional vector and the parameter of interest is $c^T\beta^0$. Plugging (A1) into (2), we have
$$\sqrt{n}\,c^T(b - \beta^0) = -\sqrt{n}\,c^T\Theta_{\beta^0}\dot\ell_n(\beta^0) - \sqrt{n}\,c^T(\hat\Theta - \Theta_{\beta^0})\dot\ell_n(\beta^0) + \sqrt{n}\,c^T(I_p - \hat\Theta B_n)(\hat\beta - \beta^0). \tag{A2}$$
The first term in (A2) is the leading part and is asymptotically normal, as shown in Lemma A2; the other terms will be proved to be asymptotically negligible. First, we show that $\sqrt{n}\,c^T(\hat\Theta - \Theta_{\beta^0})\dot\ell_n(\beta^0) = o_P(1)$, which follows from Lemmas A5 and A6. Second, we show that $\sqrt{n}\,c^T(I_p - \hat\Theta B_n)(\hat\beta - \beta^0) = o_P(1)$.

Next, we establish the convergence rate of $\|\hat\Sigma - \Sigma_{\beta^0}\|_\infty$ via the decomposition (A4). By the proof of Lemma A4 and arguments similar to those in the proof of Lemma A1, we can bound $\sup_{t\in[0,\tau]} \|\mu_2(t; \hat\beta) - \bar\mu_2(t; \beta^0)\|_\infty$ for the first term of (A4). The second term of (A4) is, by Assumption 1 and Lemma A1, a sum of $n$ independent and identically distributed mean-zero bounded terms, so Hoeffding's concentration inequality applies. The third term of (A4) can be bounded analogously. Combining the three bounds in (A4) yields the desired rate for $\|\hat\Sigma - \Sigma_{\beta^0}\|_\infty$.

For the $(j,k)$th element of $\ddot\ell_n(\beta)$, denoted $\ddot\ell_{njk}(\beta)$, the mean value theorem gives $\ddot\ell_{njk}(\tilde\beta^{(j)}) - \ddot\ell_{njk}(\beta^0) = \{\partial\ddot\ell_{njk}(\beta)/\partial\beta\,|_{\beta = \check\beta^{(jk)}}\}^T(\tilde\beta^{(j)} - \beta^0)$, where $\check\beta^{(jk)}$ lies in the segment between $\tilde\beta^{(j)}$ and $\beta^0$. Under Assumptions 1-3, when $\|\beta - \beta^0\|_1 \le \delta$ for $\delta > 0$ small enough, $\|\partial\ddot\ell_{njk}(\beta)/\partial\beta\|_\infty$ is bounded uniformly over all $(j,k)$ by some constant depending on $\delta$; this controls $B_n - \ddot\ell_n(\beta^0)$.

We then show that the variance estimator is consistent, i.e. $c^T(\hat\Theta - \Theta_{\beta^0})c \to_P 0$ as $n \to \infty$. Finally, by the arguments above and Slutsky's theorem, it holds that $\sqrt{n}\,c^T(b - \beta^0)/(c^T\hat\Theta c)^{1/2} \xrightarrow{D} N(0, 1)$.
Proof of Theorem 3. We prove Theorem 3 using the Cramér-Wold device. For any $\omega \in \mathbb{R}^l$, where the dimension $l$ is a fixed integer free of $n$ and $p$, let $c = A^T\omega$ in Theorem 1. Essentially, we only require that $\|c\|_1 = \|A^T\omega\|_1$ be upper bounded; it is not essential to force $\|c\|_2 = 1$. The conclusion then follows since $\|A\|_{\infty,\infty} = O(1)$ by assumption.

Supplementary Materials for "Statistical Inference for Cox Proportional Hazards Models with a Diverging Number of Covariates"
We provide detailed proofs for the lemmas presented in the Appendix of the article, as well as patient characteristics of the Boston Lung Cancer Study Cohort data analyzed in Section 5.

S1 Technical proofs for the lemmas
Lemma A1 characterizes the difference between $\eta_n(t;\beta^0)$ and $\eta_0(t;\beta^0)$, which is needed to prove the asymptotic distribution of the leading term $\sqrt{n}\,c^T\Theta_{\beta^0}\dot{\ell}_n(\beta^0)$, as well as to establish the convergence rate of $\widehat{\Sigma} - \Sigma_{\beta^0}$.
For the first statement, the bracketing number converges, and moreover, by Theorem 2.14.9 of van der Vaart and Wellner (1996) with $V = 2$, a tail bound holds for every $s > 0$ with a constant $D > 0$ that depends only on $K_1$; setting $s = 2\log(p)$ implies the stated rate. For the second statement, we consider classes of functions of $(x, y)$ indexed by $t$. Again by Theorem 2.14.9 of van der Vaart and Wellner (1996) with $V = 2$, for every $s > 0$ we obtain a tail bound with a constant $D$ that depends only on $K$ and $K_1$, where $\widehat{\mu}_{1k}$ and $\mu_{1k}$ are the $k$th components of $\widehat{\mu}_1$ and $\mu_1$, respectively. Taking $s = 2\log(p)$ completes the proof of the $\sup_{t\in[0,\tau]}$ bound for $\mu_1(t;\cdot)$. Finally, rewriting the remaining difference and using the almost-sure boundedness of its components gives the last claim.

Lemma A2 establishes the asymptotic distribution of the leading term $\sqrt{n}\,c^T\Theta_{\beta^0}\dot{\ell}_n(\beta^0)$. Combining (S2) and (S3), uniformly for all $t \in [0,\tau]$, the predictable variation process $U(t)$ of the associated martingale satisfies $U(t) \to_P v(t;c)$ by Assumption 4. Next we check the Lindeberg condition: for any $\epsilon > 0$, we define the truncated process together with its predictable variation process and show that both vanish asymptotically. Finally, the asymptotic normality follows from the martingale central limit theorem.
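For orientation, a sketch of the standard counting-process machinery behind Lemma A2 (the usual Andersen-Gill setup; the notational details below are assumptions on our part, not quoted from the source):

```latex
% With N_i the counting process, Y_i the at-risk indicator, and
%   M_i(t) = N_i(t) - \int_0^t Y_i(s)\, e^{X_i^T \beta^0}\, d\Lambda_0(s)
% the individual martingales, the score (for a negative log partial
% likelihood) is the stochastic integral
\dot{\ell}_n(\beta^0)
  = -\frac{1}{n} \sum_{i=1}^n \int_0^\tau
      \Bigl\{ X_i - \frac{\mu_1(t;\beta^0)}{\mu_0(t;\beta^0)} \Bigr\}\, dM_i(t)
% so that \sqrt{n}\, c^T \Theta_{\beta^0} \dot{\ell}_n(\beta^0) is itself a
% martingale integral; its predictable variation converges to v(\tau; c),
% the truncation argument verifies the Lindeberg condition, and the
% martingale central limit theorem applies.
```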
Lemma A3 provides the theoretical properties of the lasso estimator in the Cox model. This is a direct result of Theorem 1 in Kong and Nan (2014), and thus the proof is omitted.
Lemma A4. Under Assumptions 1-5, if $\lambda_n \asymp \sqrt{\log(p)/n}$, then with probability going to 1, $\Theta_{\beta^0}$ is feasible for the proposed optimization with tuning parameter $\gamma_n$. Lemma A4 shows that, unlike in linear models, where the tuning parameter in the constraint takes the order $\sqrt{\log(p)/n}$, the Cox model requires a potentially larger $\gamma_n$, one that depends on $\|\Theta_{\beta^0}\|_{1,1}$, for the feasibility of $\Theta_{\beta^0}$, because the information matrix involves the regression coefficients.
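For concreteness, a minimal sketch of one column-wise program of the kind discussed here: minimize $m^T\widehat{\Sigma}m/2$ subject to $\|\widehat{\Sigma}m - e_j\|_\infty \le \gamma_n$, whose feasibility for the columns of $\Theta_{\beta^0}$ is what Lemma A4 guarantees. This is a hypothetical scipy-based stand-in for the paper's solve.QP implementation; the function name, the toy matrix, and the exact objective normalization are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def qp_column(Sigma_hat, j, gamma_n):
    """Approximate the j-th column of the inverse of Sigma_hat by the QP
        minimize (1/2) m' Sigma_hat m
        subject to |(Sigma_hat m - e_j)_k| <= gamma_n for every k.
    """
    p = Sigma_hat.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    # The sup-norm constraint becomes 2p linear inequality constraints.
    cons = [{"type": "ineq",
             "fun": lambda m, k=k, s=s: gamma_n - s * (Sigma_hat[k] @ m - e_j[k])}
            for k in range(p) for s in (1.0, -1.0)]
    # A ridge-regularized solve gives a strictly feasible starting point here.
    x0 = np.linalg.solve(Sigma_hat + 0.1 * np.eye(p), e_j)
    res = minimize(lambda m: 0.5 * m @ Sigma_hat @ m,
                   x0, jac=lambda m: Sigma_hat @ m,
                   constraints=cons, method="SLSQP")
    return res.x

# Toy check on a well-conditioned 4 x 4 "information" matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma_hat = A @ A.T + 4 * np.eye(4)
m = qp_column(Sigma_hat, j=0, gamma_n=0.05)
# sup-norm residual; should not exceed gamma_n up to solver tolerance
print(np.max(np.abs(Sigma_hat @ m - np.eye(4)[:, 0])))
```

Solving this program once per column $j = 1, \ldots, p$ assembles the full approximation to the inverse information matrix; Lemma A4 says the constraint set is nonempty (it contains the corresponding column of $\Theta_{\beta^0}$) with probability tending to 1 when $\gamma_n$ is taken large enough.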

S2 Boston Lung Cancer Study Cohort data
Table 3 shows the patient characteristics for the subset of the Boston Lung Cancer Study Cohort data analyzed in Section 5.
For two positive sequences $\{d_n\}$ and $\{g_n\}$, we define $d_n \asymp g_n$ if there exist two bounded positive constants $C$ and $C'$ such that $C \le d_n/g_n \le C'$.
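For instance (a worked example of this notation, not taken from the source):

```latex
d_n = 2n + \sqrt{n}, \quad g_n = n
\;\Longrightarrow\; 2 < d_n/g_n \le 3
\;\Longrightarrow\; d_n \asymp g_n \ (\text{take } C = 2,\ C' = 3),
\qquad \text{whereas } n\log n \not\asymp n .
```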

Figure 2: Estimation bias, coverage probability, model-based standard error, and mean squared error for the six estimators in comparison: QP (solid green lines), NW (short-dash navy blue lines), CLIME (dotted red lines), Decor (dot-dash pink lines), Oracle (long-dash orange lines), and MPLE (two-dash light blue lines), based on 200 simulations, each with n = 500 observations and an independent covariance structure for the covariates.

Table 1: Comparison of the computational time spent on computing $\widehat{\Theta}$. Time (in seconds) is averaged over 10 replications under each setting. The time ratio is relative to the proposed method implemented using solve.QP.
SNPs in the genes HDAC2 and PPARGC1A were significantly associated with NSCLC overall survival.

Table 2: Coefficient estimates in the Cox proportional hazards model for the Boston Lung Cancer Study data. SE: standard error estimate; CI: confidence interval; HS: high school; AD: adenocarcinoma; SCC: squamous cell carcinoma; LCC: large cell carcinoma; Pos: physical location based on Assembly GRCh37/hg19.

Table 3: Characteristics of n = 561 patients in the Boston Lung Cancer Study for survival analysis. a: Stages I and II are classified as early stage, and stages III and IV as late stage. b: No treatment information on surgery, chemotherapy, or radiation is available for these patients.