Identification of biomarker‐by‐treatment interactions in randomized clinical trials with survival outcomes and high‐dimensional spaces

Stratified medicine seeks to identify biomarkers or parsimonious gene signatures distinguishing patients that will benefit most from a targeted treatment. We evaluated 12 approaches in high‐dimensional Cox models in randomized clinical trials: penalization of the biomarker main effects and biomarker‐by‐treatment interactions (full‐lasso, three kinds of adaptive lasso, ridge+lasso and group‐lasso); dimensionality reduction of the main effect matrix via linear combinations (PCA+lasso (where PCA is principal components analysis) or PLS+lasso (where PLS is partial least squares)); penalization of modified covariates or of the arm‐specific biomarker effects (two‐I model); gradient boosting; and univariate approach with control of multiple testing. We compared these methods via simulations, evaluating their selection abilities in null and alternative scenarios. We varied the number of biomarkers, of nonnull main effects and true biomarker‐by‐treatment interactions. We also proposed a novel measure evaluating the interaction strength of the developed gene signatures. In the null scenarios, the group‐lasso, two‐I model, and gradient boosting performed poorly in the presence of nonnull main effects, and performed well in alternative scenarios with also high interaction strength. The adaptive lasso with grouped weights was too conservative. The modified covariates, PCA+lasso, PLS+lasso, and ridge+lasso performed moderately. The full‐lasso and adaptive lassos performed well, with the exception of the full‐lasso in the presence of only nonnull main effects. The univariate approach performed poorly in alternative scenarios. We also illustrate the methods using gene expression data from 614 breast cancer patients treated with adjuvant chemotherapy.


Introduction
In randomized controlled trials (RCT), the treatment benefit is often measured as an average across the study population but, in the era of stratified medicine, an increasing interest is devoted to identify patients more likely to benefit from the treatment. Unaccounted biomarker-by-treatment interactions can dramatically lower the statistical power (Betensky et al., 2002), which may be one of the reasons why RCTs have often failed to show a benefit of the drug in oncology (Buyse and Michiels, 2013). Hence, it is important to identify in RCTs (prospectively or retrospectively) these biomarker-by-treatment interactions, also called treatment-effect modifiers or predictive biomarkers (Michiels et al., 2007;Royston and Sauerbrei, 2008;Buyse et al., 2013). An example is the phase III IPASS trial that showed an overall benefit on progression-free survival (PFS) of gefitinib as compared to carboplatin plus paclitaxel in advanced nonsmall cell lung cancer patients (Mok et al., 2009), with a high benefit in patients with epidermal growth factor receptor (EGFR) mutation and a harm in patients without it. Another treatment-effect modifier is the Kirsten rat sarcoma (KRAS) mutation for the effect on PFS of anti-EGFR monoclonal antibodies in advanced colorectal cancer (Amado et al., 2008). Recently, also gene signatures (i.e., combination of multiple treatment-effect modifiers) have been proposed: for example, an 8-gene and a 14-gene signature for trastuzumab in early breast cancer (Pogue-Geile et al., 2013;Perez et al., 2015) and a 84-gene signature for MAGE-A3 immunotherapy in melanoma and nonsmall cell lung cancer (Ulloa-Montoya et al., 2013).
From a statistical viewpoint, Rothwell (2005) put forward that the only reliable approach for assessing the predictiveness of biomarkers is to test their interaction with the treatment. Thus, the general framework for identifying treatment-effect modifiers is a model with the main effects of both the treatment and the biomarkers, and the biomarker-by-treatment interactions. However, with genomic biomarkers, high dimensionality (much more biomarkers p than sample size n) often makes the model nonidentifiable. We propose several methods to select a sparse set of treatment-effect modifiers among a large number of candidates in an RCT in the framework of Cox regression models (Cox, 1972). In Section 2, we describe 12 possible approaches. In Section 3, we evaluate via simulations their selection performance and we consider a novel measure to evaluate the biomarker-by-treatment interaction strength of the developed gene signatures. In Section 4, we present an analysis of publicly available gene expression data in breast cancer. In Section 5, we discuss the findings.

Penalties on the full biomarker-by-treatment interaction model
In a proportional hazards regression model, the full biomarker-by-treatment interaction model is with α, β, and γ the regression coefficients for the treatment T (+0.5 the experimental and −0.5 the control arm), the standardized biomarkers X i , i = 1, . . . , p, and their interactions X i T , respectively. Of note, the second sum in (1) is the component estimating the biomarker-dependent treatment effect. Semiparametric estimates (Cox, 1972) of the regression parameters are obtained by maximizing the partial log-likelihood l (β). However, in a high-dimensional setting (2p + 1 n), the model is nonidentifiable. To overcome this issue, penalized regression maximizes the penalized log-likelihood l (β, γ ) − p(λ, β, γ ), by adding a penalty p(λ, β, γ ) such as the ridge (Hoerl and Kennard, 1970) or the lasso (Tibshirani, 1996(Tibshirani, , 1997. In all cases that we consider here, penalization is applied to (β i , γ i ) whereas α remains unpenalized.
The first penalization considered uses the lasso penalty for both the β i 's and the γ i 's to perform variable selection and hence, identify treatment-effect modifiers: The main effects and the interactions are equally penalized (same shrinkage parameter λ). We call this approach full-lasso throughout the paper. This approach lacks the hierarchy constraint: the main C 2016 The Authors. Biometrical Journal Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com effect of a biomarker can be discarded (β i = 0) irrespective of whether the associated interaction γ i is. Although this can affect the interpretability of γ i and the calibration of the model (Bien et al., 2013), this is of minor importance in the context of selection.
Despite the simplicity of this method, main effects and interactions can have very different sizes. It seems more appropriate to penalize them unequally: using differently weighted penalties for the β i 's and the γ i 's, the approach is similar to the adaptive lasso (alasso), which penalizes large coefficients less than smaller ones to stress differences between them (Zou, 2006;Zhang and Lu, 2007). In this spirit, we estimate weights in a preliminary model including the treatment and all the biomarker main effects β Ri and interactions with the treatment γ Ri , with a ridge penalty on the β Ri 's and the γ Ri 's Ri to control their magnitude. In a second stage, two strategies to estimate the weights are considered: one with biomarker-specific (alasso (Sw)) and one with grouped (alasso (Gw)) weights. The alasso (Sw) strategy estimates weights as the inverse of the absolute ridge coefficients: The alasso (Gw) strategy estimates a common weight for all β i 's and one for all γ i 's as their average: In order to force the hierarchy constraint, we considered a further approach (ridge+lasso) with a ridge penalty on the main effects, which are then all kept in the model while controlling for overfitting: Because optimization of both λ and λ 2 is computationally demanding and weakens generalizability, we favor precise optimization of λ with rough selection of λ 2 . In the context of the elastic net, Zou and Hastie (2005) estimated λ for a grid of given λ 2 and chose the best (λ, λ 2 ). In our case, we first estimate the β i 's in a model without interactions and then fixed them via an offset in the final model (1). Sensitivity analyses showed no relevant difference between the two strategies (data not shown).
To the aim of forcing the hierarchy constraint while performing selection on both the β i 's and the γ i , we also consider a group-lasso approach (Yuan and Lin, 2006), which selects prespecified groups of variables. In our context, p groups (β i , γ i ) are defined:

The modified covariates approach
Rather than the full interaction model (1), Tian et al. (2014) proposed a model with no main effects and modified covariates, defined as the product of each X i and the treatment T (= ±0.5): This corresponds to model (1) with only the interaction part and no main effects. The lasso penalty is used to perform variable selection in model (2).

Dimension reduction of the biomarker main effects
A compromise between including only interactions and all the main effects is achieved by reducing the space dimension of the main effects through a principal components analysis (PCA, Hastie et al., 2001). However, the principal components are computed independently of the outcome. To overcome this drawback, partial least squares (PLS, Martens and Naes, 1989) regression computes the linear combinations that are the most associated to survival. These orthogonal transformations allow reducing the number of parameters in the model: with Z 1 , . . . , Z K the first K p linear combinations of the biomarkers. We apply a lasso penalty to the interactions for identifying treatment-effect modifiers and α remains unpenalized.

The gradient boosting
Boosting techniques exploit the repeated fitting of a weak estimator to iteratively obtain a model maximizing the partial log-likelihood. In a high-dimensional setting, the process starts from the null model and one coefficient is updated at each step. This iterative process stops when the model achieves a balance between bias and variance. In this study, we focus on gradient boosting (Friedman, 2001;Bühlmann and Yu, 2003) which consists in a componentwise L2-boosting: the possible updates are computed on each predictor, and the best is retained after penalization to update the model. The treatment effect is estimated preliminarily and then fixed as an offset. The gradient boosting does not impose the hierarchy constraint.

Univariate approach with FDR control
A straightforward approach consists in testing the interaction of the treatment with each biomarker through a Wald test in a model incorporating the treatment, the biomarker and their interaction . In order to perform variable selection, we adjust the interaction p-values using the Benjamini and Hochberg procedure (1995) and keep biomarkers with adjusted p-value < 0.05.

Arm-specific prognostic effects
The last two approaches considered estimate the arm-specific prognostic effects of biomarkers and then compare them to identify treatment-effect modifiers. The so-called two-interaction model (two-I model) replaces in model (1) the main effects by a second interaction term for the control arm: Alternatively, we incorporate the estimates of the two-interaction model (4) in the full interaction model (1) as weights of an adaptive lasso penalty. Let γ Ri+ and γ Ri− be the estimates of model (4) under a ridge penalty; model (1) is now penalized as follows: These arm-specific prognostic weights (Aspw) yield smaller weights for interactions with larger difference between the two arm-specific prognostic effects. In addition, the higher such difference, the lower the weight of the associated main effect, which favors (even though not forces) the hierarchy constraint.

Implementation
The methods outlined so far were implemented in R v3.1.2 with packages: glmnet (Friedman et al., 2010;Friedman et al., 2016) for lasso and ridge penalties, grplasso (Meier, 2015) for group-lasso penalty, corpcor (Schäfer et al., 2015) for principal components, plsRcox (Bertrand et al., 2014) for PLS, and mboost (Hothorn et al., 2016) for gradient boosting. Of note, the grplasso R-package can only be used for generalized linear models, thus, we implemented a Poisson model over two-month intervals, corresponding to a piecewise constant hazard model which approximates rather well the Breslow estimator in the Cox model (Pawitan, 2005). We use fivefold cross-validation (CV, van Houwelingen, 1993, 1994) to estimate the shrinkage parameters λ and λ 2 , the number of principal components K, and the number of iterations for the gradient boosting (with shrinkage factor (ν) fixed to 0.1). As CV for the PLS gave extremely poor results (data not shown), we decided to keep only the first component. The R code is provided in the supplementary material.

Simulation study
We performed simulations to compare methods in terms of selection of the treatment-effect modifiers in high-dimensional Cox models. We also evaluated the interaction strength of the selected biomarkers.

Data generation
We generated p = 500 or 1000 unit-variance (σ 2 = 1) Gaussian biomarkers with autoregressive correlation (σ i j = 0.7 |i− j| ) within 25-or 50-biomarker blocks, respectively. A total of n = 500 patients per data set were randomly assigned (1:1) to the experimental or control arm. We generated exponential survival times by fixing for each treatment arm: m 0 , the baseline (i.e., x T = (0, . . . , 0) T ) median survival time; m 1 , the median survival time for a one-unit increase of an active biomarker; and the associated log-hazard ratio β = log(m 0 /m 1 ). We generated independent censoring from a U (2, 5) distribution, reflecting a trial with three-year accrual and two-year follow-up. Table 1 shows the null (i.e., no treatment-effect modifier) and alternative scenarios (i.e., at least one treatment-effect modifier) considered in the simulations. In the complete null scenario, m 0 = m 1 = 1, whatever the treatment. In the second null scenario, there was a strong treatment effect and in the third, 10 (for p = 500) or 20 (for p = 1000) biomarkers were strongly prognostic. In alternative scenarios 4 and 5, no biomarker had an effect in the control arm, while one and 10 or 20 treatment-effect modifiers doubled the median survival time in the treatment arm (HR+ = 0.5). In scenario 6, we combined prognostic biomarkers (HR+ = HR-= 0.5) and treatment-effect modifiers (HR-= 1, HR+ = 0.5).  Simulations were performed with p = 500 (scenarios 1a-6a) and p = 1000 (scenarios 1b-6b). The censoring rate was 10-37%. We also considered a setting tuned on the application of Section 4 (scenario 1c-6c): higher censoring (60-80%) and lower biomarker effects (randomly drawn from U (−0.5, −0.1) for main effects and U (−0.7, −0.1) for interactions). The biomarkers were randomly allocated to correlation blocks, independently of their effects. As sensitivity analyses, we considered Weibull times with shape 0.5 and 2 (Supporting Information Fig. S1), and we forced correlation between predictive, and between predictive and prognostic biomarkers (Supporting Information Fig. S2). For each data set, another one with the same parameters was generated for external validation to evaluate the interaction strength of the developed signatures.

Evaluation criteria
The primary objective of this study was to compare methods to correctly identify the true treatmenteffect modifiers. The secondary objective was to evaluate how well the selected biomarkers predict a differential treatment effect for future patients. We considered two criteria for selection of true positive (TP) treatment-effect modifiers: the false discovery rate FDR = FP/(TP + FP), that is, the rate of false positive (FP) biomarkers among those selected (Genovese and Wasserman, 2002), and the false negative rate FNR = FN/(TP + FN), that is, the rate of false negatives (FNs) among the true treatment-effect modifiers (Pawitan et al., 2013). In scenarios including main effects, we reported the number of prognostic FPs. We also computed the area under the precision-recall curve (AUPRC, Davis and Goadrich, 2006), translating the ability of discarding inactive biomarkers more likely than active ones independently of the tuning parameters. The AUPRC, based on FDR and FNR, is more pertinent than the area under the ROC curve when there are many more inactive than active biomarkers.
Selection methods can also serve as a global test of the presence of any interaction signal in the biomarker set Michiels and Rotolo, 2015); the empirical rejection probability is the type-I error rate in null scenarios (corresponding to the FDR as any selected biomarker is a FP), and is the power in the alternative scenarios.
The biomarker-treatment score for patients in the validation set (V) is the cross product between the coefficientsγ tr of the interactions retained in the training set (tr) and their biomarkers: The lower the score, the higher the treatment benefit. We propose a gene signature interaction strength criterion similar to Schemper (1988) and Michiels et al. (2011), measuring the concordance betweenη j and the survival time in each treatment arm. We estimated such within-arm concordance via the C-statistic of Uno, one of the least biased estimators in the presence of censoring (Uno et al., 2011). Then, we computed the difference of the two C-statistics ( C-statistics): the larger the difference, the higher the interaction strength. We computed the C-statistics both in the training and validation sets.

Results
Tables 2 and 3 and Figs. 1 and 2 summarize the results of simulations. The methods can be grouped into four main groups according to their performances.
The univariate approach controlled well the type-I error in null scenarios (Table 2), but was very conservative in alternative scenarios, especially in presence of main effects (Table 3). This is illustrated by low FDR and large FNR (Fig. 1), and by very low power (e.g., 0.59 and 0.43 in scenarios 6a and 6b, Table 2). Low FDR and large FNR are not due to the adjusted p threshold chosen for discarding biomarkers, as shown by the AUPRC, which is often ࣘ0.5 (Table 3). Consequently, the interaction strength is low (Fig. 2).
The second group contains methods not performing selection of the main effects: the modified covariates, dimension reduction approaches (PCA+lasso and PLS+lasso), and ridge+lasso. In the null scenarios, their type-I error was moderate (0.24-0.58) to high (0.49-0.61) with low and high censoring, respectively (Table 2). In alternative scenarios (Fig. 1, Table 3) with no main effects (scenarios 4-5), these methods identified most of the treatment-effect modifiers (small FNR), but together with many FPs (large FDR); they predicted fairly well the individual treatment benefit (Fig. 2). However, in presence of main effects (scenario 6), these methods were more conservative and missed most of treatment-effect modifiers, implying low power (Table 2) and low interaction strength (Fig. 2); these figures were lower for p = 1000 than for p = 500. In general, the PLS+lasso and ridge+lasso, which  account for the outcome in dimension reduction of the main effect matrix, performed slightly better than the PCA+lasso and modified covariates. This was less evident with smaller biomarker effects (scenario 6c). In addition, the poor results of the modified covariates were amplified when treatment-effect modifiers were also correlated with prognostic biomarkers (Supporting Information Fig. S2) while the PCA+lasso and the PLS+lasso were not impacted by such correlation. The AUPRC of these four methods was often ࣘ0.5. Of note, in the alternative scenarios, their interaction strength C-statistics were highly impacted by overfitting (Fig. 2), especially for the modified covariates and the ridge+lasso.
In the third group, the group-lasso and the two-I model identified properly the treatment-effect modifiers in general, but often together with several prognostic biomarkers, notably in null scenarios. In null scenarios without prognostic biomarkers (1-2a and 1-2b), the two-I model and the group-lasso had type-I error of 0.38-0.44 and 0.48-0.56, respectively, which increased to 1 in presence of main effects (scenario 3, Table 2). On the other hand, these methods identified almost all the treatmenteffect modifiers in alternative scenarios (Table 3), but with a lot (often >100) of FPs, notably for p = 1000. This led to very high FDR (Fig. 1) and power (Table 2). Despite the high number of FPs, the group-lasso had fairly large AUPRC (Table 3). However, in the scenario 6, it selected all the prognostic biomarkers (i.e., FPp), thus yielding high FDR and low AUPRC. Both the two-I model and the group-lasso had quite high C-statistics (Fig. 2).
In the fourth group, we found methods that do not impose the hierarchy constraint: the full-lasso and the three kinds of adaptive lasso (with either specific (Sw), grouped (Gw), or arm-specific prognostic (Aspw) weights). In the null scenarios 1 and 2, the adaptive lasso (Sw or Gw) and the full-lasso selected no biomarkers in most of cases, resulting in a low type-I error (Table 2). However, the latter was highly affected by the prognostic biomarkers (scenarios 3a and 3b, type-I error: 0.88-0.98). The adaptive lasso (Aspw) had a moderately high type-I error in scenarios 1-2a and 1-2b (0.36-0.42), and large in presence of only prognostic biomarkers (0.55, p = 1000). In the alternative scenarios, the full-lasso and two adaptive lassos (Sw and Aspw) identified most of the treatment-effect modifiers, with slightly worse results for p = 1000 than for p = 500. With p = 500, they well identified the treatmenteffect modifiers with few FPs and performed the best in terms of FNR, FDR (Table 3, Fig. 1) and interaction strength (Fig. 2). However, with p = 1000, the adaptive lasso (Sw and Aspw) performed much worse in presence of main effects (scenario 6b) by selecting much less biomarkers (on average: 14 and 13 biomarkers, respectively, and FNR = 0.61, Table 3); this poor results were amplified when treatment-effect modifiers were also correlated with prognostic biomarkers (Supporting Information Fig. S2). The adaptive lasso (Gw) selected the null model in most of cases (leading to a FDR 0, FNR 1, low power and interaction strength; Table 2, Figs. 1 and 2). However, in terms of AUPRC, it performed as well as the full-lasso and the two other versions of adaptive lasso (Table 3). This means that the selected tuning parameter is too high and that the method could perform better by selecting a lower parameter. With higher censoring and lower effects sizes (scenarios 4-6c), the four methods performed similarly to the other methods, by selecting fewer biomarkers (Table 3) and with low AUPRC (Table 3). Although not forced, the hierarchy constraint is maintained for more than 50% of the interactions in alternative scenarios.
The gradient boosting did not behave like any of the abovementioned groups of methods. In null scenarios, similarly to the group-lasso, it had high type-I error (0.66-1) even with low censoring (scenarios 1-3a and 1-3b; Table 2). However, in alternative scenarios, it performed as well as the methods not imposing the hierarchy constraint: good selection of the treatment-effect modifiers, with few FPs (Table 3). In addition, it was not that conservative with p = 1000. The gradient boosting had quite good interaction strength, but was highly impacted by overfitting in some scenarios (Fig. 2).
In the sensitivity analysis generating data from Weibull distribution, the variation of the hazards over time (decreasing or increasing) did not affect the relative performances of any method (Supporting Information Fig. S1).

Application
We applied the methods compared so far to publicly available data from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo, last accessed on June 1, 2016) to identify treatmenteffect modifiers in 614 breast cancer patients (Desmedt et al., 2011;Hatzis et al., 2011) receiving anthracycline-based adjuvant chemotherapy with (n = 507) or without (n = 107) taxane. The threeyear distant-relapse free survival was 78% (95% CI: 74-82%) and 79% (95% CI: 71-87%), respectively. We preprocessed the expression data of 22,277 genes (Affymetrix array) via frozen robust multiarray (McCall et al., 2010) and cross-platform normalization (Shabalin et al., 2008). We removed genes with interquartile range ࣘ1. We standardized the remaining p = 1689 genes.  Patients were randomly assigned to a training set (315 patients) and a validation set (299 patients). The results (Table 4) show a large spread in the number of selected biomarkers between methods: 0-39. The modified covariates, PCA+lasso, PLS+lasso, and ridge+lasso were highly concordant, with 14 biomarkers selected by at least three methods. Most of those selected by the two-I model (10/34) and group lasso (3/4) were not selected by the other methods. All the biomarkers selected by the univariate approach were also selected by the gradient boosting. The alasso (Aspw) identified two genes and the alasso (Sw) identified only the IFIH1 gene, common to almost all the other methods except for the group-lasso and the univariate approaches. Importantly, the IFIH1 gene expression is already known for being associated with recurrence in nonresponders to taxane-based chemotherapy in early breast cancer (Magbanua et al., 2015) and is included in two patents to predict the benefit of taxanes (Gehrmann and Von Törne, 2009;Wang et al., 2014). Functional studies also suggested that the IFIH1 expression is associated with resistance to taxanes in prostate cancer (Marín-Aguilera et al., 2011). This single gene had a moderate interaction strength ( C-statistics = 0.06). By selecting more biomarkers, the other methods slightly improved interaction strength (Table 4). Interestingly, the gradient boosting and the alasso (Aspw) had the highest C-statistics (0.18 and 0.14, respectively).

Discussion
Parsimonious gene signatures aim at selecting patients more likely to benefit from a treatment (Buyse and Michiels, 2013;Hingorani et al., 2013). Although this needs identifying the treatment-effect modifiers , no clear guidance has been established yet on how to do in highdimensional spaces. Biomarker selection can have multiple aims, at least two of which of particular interest: selecting markers which have a biologic role and selecting patients likely to benefit from the therapy. In this paper, we focused on sparse selection and compared 12 methods to find treatment-effect modifiers while limiting the FP selection. Indeed, for practical use of the biomarker score, determining a few biomarkers with high accuracy (e.g., ELISA) is usually preferred compared to less-accurate technologies (e.g., mass spectroscopy). Obviously, the chosen methods undersample all the possible approaches. We also proposed a novel metric about interaction strength prediction because this is tightly related to correct biomarker selection.
Based on the results of the simulation study, different groups of methods were identified. First, the straightforward way for identifying interactions via univariate models should be avoided when C 2016 The Authors. Biometrical Journal Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com biomarkers are correlated, even with FDR control. If a method starting from the null model would be considered, the gradient boosting is an option, but it does not control at all the type-I error in null scenarios. Methods that do not perform selection of the main effects are quite well powered, except in presence of prognostic biomarkers as compared to methods performing selection on the main effects. Of note, keeping all the main effects could be a practical drawback, as sparse prognostic signatures can be assessed easily and reliably on different platforms (e.g., by RT-PCR). Dimension reduction of the main effect matrix independently of the outcome (PCA) has negative impact in presence of strong main effects. Conversely, methods performing variable selection on the main effects perform globally well, irrespective of whether the hierarchy constraint is forced. As an exception, the group-lasso is highly impacted by the presence of main effects due to the groups that can not separate the main effect and the interaction of each biomarker. To overcome this drawback, testing strategies could be considered to evaluate remaining interactions (Lockhart et al., 2014). In any case, further arbitrary choices would be required. Another further development could consist in an adaptive group-lasso with weights based on the interaction only. The model estimating the arm-specific prognostic effects identified rather well the treatment-effect modifiers in the simulations study. Nevertheless, to allow for treatment modifiers that have prognostic effects in each of the treatment arms, a contrast test would be needed between the two prognostic effects, which is not straightforward in a penalized model. Finally, we would like to stress: (i) the importance of external data for investigating the interaction strength, as most of the methods are impacted by overfitting, notably the modified covariates and the two-step methods; (ii) and that the identification of interactions is difficult and requires a lot of events, whatever the method.
For all but one methods, the list of interactions retained is defined by a tuning parameter estimated via the van Houwelingen (1993, 1994) CV criterion. However, note that this technique is sometimes suboptimal, finding too conservative tuning parameters for the adaptive lasso (Gw) and too lenient ones for the group-lasso. For this latter case, we previously reviewed and compared empirical extensions of the lasso penalty to reduce the FPs in high-dimensional Cox regression models (Ternès et al., 2016).
In the application to the gene expression data set in breast cancer, the methods behaved quite differently in terms of number of selected biomarkers and of interaction strength. This application can only be viewed as an illustration, as no gold standard is available to infer which biomarker truly modifies the treatment effect and the expression data were not collected in the context of a randomized clinical trial. However, the gene selected by most of the methods was already suggested to predict resistance to taxane-based chemotherapy in the literature. Adjusting the models on clinical factors was beyond the aim of the present study but is another interesting point. Such factors could be considered as prognostic factors not subjected to competitive selection, or they could be used as candidates for treatment-covariate interactions, just like the biomarkers. Yet, no consensus has been reached on how to deal with clinical factors in this context. For a fair comparison, we estimated the effect of the clinical factors (nodal status and tumor grade) and fixed them as an offset for all the methods. The R code used to download the clinical and gene expression data and to apply the 12 methods is available as supplementary material.