Two-stage penalized regression screening to detect biomarker-treatment interactions in randomized clinical trials

High-dimensional biomarkers such as genomics are increasingly being measured in randomized clinical trials. Consequently, there is a growing interest in developing methods that improve the power to detect biomarker–treatment interactions. We adapt recently proposed two-stage interaction detecting procedures in the setting of randomized clinical trials. We also propose a new stage 1 multivariate screening strategy using ridge regression to account for correlations among biomarkers. For this multivariate screening, we prove the asymptotic between-stage independence, required for familywise error rate control, under biomarker–treatment independence. Simulation results show that in various scenarios, the ridge regression screening procedure can provide substantially greater power than the traditional one-biomarker-at-a-time screening procedure in highly correlated data. We also exemplify our approach in two real clinical trial data applications.

and treatments in randomized clinical trials is of growing interest.
Discovering biomarker-treatment interactions helps identify predictive biomarkers: biomarkers that influence treatment efficacy can be used to find the subgroup of patients who are most likely to benefit from the new treatment, as well as to predict subgroup treatment effects. Consequently, new adaptive design approaches can be used in settings where there are genetically driven subgroups to improve efficiency (Wason et al., 2015). Furthermore, the discovery of novel biomarker-treatment interactions may result in the identification of new disease susceptibility loci, providing insights into the biology of diseases. Such outcomes are very much aligned with the goals of precision medicine: to enable the provision of "the right drug at the right dose to the right patient" (Collins and Varmus, 2015).
Detecting biomarker-treatment interactions in largescale studies of human populations is a nontrivial task, which faces several challenging problems (McAllister et al., 2017). Traditional interaction analysis, using regression models to test biomarker-treatment interactions one biomarker at a time, may suffer from poor power when there is a large multiple testing burden, for example, when performing such analysis on a genome-wide scale for genetic biomarkers. Standard genotyping microarrays measure half a million or more variants and, when combined with whole genome imputation, can lead to millions of biomarkers to consider. Another type of omics, metabolomics-the measurement of metabolite concentrations in the body-may have a more direct effect on drug efficacy and is also becoming increasingly widely assayed (Beckonert et al., 2007).
In the context of gene-environment interaction studies, there is now a significant literature of statistical methods, which exploit aspects of the study design to improve power thus mitigating the multiple testing burden. These include case-only tests (Piegorsch et al., 1994), empirical Bayes (Mukherjee and Chatterjee, 2008), Bayesian model averaging (Li and Conti, 2008), and two-stage tests with different screening procedures (Kooperberg and LeBlanc, 2008;Murcray et al., 2008;Wason and Dudbridge, 2012;Gauderman et al., 2013). To alleviate the multiple testing burden, two-stage methods use independent information from the data to perform a screening test to select a subset of genetic biomarkers and then only test interactions within this reduced set. Since there is a clear analogy to gene-environment interaction problems, in this paper, we will examine how existing gene-environment interaction testing methods may be modified so that they are transferable to the biomarker-treatment setting (Dai et al., 2009Wang and Dai, 2016). One significant drawback of the traditional two-stage approach testing each biomarker one at a time is that the univariate screening tests will harm power of the overall two-stage procedure when there exist substantial correlations between biomarkers. We also propose a novel screening test in this two-stage framework, which utilizes ridge regression to model correlated high-dimensional data at stage 1. We prove that this new two-stage method is able to preserve the overall familywise error rate given independence between the treatment and biomarkers. Furthermore, it is shown by simulations and real data applications that the new method can provide better performance than traditional one-biomarker-at-a-time approaches for correlated biomarkers. In the context of more general variable selection settings, screening strategies have been explored to focus algorithms on a reduced search space (Fan and Lv, 2008;Wang and Leng, 2016). In this work, we explore the use of variable prescreening specifically to help identify interactions and the condition required for controlling the familywise error rate.

Standard single-step one-biomarker-at-a-time interaction tests
In the context of randomized clinical trials, one can test each biomarker in turn for a biomarker-treatment interaction using the following linear model: with denoting the response outcome, the binary treatment-control indicator, and 1 , … , representing the values of biomarkers, for the th patient. The null hypothesis × = 0 could be tested for each = 1, … , , for example, using a Wald test with the Bonferroni correction applied to preserve the familywise error rate.
The number of biomarkers to be considered is potentially large. Given the desired overall familywise error rate , a Bonferroni correction (Dunn, 1961) requires an adjusted significance level for each individual test to be ∕ . Although the Bonferroni correction is typically used for its simplicity and flexibility, with regard to our interest in high-dimensional interaction testing it is worth exploring whether other procedures are able to provide improved efficiency. In Web Appendix A, we demonstrate theoretically some alternative familywise error rate controlling methods (Šidák, 1967;Holm, 1979) can only provide a small improvement across the settings we consider in this paper: when is large and only a small subset of biomarkers have true interactions with treatment.

2.2
Two-stage interaction tests with some existing screening methods Two-stage approaches use a screening test as a filtering stage (stage 1) to select a subset of biomarkers, and then in stage 2, only test interactions within the reduced set of biomarkers, thus increasing power. To preserve the overall familywise error rate, two-stage approaches rely on the stage 1 screening tests being independent of the final stage 2 tests.
A common stage 1 screening test used in two-stage interaction testing is a marginal association test (Kooperberg and LeBlanc, 2008). Considering this type of screening test in the clinical trial setting, the marginal effect of a biomarker on the outcome can be measured in a regression model of the form The screening procedure is conducted by testing the null hypothesis = 0 for = 1, … , , with a prespecified significance level 1 ∈ (0, 1). In stage 2, one then tests interactions using the one-biomarker-at-a-time model (1) within the set of biomarkers selected at stage 1. Another way to utilize stage 1 information is to test all biomarkers in stage 2 using weighted significance levels, that add up to the targeted error rate , based on ordered biomarkers from stage 1. One possible weighting scheme (Ionita-Laza et al., 2007) is as follows: the most significant biomarkers, that is with the lowest -values in stage 1, are compared with an adjusted significance level ( ∕2)∕ , the next 2 biomarkers are compared with ( ∕4)∕(2 ), …, the next 2 biomarkers are compared with ( ∕2 +1 )∕(2 ), and so on.
The motivation of conducting marginal association tests to screen for candidate interaction tests is that we expect a biomarker that has an interaction with the treatment for the disease will also show some level of marginal association with the response. However, it is also possible that the biomarker's main association with response and the interaction effect may be in opposite directions. When this is the case, a marginal screening strategy would downgrade due to the first stage test statistic having low power.
To preserve the overall familywise error rate, a key requirement to apply the two-stage approach is the independence between stage 1 and 2 tests. Both Murcray et al. (2008) and Dai et al. (2012) proved the following: with stage 1 and 2 test statistics being asymptotically independent and * defined as the number of stage 1 selected biomarkers, using a Bonferroni adjusted significance level = ∕ * at stage 2 to test interactions within the reduced set is sufficient to preserve the overall familywise error rate of the two-stage procedure under .
In the context of gene-environment interaction studies, an alternative type of screening is testing the correlation between a gene and the environmental factor (Murcray et al., 2008). This type of screening requires case-control sampling for a rare response endpoint, thus it can be useful for detecting biomarker-treatment interactions in large prevention trials. However, such a screening procedure is not generally applicable in randomized clinical trials, where the rare response condition does not hold. In this case, the trial population represents the entire dataset and cases (responders) are not "oversampled." We make this argument and also discuss the applicability of other related proposals more formally in Web Appendix B.

2.3
New stage 1 penalized regression screening procedure accounting for biomarker-biomarker correlations One drawback of existing two-stage interaction testing procedures is that biomarkers are only screened one at a time in stage 1. This ignores correlations between the biomarkers. In a high-dimensional, low-sample-size dataset, an ordinary multivariate regression analysis testing each predictor, while accounting for correlations with the other predictors, is not feasible. Therefore, we considered penalized regression methods to model correlated high-dimensional data. These techniques have improved the development of risk predictors from high-dimensional genomic information (Wu et al., 2009;Newcombe et al., 2017).
We propose a new stage 1 multivariate screening test of the following form to account for biomarker-biomarker correlations: This multivariate version of the marginal association screening test also includes the treatment main effect term. This is necessary to preserve the independence between stage 1 screening and stage 2 interaction tests as described later.
To fit this multivariate model, we use ridge regression, which applies regularization to avoid overfitting in high-dimensional low-sample-size problems. Typically, the objective of ridge regression is to minimize a loss function along with an 2 regularization term: ( ) + || || 2 2 , where || || 2 2 = 2 + ∑ =1 2 and is the regularization parameter. Ridge shrinks all the estimated coefficients towards zero, but will not set them exactly to zero. For use in a two-stage interaction testing strategy, we propose ordering the biomarkers based on the ridge coefficients obtained from stage 1, and then use the resulting ranking to determine varying significance thresholds across buckets of markers during stage 2 one-at-atime interaction tests according to the weighting scheme described in Section 2.2.

Proof of independence between stage 1 penalized regression screening and stage 2 standard interaction tests
In this section, we show that independence between stage 1 and stage 2 test statistics holds for stage 1 ridge regression screening tests.
For the th subject, let denote the outcome variable, = ( , 1 , … , ) be a vector of the binary treatmentcontrol indicator and biomarkers. Consider the proposed stage 1 marginal association screening test based on the multivariate model of the form where = ( , 1 , … , ) . The model underlying the stage 2 standard one-biomarker-at-a-time interaction test is of the form where = ( , , ) and = ( , , × ) . The above forms ignore intercepts without loss of generality. Homogeneity of variance is assumed, that is var( | ) and var( | ) are assumed to be constants. We first show the between-stage asymptotic independence for the stage 1 multivariate regression marginal association estimator without regularization.
The proof is provided in the Appendix. Previous works (Dai et al., 2012) have demonstrated that the stage 1 univariate marginal association screening tests are independent of the stage 2 one-biomarker-at-a-time interaction tests. Theorem 1 extends this to show independence still holds when stage 1 tests are extended to a multivariate regression. Our proof relies on (1) the inclusion of the treatment main effect in the multivariate regression of the form (3); (2) an assumption of independence between the treatment assignment and biomarker values, which is valid in randomized clinical trials. The proof in Dai et al. (2012) for the univariate marginal association screening tests is more general; it does not depend on biomarker-environment independence, and it also holds for generalized linear models.
Next we establish the asymptotic distribution of the ridge estimator.
in distribution, whereˆis the ridge estimator,  is a normal distribution, and Σ are a constant and an invertible constant matrix.
Based on the asymptotic results derived in Lemma 1 and Theorem 1, we are able to prove the asymptotic independence between the stage 1 ridge marginal association screening estimator and the stage 2 one-at-a-time interaction estimator in the following corollary.
Proofs of Lemma 1 and Corollary 1 are given in Web Appendices C and D.

Simulation study
To evaluate performance of our proposed biomarkertreatment interaction testing procedure described above, we generated simulated data sets, each having = 1000 biomarkers. Data were simulated under the model = 0 + + ∑ =1 ( + × × ) + , where the treatment main effect was set to = 0.5 and the intercept 0 = 0. We partitioned the 1000 biomarkers into 50 clusters of correlated biomarkers, containing 20 biomarkers each. We denote the clusters 1 = { 1 , … , 20 }, 2 = { 21 , … , 40 }, and so on. One biomarker in the first cluster was ascribed a main effect and an interaction effect, that is 1 = 0.5 and 1 × = 1. Four other biomarkers in four other different clusters were ascribed main effects on the trait without interactions, that is 21 = 41 = 61 = 81 = 1.5. All other biomarkers do not have direct effects on the outcome. Each biomarker was generated from a standard normal distribution  (0, 1), and the binary treatment assignment was drawn from a (0.5) distribution, while was generated from a normal distribution with standard deviation 5. In this case, the proportion of variance explained by the true model is 0.292. We consider two types of correlation patterns among biomarkers: (1) The 20 biomarkers within each cluster are correlated with each other ( = 0.6), but there are no correlations between biomarkers in different clusters; (2) all biomarkers are independent of one another ( = 0). For each scenario, 1000 replicate datasets were generated to estimate power and familywise error rates. Power for all the approaches is defined according to the idea of "cluster discoveries" in Brzyski et al. (2017) as (reject at least one 0 for any ∈ | at least one 1 is true for any ∈ ), where 0 is the null hypothesis for and 1 is the alternative hypothesis for .
Four different screening procedures are compared: (1) "Univariate screening (threshold)": A selection of biomarkers to take forward to stage 2 is based on significance in a regression of response on the biomarkers one at a time, of the form (2). A significance level 1 = 0.05 is used without adjustment for each stage 1 biomarker test.
(2) "Univariate screening (rank)": All biomarkers are taken forward to stage 2, and the stage 1 -value ranking is used to conduct a stage 2 weighted hypothesis test described in Section 2.2 with = 5 {a number recommended by Gauderman et al. (2013)}. (3) "Ridge screening (rank)": Ridge regression is used to estimate marginal effects at stage 1. Then all biomarkers are ordered based on these stage 1 coefficients, and the rank will be used by the stage 2 weighted hypothesis test with = 5. The optimal is chosen based on fivefold cross-validation errors. The R package glmnet (Friedman et al., 2010) was used. (4) "No screening": A standard single-step interaction test of the form (1), targeting an overall familywise error rate = 0.05, is performed as a baseline comparator (with a Bonferroni correction applied with = 1000) and also as the stage 2 test for all three two-stage approaches described above.
In Figure 1(A), with highly correlated biomarkers, the proposed ridge regression screening procedure demonstrated substantially higher power than the univariate screening procedures, showing a clear benefit of accounting for correlations between the biomarkers at stage 1. For the univariate screening procedures, all the biomarkers with univariate marginal signals, including 1 , … , 100 , were likely to be retained after screening in the "threshold" approach or land into the top buckets at stage 2 in the "rank" approach. In contrast, the ridge-screening procedure considered the effect of each biomarker, adjusted for all other biomarkers, and therefore tended to ascribe less evidence to biomarkers whose marginal associations were exaggerated by correlation with the true signal(s). Thus, biomarkers with true marginal associations, which are more likely to have interactions, tended to be ranked in the top buckets because of accounting for biomarkerbiomarker correlations at stage 1. This enhanced the power of the overall two-stage approach compared with using the univariate screening procedures. In Figure 1(B), with independent biomarkers, where the multivariate regression is not required for unbiased effect estimation, the univariate screening and the ridge-screening procedures using weighted hypothesis tests perform similarly. All three two-stage tests outperformed the single-step interaction test by providing better power at the same familywise error rate level whether biomarkers are correlated or independent.
In Figure 1(C), we simulated scenarios with one biomarker having an interaction, no correlations among the biomarkers, and changed only the main effect of the interacting biomarker 1 , that is main effects of the other four biomarkers were the same as the previous scenario. The sample size was fixed at 1500. Figure 1(C) reveals that there are some special cases, in which the main and interaction effect parameters are in opposite directions such that they cancel out, where all two-stage approaches give lower power than a single-step test.
In Figure 1(D), we used the previous scenario with one biomarker having an interaction (biomarker correlation = 0.6, sample size of 1500) as the base and changed only the main effects of the four biomarkers with main effects alone 21 , 41 , 61 , 81 . Figure 1(D) shows that power of all four tests decreases with increasing effect sizes of main-effect only biomarkers, because the proportion of variation explained by the interactioneffect biomarker decreases. The univariate screening using weighted hypothesis testing performs worse than the single-step test when effect sizes of four main-effect biomarkers become too large. This is because a large number of biomarkers that only have marginal associations, and no interaction, tend to fall into the top buckets, thus the bucket size allocated to the true interaction signal can lead to a more stringent significance threshold than that allocated by the single-step test through the Bonferroni adjustment accounting for all biomarkers. The ridgescreening strategy still outperforms the single-step test, despite the biomarkers with marginal effects only exhibiting very strong stage 1 associations; their many correlated proxies are still screened out through multivariate modeling.
In Web Appendix E, we summarize familywise error rates in different scenarios, which shows no inflation for all the screening procedures. We also provide additional simulation results. Relative patterns of performance among the screening strategies were consistent with the results described above, demonstrating the robustness of our method and findings. F I G U R E 1 Comparison of two-stage interaction tests with different screening testing procedures. Four were compared: univariate screening (threshold) (long dashes), univariate screening (rank) (short dashes), ridge screening (rank) (dot-dash), and no screening (solid). The four panels represent: (A) highly correlated biomarkers ( = 0.6), (B) independent biomarkers ( = 0), (C) independent biomarkers ( = 0, sample size of 1500), changing the main effect of the interacting biomarker 1 , and (D) highly correlated biomarkers ( = 0.6, sample size of 1500), changing the main effects of the four biomarkers 21 , 41 , 61 , 81

Data applications
In addition to validating our methods through simulations, we exemplified our approaches in two real data applications. We first applied our approaches to data from the randomized controlled trial START (Fonagy et al., 2020), which is composed of 684 participants aged from 11 to 17 with antisocial behavior, half of whom were treated with management as usual (the control arm) and the rest were treated with multisystemic therapy followed by management as usual (the treatment arm). We used a secondary outcome of this trial, the 18 months' follow-up outcome from Inventory of Callous and Unemotional Traits, as the continuous outcome and applied our interaction testing procedures to detect covariates having interactions with the treatment. We excluded covariates with more than 10% missing data and used mean imputation to replace missing values for covariates with less than 10% missing data. As a result, 75 covariates were included in the analysis. Correlation among these covariates is generally low (a correlation plot is provided in Web Appendix F).
We performed all four screening procedures described in the previous section with a significance level of = 0.05 and did not find any significant interactions. The top covariates from each of the univariate screening and ridge-screening procedures are presented in Table 1, which shows that the selected covariates from these two procedures are similar in this dataset where covariates have low correlation.
In the second application, we applied our approaches retrospectively to a publicly available dataset with highdimensional gene expression biomarkers (the PREVAIL trial) . The dataset is a phase II randomized trial, which aimed to evaluate the efficacy of lactoferrin as a preventative measure for hospitalacquired infections. Gene expression data are available for 61 patients from the National Center for Biotechnology Information (NCBI) website (GSE118657). Of the 61 patients, 32 patients were in the lactoferrin group, and the remaining patients were in the placebo group. We used the Sequential Organ Failure Assessment score measuring change in organ function postrandomization as the continuous response endpoint. From a total of 49,495 genes, we restricted our analysis to the 10,000 probes with the highest variability.
All four methods described in the previous section with a significance level of = 0.05 did not find any significant biomarker-treatment interactions. A list of the top biomarkers from different marginal screening procedures is presented in Table 1. The rankings of selected covariates are notably different between the ridge regression screening and the univariate screening procedures, likely owing to the high correlation among the biomarkers. In addition, we examined the empirical correlation between stage 1 ridge screening and stage 2 interaction test statistics applied in the above two real datasets. Table 2 summarizes results from Pearson correlation tests, which shows that the empirical correlation between stages is close to zero and in all cases the 95% confidence interval contains zero as expected.

DISCUSSION
We propose, for the first time with formal justification, the use of ridge regression in a two-stage interaction testing framework for identifying biomarker signatures of treatment efficacy in randomized clinical trials. Interaction testing frameworks which are designed to scale to large numbers of covariates will become ever more important as omics technologies continue to drop in price and become routinely measured in clinical trials. Naturally, there will be variation in the level of correlation among different sets of omics-based biomarkers from one setting to the next. For instance, when there is a strong a priori hypothesis of which genes influence treatment efficacy, such that a panel of genetic markers are all taken from the same region, pairwise correlations will be stronger on average compared to a genome-wide panel of variants, because local genetic correlations tend to be much stronger than long-range correlations (known as linkage disequilibrium decay). Similarly, considering transcriptomics, correlations will be stronger when focusing on a subset of genes that correspond to the same pathway. Therefore, the ridge-screening approach will be particularly well motivated when related biomarkers of a priori interest have been preselected, for instance, from a gene region or pathway. These biomarker sets will tend to exhibit the strongest correlation structures, and so will benefit the most from multivariate modeling during stage 1 screening. It is known that ridge regression has a tendency to average effects across strongly correlated covariates. This phenomenon is not desirable for a screening strategy since it could inflate the number of noninteracting biomarkers being put forward to stage 2. Thus, lasso (Tibshirani, 1996), as an alternative penalized regression model, which does not exhibit this effect-averaging behavior, may be expected to perform better. However, as lasso uses a 1 penalty, which is not a smooth function, it is challenging to prove it meets the between-stage independence requirement to preserve the overall familywise error rate in two-stage approaches. Since the main goal of employing the penalized regression screening procedures in stage 1 is to account for biomarker-biomarker correlations, some less computationally intensive multiple testing correction methods for correlated tests might be beneficial (Nyholt, 2004;Gao et al., 2008). However, applying such methods which calculate an "effective" number of independent tests to the single-step interaction test in a limited set of simulations did not offer any power improvement when controlling for the same familywise error rate (results not shown). We suggest further investigation in how to incorporate these methods into the two-stage interaction framework including a formal justification of the familywise error rate control as a topic of future work.
We also showed that there exist special cases where our proposed two-stage screening strategy offers no benefit, for example, the case when the main effect of a biomarker and its interaction effect with the treatment to the response is in opposite directions, which reduces the strength of the marginal association (sometimes leaving no detectable marginal effect) for true interactions. We suggest exploring the weighting scheme thus changing how much stage 1 information to be used in the following stage 2 tests as a future topic for investigation. Another technical caveat was shown by Sun et al. (2018) that, for logistic regression, the interaction estimator under treatment misspecification can be biased when the biomarker is associated either indirectly or directly with the outcome. This is a generic issue to interaction modeling using logistic regression, but could manifest in our framework as an elevated familywise error rate at stage 2 one-biomarker-a-time tests. Therefore, we highlight that, currently, our theoretical work only guarantees familywise error rate control when using linear regression. The extent to which this bias might inflate familywise error rates when applying our framework using logistic regression, and potential corrections, will be the topic of future work.

A C K N O W L E D G M E N T S
This work was funded by the UK Medical Research Council (grant number MR/R502303/1 to J.W., grant number MC_UU_00002/9 to A.P. and P.J.N., grant number MC_UU_00002/6 to J.M.S.W.). P.J.N. acknowledges support from the NIHR Cambridge Biomedical Research Centre. The authors thank the START trial investigators for use of their data.

D ATA AVA I L A B I L I T Y S TAT E M E N T
START data can be accessed through the procedure described in Fonagy et al. (2020). PREVAIL data were derived from the NCBI website (https://www.ncbi.nlm. nih.gov/geo/query/acc.cgi?acc=GSE118657) .

S U P P O R T I N G I N F O R M AT I O N
Web Appendices referenced in Sections 2 and 3, and the R code for simulation studies and data applications are available with this paper at the Biometrics website on Wiley Online Library. Proof of Theorem 1 Based on the unified approach to proving between-stage asymptotic independence by Dai et al. (2012), we need to evaluate the covariance matrix We consider the second and the third terms: Thus, for the ( + 1) × 3 matrix ( ) ( ) −1 , the ( + 1)th element ( = 1, … , ) of the last column is proportional to which uses the independence between and and the assumption ( ) = 0 or ( ) = 0. Similarly, the first element of the last column is also zero.