A general framework for subgroup detection via one-step value difference estimation

Recent statistical methodology for precision medicine has focused on either identifying subgroups with enhanced treatment effects or estimating optimal treatment decision rules so that treatment is allocated in a way that maximizes, on average, predefined patient outcomes. Less attention has been given to subgroup testing, which involves evaluating whether at least a subgroup of the population benefits from an investigative treatment, compared to some control or standard of care. In this work, we propose a general framework for testing for the existence of a subgroup with enhanced treatment effects based on the difference of the estimated value functions under an estimated optimal treatment regime and a fixed regime that assigns everyone to the same treatment. Our proposed test does not require specification of the parametric form of the subgroup and allows heterogeneous treatment effects within the subgroup. The test applies to cases when the outcome of interest is either a time-to-event or an (uncensored) scalar, and is valid at the exceptional law. To demonstrate the empirical performance of the proposed test, we study the type I error and power of the test statistics in simulations and also apply our test to data from a Phase III trial in patients with hematological malignancies.

Correspondence: Dana Johnson, United Therapeutics Corp., 55 TW Alexander, Research Triangle Park, Durham, NC 27709, USA. djohnson@unither.com
SUPPORTING INFORMATION: Web Appendices, Tables, and Figures referenced in Sections 2-6 are available with this paper at the Biometrics website on Wiley Online Library, along with R code for performing RCT and observational study simulations for an uncensored outcome under Model 1, and for a censored outcome under Model (1/2)(a/b/c).
Data S1. Figure 1. Diagram of how the data are partitioned into chunks; "obs." stands for observations. Figure 2. Classification tree for subgroup membership under the ACTG175 data analysis with zidovudine + didanosine as treatment. Figure 3. Density plot of age among patients predicted to benefit from zidovudine + didanosine (dark grey fill), and among patients predicted not to benefit from zidovudine + didanosine (light grey fill).

| INTRODUCTION
Consider the setting in which an active treatment and a control are to be compared on the basis of some outcome, which may be continuous, discrete, or a possibly censored time-to-event. Due to patient treatment effect heterogeneity, it is possible that a subgroup of patients benefits from the active treatment while the overall estimated treatment effect is nonsignificant or borderline significant. When this happens, one may want to "salvage" the treatment by testing whether at least some subgroup benefits from treatment and, if the null hypothesis of no benefit is rejected, identifying the subgroup(s) with an enhanced treatment effect (Lipkovich et al., 2017). For example, Lipkovich et al. (2017) describe a Phase III clinical trial in patients with hematological malignancies that resulted in a hazard ratio (treatment vs. control) for overall survival that was borderline significant (one-sided p = 0.0367). Due to this borderline result, the sponsor decided to search for subgroups with improved benefit/risk ratios. To handle situations similar to those described above, statistical methods from the precision medicine literature are often applied. This literature mainly focuses on one of two aims: estimation of an optimal treatment regime, or identification of subgroups with enhanced treatment effects. Various parametric and nonparametric methods have been developed for the estimation of optimal treatment regimes with both uncensored outcomes (e.g., Watkins and Dayan, 1992; Murphy, 2003; Zhang et al., 2012a, b; Zhao et al., 2012) and censored time-to-event outcomes (e.g., Bai et al., 2017; Goldberg and Kosorok, 2012; Jiang et al., 2017; Zhao et al., 2015). Likewise, intensive research has been done on identifying subgroups with enhanced/differential treatment effects (e.g., Cai et al., 2010; Foster et al., 2011; Lipkovich et al., 2011; Li et al., 2019; Wang et al., 2019; Zhao et al., 2013). However, less work has been done in the area of subgroup testing. Here the focus is to test whether or not one treatment is better than the other for at least some subgroup of the population. As argued by Shi et al. (2020), this is a foundational question that needs to be answered prior to either estimating an optimal treatment regime or searching for subgroups with enhanced treatment effects. The existing subgroup testing literature includes parametric mixture models (Shen and He, 2015; Wu et al., 2016) and change-plane approaches (Fan et al., 2017; Kang et al., 2017; Wei and Kosorok, 2018). A limitation of these approaches is that the subgroup is assumed to have some parametric form, such as being defined by a linear function, and/or the treatment effect is assumed to be constant within the subgroup. Such assumptions may be restrictive in practical applications. These challenges motivate us to develop a general framework for testing for the existence of a subgroup with enhanced treatment effects, one that does not require specification of the form of the subgroup and allows heterogeneous treatment effects within the subgroup.
Our proposed test statistics are built on estimators of the value difference between an estimated optimal regime and a fixed regime that assigns everyone to the same treatment. Both uncensored outcomes and censored time-to-event outcomes are considered. To derive the asymptotic distributions of the test statistics under the null hypothesis, we adapt the one-step estimation procedure proposed by Luedtke and van der Laan (2016, LL16 henceforth), which can yield valid inference even under the exceptional law, that is, when the treatment effect of individual patients can be zero with positive probability. In addition, within the one-step estimation framework, we consider two schemes to partition observations into smaller chunks of data and show that one particular partition can enhance the power of the considered test statistic under certain settings. In Section 2, we introduce notation and assumptions. In Section 3, we propose several value difference estimators for both uncensored and censored outcomes. In Section 4, we introduce our test procedure. We study the performance of the proposed test statistics via extensive simulations in Section 5 and apply the proposed test to the Phase III trial data described in Lipkovich et al. (2017) in Section 6.

| STATISTICAL FRAMEWORK FOR SUBGROUP TESTING
We first introduce the framework in the case of an uncensored outcome. Let X be a p × 1 vector of baseline covariates taking values in 𝒳. Let Y be an observed, continuous outcome coded so that larger values are preferred, and let A be a binary treatment indicator, where A = 0 corresponds to control/standard of care and A = 1 corresponds to the active treatment. The observed data consist of independent and identically distributed (iid) copies of O = (X, A, Y), indexed by i = 1, …, n. Let Y*(a) be the potential outcome (Rubin, 1974) a patient would experience if he/she were to receive treatment option a, a = 0, 1. We assume that E{Y*(a) | X} = C_0(X) + a τ(X), where C_0(X) is the expected conditional outcome if the control were assigned to all patients with covariates X in the population, and τ(X) = E{Y*(1) | X} − E{Y*(0) | X} is the conditional average treatment effect (CATE).
We assume the stable unit treatment value assumption (SUTVA) holds, which implies that Y = Y*(A). In addition, we assume both the positivity assumption and the no-unmeasured-confounders assumption hold, which state that pr(A = a | X = x) > 0 for all x ∈ 𝒳 and that {Y*(0), Y*(1)} ⫫ A | X, respectively, where ⫫ denotes independence and ⋅ ⫫ ⋅ | X denotes conditional independence given X. Under these assumptions, it can be shown that τ(X) = E(Y | A = 1, X) − E(Y | A = 0, X). We are interested in testing whether or not the active treatment is beneficial, relative to control, for at least some patients in the population. Therefore, we test

H_0: τ(x) ≤ 0 for all x ∈ 𝒳 vs. H_1: there exists 𝒳_1 ⊂ 𝒳 with pr(X ∈ 𝒳_1) > 0 such that τ(x) > 0 for all x ∈ 𝒳_1, (2)

where, importantly, we allow a portion of the population to experience no treatment effect, that is, we allow pr{τ(X) = 0} > 0. This is referred to as the exceptional law. Let d be a treatment rule that maps information in X to a treatment decision in {0, 1}. The potential outcome under rule d is Y*{d(X)} = Y*(1)1{d(X) = 1} + Y*(0)1{d(X) = 0}, where 1(⋅) is the indicator function with 1(B) = 1 if the event B is true and 1(B) = 0 otherwise. The value of a treatment rule is defined as V(d) = E[Y*{d(X)}], and an optimal treatment rule, d^opt, satisfies V(d^opt) ≥ V(d) for all rules d. Under our considered model and assumptions, the optimal treatment rule is given by d^opt(X) = 1{τ(X) > 0}. Note that when τ(X) = 0 the defined optimal rule selects the control, which ensures a unique representation. Let d(X) ≡ 0 denote the regime that assigns everyone treatment 0, and define Ψ_{d^opt} = V(d^opt) − V(0), the value difference between the optimal rule and the fixed regime d(X) ≡ 0.
It can be shown that Ψ_{d^opt} = 0 under H_0, while it is positive under H_1. This motivates us to use estimators of the value difference as test statistics for the hypothesis in (2). If the outcome of interest is a right-censored time-to-event, the above framework can be modified. The observed data become O = {X, A, U, Δ}, where T is the event time of interest, C is the censoring time, U = min(T, C), and Δ = 1(T ≤ C). The definition of the CATE changes from a difference in conditional means to a difference in conditional restricted mean survival times. Let T*(a) denote the potential time to event had treatment a been administered, and let L be such that pr(U ≥ L) > 0. The restricted mean survival time (RMST) under treatment a, conditional on X, is defined as μ(a, X) = E[min{T*(a), L} | X], and the CATE becomes τ(X) = μ(1, X) − μ(0, X). In addition to the positivity assumption and extensions of SUTVA and the no-unmeasured-confounders assumption to time-to-event data, we assume that censoring is noninformative, that is, that C ⫫ {T*(0), T*(1)} | X. Under these four assumptions, τ(X) can be identified as a function of the observed data (see Web Appendix A in the Supporting Information). We consider the same hypothesis as defined in (2), but the value function is now defined based on the restricted mean survival time, that is, V(d) = E(min[T*{d(X)}, L]).

| Value difference estimators for an uncensored outcome
To construct estimators for the value difference Ψ_{d^opt}, we use the inverse propensity score weighted estimators proposed by Zhang et al. (2012b) for the value under the optimal rule and under the fixed regime d ≡ 0, respectively. Specifically, we define the estimators Ψ_IPW and Ψ_AIPW in (3) and (4), and let S_• serve as a placeholder for either S_IPW or S_AIPW.
In both Ψ_IPW and Ψ_AIPW, the estimator for V(d^opt), which corresponds to the first three terms of (3) and the first term of (4), is an augmented inverse probability weighted (AIPW) estimator. The estimator for V(0), which corresponds to the last term in (3) and (4), is an inverse probability weighted (IPW) estimator in Ψ_IPW but an AIPW estimator in Ψ_AIPW. When constructing the value difference estimators Ψ_IPW and Ψ_AIPW, the functions π, C_0, and τ are estimated from the data; the details are given in the next section.
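Displays (3) and (4) did not survive extraction. As a hedged sketch, consistent with the description above but not the paper's exact displays, a standard AIPW estimator of the value of a rule d takes the form:

```latex
% Sketch only: reconstructed from the text's description of (3)-(4).
\widehat{V}_{\mathrm{AIPW}}(d) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[
  \frac{\mathbb{1}\{A_i = d(X_i)\}}{\pi_d(X_i)}\, Y_i
  \;-\; \frac{\mathbb{1}\{A_i = d(X_i)\} - \pi_d(X_i)}{\pi_d(X_i)}\,
        \widehat{\mu}\{X_i, d(X_i)\}\right],
```

where π_d(X) = π(X)d(X) + {1 − π(X)}{1 − d(X)}, μ̂(X, 0) = Ĉ_0(X), and μ̂(X, 1) = Ĉ_0(X) + τ̂(X). Under this sketch, Ψ_IPW pairs V̂_AIPW(d̂^opt) with the plain IPW estimator n^{-1} Σ_i 1(A_i = 0)Y_i/{1 − π̂(X_i)} of V(0), while Ψ_AIPW uses V̂_AIPW(d̂^opt) − V̂_AIPW(0), matching the description in the surrounding text.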
Remark 1. Inspection of (4) reveals that Ψ_AIPW is exactly equal to 0 whenever 1{τ̂(X_i) > 0} = 0 for all i. In the regular setting, that is, when pr{τ(X) = 0} = 0, we have τ(X) < 0 with probability 1 under H_0. Therefore, under H_0, Ψ_AIPW would be exactly equal to 0 asymptotically, because τ̂(X_i) becomes negative for all i as n goes to infinity, as long as τ̂ is a consistent estimator of τ. This implies that the limiting null distribution of Ψ_AIPW is degenerate in the regular setting. In contrast, the limiting null distribution of Ψ_IPW remains normal because the IPW estimator of V(0) used in Ψ_IPW does not exactly cancel with the AIPW estimator of V(d^opt) under the null.

| Value difference estimators for a censored outcome
Let ζ_i equal 1 in (5) and be a mean-one perturbation term in (6) and (7); see Remark 2. The estimator in (7) is called a "CAIPW" estimator because it is an AIPW estimator with additional censoring augmentation terms (the "C" refers to censoring). Specifically, the second and third terms on the right-hand side of (7) are censoring augmentation terms under d^opt and d(X) ≡ 0, respectively, which can improve the efficiency of the value difference estimator by recovering information on censored but regime-consistent patients. It can be shown that the CAIPW estimator is the locally efficient estimator for the value difference (Bai et al., 2013) and has a desirable double robustness property: if either the propensity model and the model for the censoring distribution are both correctly specified, or the conditional distribution of T_i | X_i, A_i is correctly specified, then Ψ_S^CAIPW is a consistent estimator of the value difference. If the data come from a randomized trial, so that π and the censoring distribution can be correctly specified, then all of Ψ_S^IPW, Ψ_S^AIPW, and Ψ_S^CAIPW are consistent estimators of Ψ_{d^opt} (Bai et al., 2017).
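Displays (5)-(7) were also lost in extraction. As a heavily hedged schematic, not the paper's exact display and up to sign conventions (see Bai et al., 2017, for the precise form), a locally efficient value estimator for a generic regime d has the structure:

```latex
% Schematic only. C_{d,i} = 1{A_i = d(X_i)}, U_i^L = min(U_i, L),
% Delta_i^L = 1{min(T_i, L) <= C_i}, dM_c the censoring martingale increment.
\widehat{V}(d) = \frac{1}{n}\sum_{i=1}^{n}\Bigg[
  \underbrace{\frac{C_{d,i}\,\Delta_i^L\,U_i^L}
                   {\pi_d(X_i)\,\widehat{K}_c(U_i^L \mid X_i, A_i)}}_{\text{IPW term}}
  - \underbrace{\frac{C_{d,i}-\pi_d(X_i)}{\pi_d(X_i)}\,
                \widehat{m}\{0 \mid X_i, d(X_i)\}}_{\text{treatment augmentation}}
  + \underbrace{\frac{C_{d,i}}{\pi_d(X_i)}
      \int_0^{L}\frac{\widehat{m}(r \mid X_i, A_i)}
                     {\widehat{K}_c(r \mid X_i, A_i)}\,
       d\widehat{M}_c(r \mid X_i, A_i)}_{\text{censoring augmentation}}\Bigg].
```

The censoring augmentation integral is the component the text describes as recovering information on censored but regime-consistent patients; dropping it yields the AIPW estimator, and dropping both augmentations yields the IPW estimator.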
Remark 2. For reasons described in Web Appendix B of the Supporting Information, ζ_i needs to be included in (6) and (7) because, without it, the chunk-specific value difference estimators (described in Section 4.2) are likely to be exactly equal to zero even at the exceptional law, unlike for an uncensored outcome, for which this is a concern only in the regular setting.

| Estimation of nuisance functions
The value difference estimators presented in Sections 3.1 and 3.2 are not directly usable because τ(X), C_0(X), π(X), K_c(r | X_i, A_i), dM_c(r | X_i, A_i), and m(r | X_i, A_i) are usually unknown functions. For an uncensored outcome, we propose estimating C_0(X) and τ(X) using random forests (Breiman, 2001). Specifically, we use the counterfactual synthetic random forest approach described by Lu et al. (2018); refer to Web Appendix C for the justification for using this approach and for details on how this estimator is computed. While choosing a nonparametric method places fewer assumptions on the form of C_0(X) and τ(X), a concern with such methods is the "curse of dimensionality." Random forests tend to be more robust to the curse of dimensionality (they can handle larger p) than, say, kernel regression. For a censored outcome, we propose estimating C_0(X) and τ(X) using random survival forests (Ishwaran et al., 2008); refer to Web Appendix D for details. We estimate K_c(t | X, A) by fitting a Cox proportional hazards model with U as the observed time and 1 − Δ as the status. The censoring augmentation term (the part that is an integral), which is a function of nuisance functions, can be rewritten as in (8); refer to Web Appendix E for the proof. In (8), the estimator for the conditional cumulative censoring hazard, which we denote Λ̂_c(r | X, A), is obtained from the same Cox proportional hazards model used to estimate K_c(t | X, A). Finally, as discussed in Chapter 8 of Tsiatis et al. (2020), m(r | X, A) can be written as in (9). Thus, an estimator m̂(r | X_i, A_i) is obtained by replacing S(⋅ | X, A) in (9) with the random survival forest estimator Ŝ(⋅ | X, A). For both uncensored and censored outcomes, we estimate the propensity score using logistic regression.
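For the uncensored-outcome nuisances, a minimal Python sketch is below. It uses a simple T-learner (one forest per arm) as a stand-in for the counterfactual synthetic random forest of Lu et al. (2018), which the authors implement in R; the function name `estimate_nuisances` and all defaults are illustrative, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def estimate_nuisances(X, A, Y, seed=0):
    """Plug-in nuisance estimates for the uncensored-outcome case.

    A T-learner stands in for the counterfactual synthetic random
    forest: one forest per arm, with C0_hat(x) = mu0_hat(x) and
    tau_hat(x) = mu1_hat(x) - mu0_hat(x).
    """
    rf0 = RandomForestRegressor(n_estimators=500, random_state=seed)
    rf1 = RandomForestRegressor(n_estimators=500, random_state=seed + 1)
    rf0.fit(X[A == 0], Y[A == 0])   # estimates E(Y | A=0, X) = C0(X)
    rf1.fit(X[A == 1], Y[A == 1])   # estimates E(Y | A=1, X)
    ps = LogisticRegression(max_iter=1000).fit(X, A)  # propensity pi(X)
    c0_hat = lambda x: rf0.predict(x)
    tau_hat = lambda x: rf1.predict(x) - rf0.predict(x)
    pi_hat = lambda x: ps.predict_proba(x)[:, 1]
    return c0_hat, tau_hat, pi_hat
```

For a censored outcome, the analogous step would replace the regression forests with random survival forests and add a Cox model for the censoring distribution, as described in the text.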

| One-step framework
We replace the nuisance functions in the expressions for the value difference estimators in Section 3 with the estimators described in Section 4.1, and we follow the one-step estimation framework of LL16 to handle the nonregularity resulting from estimating d^opt under the exceptional law. We describe the one-step framework and associated data partitioning methods (Section 4.3) in the censored outcome context; the steps and notation for the uncensored outcome setting are analogous and are given in Web Appendix F. We first estimate the nuisance functions using a "chunk" of l_n observations. We partition the remaining data into r_n = (n − l_n)/m chunks of size m, where m ≥ 1, and l_n is chosen such that (n − l_n)/m → ∞ as n → ∞. Let C_j = {O_i : O_i ∈ jth data chunk}, so that each C_j for j = 1, …, r_n contains m observations. We also let O_0^* denote all observations in the initial data chunk of size l_n; refer to Web Figure 1. We define the chunk-specific estimators in (10) and consider the test statistic in (11). Here, Ψ_S(O) is a weighted average of the chunk-specific value difference estimators, so that Ψ_S(O) is an estimator for Ψ_{d^opt}. The σ̂_j are estimators for the standard deviation of the chunk-specific estimators (see Section 4.3). The asymptotic distribution of Ψ_S(O) can be derived using conditions similar to (C1)-(C5) in LL16. Thus, the rejection region for an asymptotically valid, α-level test is T_{•,SBT} > z_{1−α}, where z_{1−α} denotes the 1 − α quantile of the standard normal distribution. Details on how random forests may achieve the convergence rate required by (C4) and (C5) of LL16 are provided in Web Appendix G.
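The chunked one-step construction can be sketched in Python for the uncensored-outcome analogue (Web Appendix F). This is a sketch under stated assumptions, not the paper's exact statistic: `fit_nuisances` is any user-supplied estimator, σ̂_j is estimated from the historical contributions as a stand-in for the LL16 estimator, and the √m-per-chunk normalization is one plausible reading of the LL16 martingale construction.

```python
import numpy as np
from scipy.stats import norm

def aipw_contrast(Xc, Ac, Yc, c0, tau, pi):
    """Per-observation AIPW contributions to V(d_opt) - V(0),
    with d_opt plugged in as 1{tau(X) > 0} from historical fits."""
    t = tau(Xc)
    mu0 = c0(Xc)
    mu1 = mu0 + t
    p = pi(Xc)
    d = (t > 0).astype(float)                     # plug-in optimal rule
    mud = np.where(d == 1, mu1, mu0)
    pid = np.where(d == 1, p, 1 - p)
    ind = (Ac == d).astype(float)
    v_d = ind / pid * (Yc - mud) + mud            # AIPW piece for V(d_opt)
    v_0 = (Ac == 0) / (1 - p) * (Yc - mu0) + mu0  # AIPW piece for V(0)
    return v_d - v_0

def one_step_test(X, A, Y, fit_nuisances, l_n, m=10, alpha=0.05):
    """One-step chunked test sketch: nuisances at step j use only the
    'historical' data (initial chunk plus chunks 1..j-1); chunk j then
    contributes an approximately standard-normal summand."""
    n = len(Y)
    r_n = (n - l_n) // m
    z_sum = 0.0
    for j in range(1, r_n + 1):
        hi = l_n + j * m
        c0, tau, pi = fit_nuisances(X[:hi - m], A[:hi - m], Y[:hi - m])
        S = aipw_contrast(X[hi - m:hi], A[hi - m:hi], Y[hi - m:hi],
                          c0, tau, pi)
        # sigma_j from historical contributions (stand-in for LL16's estimator)
        S_hist = aipw_contrast(X[:hi - m], A[:hi - m], Y[:hi - m],
                               c0, tau, pi)
        sigma_j = max(S_hist.std(), 1e-8)
        z_sum += np.sqrt(m) * S.mean() / sigma_j
    T = z_sum / np.sqrt(r_n)
    return T, bool(T > norm.ppf(1 - alpha))       # reject if T > z_{1-alpha}
```

Refitting the nuisances at every step is expensive; in practice one would cache or update the fits incrementally.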

| Data chunking approaches
There are two ways one might partition the observed data into chunks. One way is to randomly assign observations to chunks without regard to the treatment label; this is the approach in LL16, referred to here as the sequential, blinded-treatment (SBT) approach. Another option is to assign observations to chunks such that the proportion of treated observations in each chunk preserves the marginal probability of treatment, pr(A = 1), in the study. For example, consider a randomized trial where π = 0.4 is known, and set the chunk size m = 10. Observations are first allocated to C_1 through C_{r_n} such that each of these chunks contains four treated observations and six control observations. The remaining l_n observations are assigned to C_0; although the proportion of treated observations in C_0 may not equal π, it should be close in large samples. We call this second approach the sequential, averaged-propensity matching (SAP-match) approach. For data from an observational study, the SAP-match approach can be implemented by replacing π with the sample proportion of treated observations, p̂. Thus, for each value difference estimator, there exist two versions of the test statistic: the SBT version and the SAP-match version. For time-to-event outcomes, we use the test statistics defined in Section 4.2; notation for uncensored outcomes is analogous.
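The SAP-match allocation above can be sketched in a few lines. The function name and signature are illustrative; the logic matches the worked example in the text (π = 0.4, m = 10 gives four treated and six control observations per evaluation chunk, with the leftover l_n observations forming the initial chunk C_0).

```python
import numpy as np

def sap_match_chunks(A, l_n, m, pi, rng=None):
    """Allocate indices so each of the r_n evaluation chunks of size m
    contains round(m * pi) treated and m - round(m * pi) control obs."""
    rng = np.random.default_rng(rng)
    n = len(A)
    r_n = (n - l_n) // m
    k = int(round(m * pi))                 # treated per chunk (4 if m=10, pi=0.4)
    treat = rng.permutation(np.flatnonzero(A == 1))
    ctrl = rng.permutation(np.flatnonzero(A == 0))
    chunks = [np.concatenate([treat[j * k:(j + 1) * k],
                              ctrl[j * (m - k):(j + 1) * (m - k)]])
              for j in range(r_n)]
    used = np.concatenate(chunks) if chunks else np.array([], dtype=int)
    C0 = np.setdiff1d(np.arange(n), used)  # remaining l_n obs: initial chunk
    return C0, chunks
```

For an observational study, one would pass `pi=A.mean()` in place of the known randomization probability, mirroring the p̂ substitution described in the text.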
where d̂_j^opt is a fixed estimator for the optimal rule. A similar result holds in the uncensored outcome setting.
When observations are randomly allocated to chunks, which is the case under the SBT approach, σ_j can be estimated as described in LL16. The quantity appearing in the proof of Theorem 1 is sufficiently small under our choice of m = 10.

| Real-valued, uncensored health outcome
We evaluate the performance of the four versions of the proposed test statistic (T_{IPW,SAP-match}, T_{AIPW,SAP-match}, T_{IPW,SBT}, and T_{AIPW,SBT}) by generating data from three models of the form Y_i = C_0(X_i) + A_i τ(X_i) + ε_i, where the ε_i are iid N(0, 0.25). For all three models, the A_i are Bernoulli with mean π for randomized study simulations that used the SBT approach, are allocated to preserve the treatment proportion for randomized study simulations that used the SAP-match approach, and are Bernoulli with mean π(X_i) for all observational data simulations. Across the three models, the proportion of the population with a treatment effect of exactly zero ranged from around 0.26 to 0.89. Each simulation involved 500 Monte Carlo data sets with n = 600 or 1000. In all simulations, l_n = n/2 and m = 10. We defined c to be the treatment effect among the patients who do not have a treatment effect of 0. When c ≤ 0, H_0 is true; when c > 0, H_1 is true. We considered a range of values for c, with π = {0.4, 0.5, 0.6} for the randomized study simulations and, for the observational study simulations, propensity score models that marginally yield similar probabilities. We used synthetic regression forests from the R package randomForestSRC (Ishwaran and Kogalur, 2019) to estimate C_0(X) and τ(X).
We present results under Model 1 with π = 0.5 and n = 1000, where Model 1 takes C_0(X_i) = 3.18 + 0.2X_{1i} + X_{2i} + 0.5X_{3i} and τ(X_i) = c·1(−0.5X_{2i}^2 + X_{4i} > 0). We generated X_1, X_2, X_3 as independent N(0, 1) and X_4, X_5 as independent Bernoulli with mean 0.5. Under Model 1 (scheme 1), about 58.9% of the population had a treatment effect of 0. Results under the other two models and settings are similar and can be found in Web Tables 4-15 of the Supporting Information.
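The Model 1 data-generating mechanism above translates directly into code. The sketch below follows the stated specification, using SBT-style Bernoulli(π) assignment; note that ε having variance 0.25 means a standard deviation of 0.5.

```python
import numpy as np

def generate_model1(n, c, pi=0.5, rng=None):
    """Simulate from Model 1: Y = C0(X) + A*tau(X) + eps, eps ~ N(0, 0.25)."""
    rng = np.random.default_rng(rng)
    X1, X2, X3 = rng.normal(size=(3, n))           # independent N(0, 1)
    X4, X5 = rng.binomial(1, 0.5, size=(2, n))     # independent Bernoulli(0.5)
    C0 = 3.18 + 0.2 * X1 + X2 + 0.5 * X3
    tau = c * ((-0.5 * X2**2 + X4) > 0)            # c * 1{-0.5*X2^2 + X4 > 0}
    A = rng.binomial(1, pi, size=n)                # SBT randomized assignment
    Y = C0 + A * tau + rng.normal(scale=0.5, size=n)
    X = np.column_stack([X1, X2, X3, X4, X5])
    return X, A, Y, tau
```

As a sanity check, pr{τ(X) = 0} = 1 − pr(X_4 = 1)·pr(X_2^2 < 2) ≈ 0.58, consistent with the "about 58.9%" reported for this model.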
Any efficiency gains that may result from using T_AIPW instead of T_IPW are negated by the random perturbation term in the specification of T_AIPW. By comparing Model 1a with Model 1b (Table 2 and Web Tables 18 and 24 with Web Table 19) and Model 2a with Model 2b (Web Tables 21 and 25 with Web Tables 22 and 26), one can see that as the percentage of censoring increases, the power of the test decreases. Finally, as in the uncensored outcome setting, the proposed test remains valid even as p increases from 5 to 25 (see Web Table 17). In Web Appendix L, we examine, in both uncensored and censored data scenarios, how the choice of m may affect the power of the proposed test via simulations with m = {2, 5, 10, 20}. The results in Web Table 29 show that power is insensitive to the choice of m.

| APPLICATION TO A PHASE III TRIAL
In the Phase III trial referenced in Lipkovich et al. (2017), 599 patients with hematological malignancies were randomized to either experimental therapy plus best supportive care (active treatment, n = 303) or best supportive care alone (control, n = 296). The hazard ratio (treatment vs. control) for overall survival was borderline significant (hazard ratio = 0.85, one-sided p = 0.0367). We use the proposed one-step value difference test to test whether at least some subgroup benefits from the active treatment; specifically, we test whether at least some subgroup has improved restricted mean survival time under active treatment, compared to control. We considered the same 14 covariates described in Lipkovich et al. (2017); refer to Web Appendix J for a list of the covariates. Eight patients had an unknown prognostic score for myelodysplastic syndromes risk assessment, which was one of the covariates collected; those eight patients were excluded from the proposed test. We chose L = 510 days such that about 90% of the U_i are observed by L. This is similar to how we chose L in our simulations, as it prevents K̂_{c,j}(U_i ∧ L | X_i, A_i) from being too close to zero.
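The rule for choosing L amounts to taking roughly the 90th percentile of the observed times, so that the estimated censoring survival function evaluated at U ∧ L stays away from zero. A minimal sketch (function name illustrative):

```python
import numpy as np

def choose_L(U, frac=0.90):
    """Pick the truncation time L so that about `frac` of the observed
    times U_i = min(T_i, C_i) fall at or below L."""
    return float(np.quantile(U, frac))
```

Applied to the trial data, this kind of rule yields L = 510 days as reported in the text.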
We let l_n = 291 and m = 10, and we randomly allocated the observed data to each of the 31 chunks. Results based on T_{IPW,SBT}^S, T_{AIPW,SBT}^S, and T_{CAIPW,SBT}^S are presented in Table 4. Using the CAIPW estimator, we find evidence (α = 0.05) that at least some patients have a larger RMST under active treatment versus control; however, we do not find such evidence when using the IPW or AIPW estimator. This emphasizes the importance of using the CAIPW estimator, especially when the data are heavily censored; for these data, approximately 83% of the event times were censored. Given that there is evidence of treatment benefit for at least some patients (based on the CAIPW estimator), the next step would be to identify the subgroup(s) and/or estimate an optimal treatment regime. While not the focus of this paper, this can be done using a variety of approaches; for example, Laber and Zhao (2015), Tao et al. (2018), and Sun and Wang (2021) propose various tree-based methods for estimating optimal treatment rules. Refer to Web Appendix K (and corresponding outputs in Web Figures 2 and 3 and Web Table 28) for an additional example, this time involving an uncensored endpoint.
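One simple way to carry out the follow-up subgroup-identification step is to fit a shallow classification tree to the predicted-benefit labels 1{τ̂(X) > 0}, in the spirit of the classification-tree summary shown for the ACTG175 analysis (Web Figure 2). The sketch below uses scikit-learn's `DecisionTreeClassifier` as a stand-in for whatever tree routine the authors used; the function name is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subgroup_tree(X, tau_hat, max_depth=3):
    """Summarize the estimated benefit region with a shallow tree.

    `tau_hat` holds CATE predictions; the tree approximates the set
    {x : tau_hat(x) > 0} with a few interpretable covariate splits.
    """
    labels = (tau_hat > 0).astype(int)   # predicted-benefit membership
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, labels)
    return tree
```

The fitted splits (via `sklearn.tree.export_text`, for example) then give an interpretable description of the candidate subgroup, which can be reported alongside the test result.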
| DISCUSSION
We have proposed a one-step value difference test for the existence of a subgroup with a beneficial treatment effect that yields root-n-rate inference for Ψ_{d^opt} = V(d^opt) − V(0). The test is applicable when the outcome of interest is an uncensored, continuous variable or a right-censored time-to-event. While our implementation uses random forests to estimate C_0(X) and τ(X), alternative regression estimation techniques can be used provided (C4) and (C5) of LL16 are met. One of the benefits of using random forests is that they are more robust to the "curse of dimensionality" than some other nonparametric estimation methods. Extension of the proposed framework to accommodate more than two treatments may be feasible. For example, if two active treatments ("active 1" and "active 2") are to be compared to a control, one may consider redefining the test statistic as the maximum of two value difference test statistics: one based on a value difference estimator for active 1 versus control and the other based on a value difference estimator for active 2 versus control. Furthermore, building off the ideas in Zhang et al. (2013) for the uncensored outcome setting, and Hager et al. (2018) and Jiang et al. (2017) for the censored outcome setting, one may consider extending the proposed subgroup test to multiple treatment stages. These are areas of future work.

TABLE 2
Results under Model 1b.

and m(r | X_i, A_i), respectively, based on the "historical" data in O_{j−1}^*. At step j, these estimators are known, fixed functions. For brevity, expressions for S_{j,IPW}^S(O_i, O_{j−1}^*) and S_{j,AIPW}^S(O_i, O_{j−1}^*) are not given here, but they are analogous to (10).

denotes the SBT estimator. The form of the test statistic in (11) under the SAP-match chunking method is T_{•,SAP-match}, formed using only the data in O_{j−1}^*.
Mean (μ), standard deviation (σ), and power or type I error (1 − β / α) of the one-step value difference test statistic, based on 500 simulated data sets of n = {600, 1000}. Subscripts IPW, AIPW, and CAIPW specify whether the results are based on the corresponding value difference estimator. The nominal level α is 0.05.

Web Table 28. Results from application to ACTG175 data.
Web Table 29. Results are based on 500 simulated data sets.
By comparing Table 2 with Table 3 (Web Tables 19, 22, and 26 with Web Tables 20, 23, and 27), one can see that the power gain from using T_CAIPW instead of either T_IPW or T_AIPW becomes more pronounced as the amount of censoring increases.

TABLE 4
Results from testing the hypothesis in (2) on the Phase III trial data.
Biometrics. Author manuscript; available in PMC 2023 December 04.