Estimating optimal individualized treatment rules with multistate processes

Multistate process data are common in studies of chronic diseases such as cancer. These data are ideal for precision medicine purposes as they can be leveraged to improve more refined health outcomes, compared to standard survival outcomes, as well as incorporate patient preferences regarding quantity versus quality of life. However, there are currently no methods for the estimation of optimal individualized treatment rules with such data. In this paper, we propose a nonparametric outcome weighted learning approach for this problem in randomized clinical trial settings. The theoretical properties of the proposed methods, including Fisher consistency and asymptotic normality of the estimated expected outcome under the estimated optimal individualized treatment rule, are rigorously established. A consistent closed-form variance estimator is provided and methodology for the calculation of simultaneous confidence intervals is proposed. Simulation studies show that the proposed methodology and inference procedures work well even with small sample sizes and high rates of right censoring. The methodology is illustrated using data from a randomized clinical trial on the treatment of metastatic squamous-cell carcinoma of the head and neck.


Introduction
Modern precision medicine acknowledges the heterogeneity of the majority of human diseases and aims to develop and deliver therapies that are tailored to the individual patient.
At the heart of these efforts is the development of data-driven individualized treatment assignment rules (Kosorok and Laber, 2019). The purpose of such rules is to provide the right treatment to a given patient (Lipkovich et al., 2017) and, thereby, to improve health outcomes among patients overall. Over the past decade there has been a large number of new statistical methods for the estimation of optimal individualized treatment rules (ITRs) with various types of outcomes, including continuous, binary, survival, and competing risks outcomes. However, to the best of our knowledge, there are currently no methods for the estimation of optimal ITRs with stochastic processes that evolve through multiple discrete states over (continuous) time, also known as multistate processes. Such processes are commonly encountered in studies of chronic diseases, such as cancer and HIV infection, where disease evolution is often characterized by multiple discrete health states. An example is the SPECTRUM trial (Vermorken et al., 2013), a phase III randomized clinical trial on recurrent or metastatic squamous-cell carcinoma of the head and neck, where patient event history was characterized by the states "tumor response" (i.e., tumor shrinkage per Therasse et al. (2000)), "disease progression", and "death". This paper addresses the issue of optimal ITR estimation with right-censored multistate processes data in randomized clinical trial settings.
Multistate process data are ideal for precision medicine purposes as they can be leveraged to improve more refined health outcomes (compared to, e.g., overall or progression-free survival) as well as incorporate patient preferences. Typically, such data contain information about one or more transient health states (e.g., tumor response) that is not fully captured by standard survival or competing risks data. Therefore, they provide more comprehensive information about more refined health outcomes that may reflect both quantity and quality of life. Improvement of such outcomes may be more desirable than mere life extension to both patients and clinicians. In oncology, for example, such a desirable health outcome is sustained tumor response, which is associated with better quality of life, extended progression-free survival (PFS) time, and a prolonged treatment-free interval (Kaufman et al., 2017). Tumor response was also a desirable outcome in the SPECTRUM trial. Another important outcome is extended quality-adjusted lifetime, which can be defined as the weighted sum of the times spent in a set of desirable health states (Pelzer et al., 2017; Oza et al., 2020). Quality-adjusted lifetime can be defined in multiple different ways according to patient preferences by using different weighting schemes that reflect the needs of different groups of patients. In this way, individual priorities towards quality versus quantity of life can be taken into account.
The methods for optimal ITR estimation can be classified into two broad categories: (i) backward induction methods such as Q-learning (Murphy, 2005; Zhao et al., 2009) and A-learning (Murphy, 2003; Robins, 2004) and (ii) direct-search methods, also known as policy-search methods, such as outcome weighted learning (Zhao et al., 2012) and value search estimation (Zhang et al., 2012). The first class of methods typically estimates the optimal ITR by modeling either the conditional expectation of the outcome given the covariates (Q-learning) or the interactions between treatment and the covariates (A-learning). The second class of methods estimates the optimal ITR directly by optimizing an appropriate objective function. Given that imposing a realistic model for the conditional mean outcome or the treatment interactions is difficult for general multistate processes, the focus in this article is on direct-search methods. The issue of ITR estimation with survival outcomes has been well addressed in the literature. Zhao et al. (2015) extended the outcome weighted learning framework of Zhao et al. (2012) to situations where the outcome of interest is a right-censored survival time by proposing (i) an inverse censoring weighted outcome (ICO) and (ii) a doubly robust (DR) outcome weighted learning approach. Estimation in these approaches relies on a weighted version of support vector machines (Cortes and Vapnik, 1995). A similar method for ITR estimation with survival outcomes was proposed by Bai et al. (2017). A value search estimation approach for maximizing the t-year survival probability was developed by Jiang et al. (2017). This method approximates the nonsmooth objective function of the problem by a smooth one using kernel smoothing. Recently, there has also been some interest in the issue of ITR estimation with competing risk outcomes (Zhou et al., 2021; He et al., 2021).
Nevertheless, the methods for survival and competing risk outcomes cannot be used for the estimation of optimal ITRs with right-censored multistate processes.
In this work, we extend the outcome weighted learning framework (Zhao et al., 2012, 2015) to deal with situations where the outcome of interest is an arbitrary right-censored multistate process, incorporating also patient preferences. The proposed method does not impose Markov assumptions or model assumptions on the multistate process of interest. The novelty of this paper is twofold. First, we devise a novel objective function which, in contrast to the ICO approach by Zhao et al. (2015), utilizes information from the censored cases. Importantly, this is achieved without imposing and estimating a model for the conditional expectation of the outcome given treatment and the covariates, as opposed to the DR approach by Zhao et al. (2015). Second, in addition to showing Fisher and universal consistency of the proposed method, we establish the asymptotic distribution of the proposed estimator for the expected outcome under the estimated optimal ITR and derive a consistent closed-form variance estimator. Based on our theoretical results, we also propose a method for the calculation of simultaneous confidence intervals over a set of patient preferences to account for multiplicity. The simulation studies provide evidence that the proposed estimator and inference procedures work well even with small sample sizes and under high rates of right censoring. Furthermore, the simulation studies reveal that the proposed method performs better than the previously proposed ICO and DR approaches for censored failure times.

Consider a multistate process {X(t) : t ∈ [0, τ]} with finite state space {1, . . . , S}, where τ < ∞ is the maximum observation time. Let Z ∈ Z ⊂ R^p denote a vector of baseline covariates that may be related to the effect of treatment on the multistate process of interest. For simplicity, the treatment A is considered to be a binary variable taking its values in the treatment set {−1, 1}. With multistate processes, the goal of treatment is typically to prolong the time spent in a set of desirable health states.
Given that some health states may be more desirable than others and that patient preferences may vary, we define the benefit processes {Y_w(t) = w′X̃(t) : t ∈ [0, τ]}, w ∈ W, where X̃(t) = (I{X(t) = 1}, . . . , I{X(t) = S})′, w is an S-dimensional vector of preference weights that satisfies 0 ≤ w ≤ 1 and whose elements sum to a positive number less than S, and W = {w_1, . . . , w_M} is a prespecified finite set of preference weight vectors that reflect different patient preferences/priorities. Based on the latter processes, we define the utilities

∫_0^τ Y_w(t) dm(t), w ∈ W, (1)

where the integrator function m(t) = t induces the Lebesgue measure on the Borel σ-algebra of [0, τ]. Since ∫_0^τ I{X(t) = j} dm(t) is the time spent in the jth state during the time interval [0, τ], the utilities (1) represent weighted sums of the (restricted) times spent in each state. For example, consider the states "initial disease state" (state 1), "tumor response" (state 2), and "disease progression or death" (state 3), in the setting of the SPECTRUM trial mentioned in the Introduction. A potential choice for the set of preference weights in this example is W = {(0, 1, 0)′, (0.5, 1, 0)′, (1, 1, 0)′}. When w = (0, 1, 0)′, the utility ∫_0^τ Y_w(t) dm(t) = ∫_0^τ I{X(t) = 2} dm(t) corresponds to the restricted time spent in the tumor response state. The utility under the choice w = (0.5, 1, 0)′ corresponds to a restricted quality-adjusted lifetime. This utility represents the restricted (quality-adjusted) time lived, where the time spent in the initial disease state (i.e., without tumor response) is reduced by 50% to reflect a quality-of-life loss due to disease symptoms and/or side effects of treatment continuation. When w = (1, 1, 0)′, the utility is the (restricted) PFS time, in which the times lived in the initial disease state and the tumor response state are equally important. Different choices of w reflect different patient priorities towards quality versus quantity of life.
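As a concrete illustration of the utilities (1), the sketch below computes the weighted time spent in each state for a piecewise-constant trajectory; the function name and data layout are our illustrative choices, not notation from the paper.

```python
def utility(jump_times, states, w, tau):
    """Weighted time in states: sum_j w_j * (time spent in state j on [0, tau]).

    jump_times[i] is when the subject enters states[i] (1-based state
    labels); the trajectory is piecewise constant.  Illustrative helper."""
    total = 0.0
    for i, s in enumerate(states):
        start = jump_times[i]
        end = jump_times[i + 1] if i + 1 < len(jump_times) else tau
        if start >= tau:
            break
        total += w[s - 1] * (min(end, tau) - start)
    return total

# Trajectory: state 1 on [0, 1), response (state 2) on [1, 2.5),
# progression/death (state 3) afterwards; tau = 3.
time_in_response = utility([0.0, 1.0, 2.5], [1, 2, 3], (0, 1, 0), 3.0)  # 1.5
restricted_pfs = utility([0.0, 1.0, 2.5], [1, 2, 3], (1, 1, 0), 3.0)    # 2.5
```

With w = (0.5, 1, 0)′ the same call returns the restricted quality-adjusted lifetime 0.5·1 + 1·1.5 = 2.0 for this trajectory.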
In practice, the multistate and benefit processes are not fully observed for all individuals due to the usual right censoring. Let C denote the right censoring time and T* the time of arrival at an absorbing state (i.e., death). Letting T = T* ∧ τ, with a ∧ b = min(a, b) for any a, b ∈ R, T̃_i = T_i ∧ C_i, and Δ_i = I(T_i ≤ C_i), the observed data consist of independent and identically distributed copies of {T̃_i, Δ_i, A_i, Z_i}, i = 1, . . . , n, together with the censored version of the multistate process, which is equal to 0 for the censored individuals after their censoring time. Note that Y_{i,w}(t) = w′(I{X_i(t) = 1}, . . . , I{X_i(t) = S})′ is the benefit process for the ith individual.

Optimal ITR estimation with multistate processes
An ITR is a deterministic function d : Z → {−1, 1} which suggests treatment choice d(z) ∈ {−1, 1} for a patient with covariates Z = z. Since the estimation of an optimal ITR is essentially a causal inference problem, we utilize the potential outcomes causal framework (Rubin, 2005;Tsiatis et al., 2019). Let Y * w (t; 1) and Y * w (t; −1), w ∈ W, denote the potential benefit processes under treatment choices 1 and −1, respectively, at time t ∈ [0, τ ]. Since the potential outcomes cannot be directly observed in real-world settings, we need to impose the following causal assumptions.
Assumptions A2 and A3 are automatically satisfied in randomized clinical trials. It must be highlighted that assumption A1 implies that the observed benefit process Y_w(·), unlike the potential benefit processes Y*_w(·; 1) and Y*_w(·; −1), is associated with treatment A. Now, we can define the potential benefit processes under an ITR d as Y*_w(t; d) = Y*_w(t; 1)I{d(Z) = 1} + Y*_w(t; −1)I{d(Z) = −1}, t ∈ [0, τ]. These processes are essential for defining potential benefit under an ITR for multistate processes. For such processes, we define the value functions as V_w(d) = E[∫_0^τ Y*_w(t; d) dm(t)], w ∈ W, where V_w(d) is the expected sum of the weighted (under w) times spent in each state of the process during the time interval [0, τ], under the ITR d. Under assumptions A1–A3, the independent right censoring assumption, and the fact that P(C > t) = exp{−Λ_0(t)} for a continuous censoring distribution, it can be shown that the value functions can be expressed in terms of the observable data, with individual contributions weighted inversely by the treatment probability and the censoring survival probability, where Λ_0(t) is the cumulative hazard function of the right censoring variable C at time t.
An optimal ITR d*_w, w ∈ W, is a maximizer of the corresponding value function, i.e., d*_w ∈ argmax_d V_w(d).
Given that any rule d : Z → {−1, 1} can be represented as d(z) = sgn{f(z)} for some measurable function f : Z → R, maximizing the value function V_w(d) is equivalent to minimizing a risk R_w(f), a weighted misclassification error in which each individual is weighted by the inverse-probability-weighted utility, over all measurable functions f : Z → R. It is not hard to see that this equivalence holds for any measurable f. Minimizing R_w(f) over all measurable functions is clearly not feasible. Therefore, we will consider minimization over a subset of the class of all measurable functions f : Z → R. In particular, we consider either the class of linear functions or a reproducing kernel Hilbert space (RKHS) with the Gaussian kernel.
Minimizing the empirical version of R_w(f), over the chosen class F, is challenging because it involves a discontinuous and nonconvex function of f. To alleviate this, we follow the paradigm of outcome weighted learning (Zhao et al., 2012, 2015) and support vector machines (Steinwart and Christmann, 2008) and utilize the surrogate risk R_{φ,w}(f), obtained by replacing the misclassification indicator with the hinge loss φ(x) = max(1 − x, 0), which is convex in f. The cumulative hazard function Λ_0 of the right censoring distribution can be estimated using the nonparametric Nelson–Aalen estimator Λ̂_n. An obvious estimator of π_0 = P(A = 1) is π̂_n = n^{−1} Σ_{i=1}^n I(A_i = 1). Thus, we obtain the empirical version R̂_{φ,w} of the surrogate risk, given in (2). Note that, even though π_0 is known in clinical trials, the estimate π̂_n is used in (2) as this typically leads to some efficiency gain in inverse probability weighting type estimators (Tsiatis et al., 2019). The empirical surrogate risk for the ICO estimator for censored failure times (e.g., T̃) by Zhao et al. (2015), adapted to the setting considered here, incorporates inverse censoring weighting in the censored time of interest and discards the censored observations (Δ_i = 0). In contrast, the proposed empirical surrogate risk (2) utilizes information from both uncensored and censored cases by incorporating inverse censoring weighting in the underlying stochastic process {Y_w(t) : t ∈ [0, τ]}. Importantly, this is achieved without imposing and estimating a model for the conditional expectation of the time of interest given A and Z, as opposed to the DR approach (Zhao et al., 2015).
Simulation studies summarized in Section 4 reveal that these characteristics of the proposed method lead to better performance than the ICO and DR estimators (Zhao et al., 2015) for censored failure times.
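The Nelson–Aalen estimator of the censoring cumulative hazard Λ_0 that enters the weights can be sketched as follows; this is a minimal implementation with our own variable names, where the censoring "events" are the observations with Δ = 0.

```python
import numpy as np

def nelson_aalen_censoring(t_tilde, delta, t_eval):
    """Nelson-Aalen estimate of the cumulative hazard of the censoring
    time C, evaluated at each point of t_eval.  delta = I(T <= C), so
    censoring events are the observations with delta == 0.  Sketch only."""
    t_tilde = np.asarray(t_tilde, float)
    is_cens = np.asarray(delta) == 0
    cens_times = np.sort(np.unique(t_tilde[is_cens]))
    # cumulative sum of (number censored at t) / (number at risk at t)
    haz_vals = np.cumsum([
        np.sum(is_cens & (t_tilde == t)) / np.sum(t_tilde >= t)
        for t in cens_times
    ])
    return np.array([
        haz_vals[cens_times <= te][-1] if np.any(cens_times <= te) else 0.0
        for te in np.atleast_1d(t_eval)
    ])

# Inverse-probability-of-censoring weight at time t: exp(Lambda_hat(t)),
# since P(C > t) = exp(-Lambda_0(t)) under independent continuous censoring.
```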
Similarly to Zhao et al. (2012) and Zhao et al. (2015), the estimators of the optimal decision functions within the class F are obtained using penalized minimization of R̂_{φ,w} as f̂_{n,w,λ_n} = argmin_{f ∈ F} {R̂_{φ,w}(f) + λ_n ∥f∥²}, where λ_n is a positive tuning parameter that controls the complexity of f and ∥·∥ is a norm on the chosen class F. For example, ∥·∥ is the Euclidean norm if F is the class of linear functions. For notational simplicity we omit the subscript λ_n from the estimated decision functions and use the more compact notation f̂_{n,w}. Based on the estimated decision functions {f̂_{n,w} : w ∈ W}, the estimated optimal ITRs are d̂_{n,w}(z) = sgn{f̂_{n,w}(z)}, w ∈ W, z ∈ Z.
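A minimal numerical sketch of this penalized minimization, for a linear decision function f(z) = b0 + β′z, is plain subgradient descent on the weighted hinge risk. The optimizer, step sizes, and weight construction below are our illustrative choices, not the paper's implementation; U_i stands in for the (inverse-censoring-weighted) utility.

```python
import numpy as np

def owl_linear(Z, A, U, pi_hat, lam, n_iter=2000, lr=0.05):
    """Minimize n^{-1} sum_i W_i max(0, 1 - A_i f(Z_i)) + lam ||beta||^2
    by subgradient descent, with f(z) = b0 + beta'z and inverse-
    probability-of-treatment weights W_i = U_i / P(A = A_i).  Sketch."""
    n, p = Z.shape
    W = U / np.where(A == 1, pi_hat, 1.0 - pi_hat)
    beta, b0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margin = A * (Z @ beta + b0)
        active = margin < 1.0  # points with nonzero hinge subgradient
        g_beta = -((W * active * A) @ Z) / n + 2.0 * lam * beta
        g_b0 = -np.sum(W * active * A) / n
        beta -= lr * g_beta
        b0 -= lr * g_b0
    return b0, beta
```

The estimated rule is then d̂(z) = sgn(b0 + β′z).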
Based on the class of estimated ITRs {d̂_{n,w} : w ∈ W}, treatment assignment for a given patient utilizes d̂_{n,w} for the weight vector w ∈ W that is closest, with respect to the Euclidean norm, to the preference weight w_0 that reflects the patient's own preferences/priorities.
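This nearest-weight lookup is straightforward; a sketch with our own helper name:

```python
import numpy as np

def closest_weight(w0, weight_set):
    """Return the vector in the prespecified set W that is closest to
    the patient's own preference vector w0 in Euclidean norm."""
    weight_set = [np.asarray(w, float) for w in weight_set]
    dists = [np.linalg.norm(np.asarray(w0, float) - w) for w in weight_set]
    return weight_set[int(np.argmin(dists))]

# A patient who values progression-free time almost as much as time in
# response would be matched to w = (1, 1, 0)':
W_set = [(0, 1, 0), (0.5, 1, 0), (1, 1, 0)]
w = closest_weight((0.8, 1, 0), W_set)
```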

Estimation of the value function
The value function of an ITR d can be estimated by V̂_{n,w}(d), the empirical inverse-probability-weighted analogue of V_w(d). The value V̂_{n,w}(d) can be seen as the (estimated) performance of the ITR d under the preference weight vector w. The estimated value of the estimated optimal ITR, V̂_{n,w}(d̂_{n,w}), is expected to be an over-optimistic estimate of the performance of d̂_{n,w} in a future (out-of-sample) patient when the sample size is small relative to p or to the complexity of F. This is because the estimator V̂_{n,w}(d̂_{n,w}) uses the same set of data {D_i : i = 1, . . . , n} for both the estimation of the optimal ITR and the evaluation of its performance. This phenomenon is similar to the behaviour of the training error in support vector machines (Hastie et al., 2009). A better estimate of V_w(d̂_{n,w}) in finite samples is expected to be the jackknife, or leave-one-out cross-validation, value estimator V̂^{jk}_{n,w}(d̂_{n,w}), in which d̂^{(−i)}_{n,w} is the optimal ITR estimated under the preference weight vector w using all but the data of the ith individual.
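The jackknife value estimator admits a simple generic implementation; here `fit_rule` and `value_contrib` stand in for the actual ITR estimation and per-subject value contribution, which we deliberately leave abstract.

```python
import numpy as np

def jackknife_value(data, fit_rule, value_contrib):
    """Leave-one-out value estimate: for each subject i, fit the rule on
    all data except i and evaluate subject i's value contribution under
    that rule; average the n contributions.  Generic sketch."""
    n = len(data)
    contribs = np.empty(n)
    for i in range(n):
        rule = fit_rule([d for j, d in enumerate(data) if j != i])
        contribs[i] = value_contrib(data[i], rule)
    return contribs.mean()
```

For instance, with `fit_rule` returning the mean of the remaining observations and `value_contrib(x, r) = x - r`, the leave-one-out residuals of [1, 2, 3, 4] average exactly to 0.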

Theoretical Properties
The first theorem justifies the use of the surrogate risk R_{φ,w}, instead of the original risk R_w, for the estimation of the optimal ITR d*_w, w ∈ W.
Theorem 1 (Fisher consistency). Suppose that assumptions A1–A3 and condition C1 in Appendix A hold. Then, for any w ∈ W, if f̄_w minimizes R_{φ,w} over all measurable functions, then d*_w(z) = sgn{f̄_w(z)} for all z ∈ Z.
The proof of Theorem 1 can be found in Appendix A.1. The next theorem ensures that R_{φ,w}(f̂_{n,w}), which is the true surrogate risk of the estimated decision function f̂_{n,w}, converges (in probability) to the minimal surrogate risk over the chosen class F. It also asserts that, if the chosen class F is appropriate, then the proposed estimator d̂_{n,w} is universally consistent, i.e., its value converges (in probability) to the optimal value V_w(d*_w).

Theorem 2. Suppose that assumptions A1–A3 and conditions C1, C3, and C4 in Appendix A hold. Then, for λ_n > 0 with λ_n → 0 and nλ_n → ∞, as n → ∞,

R_{φ,w}(f̂_{n,w}) → inf_{f ∈ F} R_{φ,w}(f) in probability,

for any distribution P of the data D. Moreover, if (i) F is the space of linear functions and f*_w ∈ F or (ii) F is the RKHS with the Gaussian kernel and the marginal distribution µ of Z is regular, then

V_w(d̂_{n,w}) → V_w(d*_w) in probability.

Theorem 2 is proved in Appendix A.2. When F is the space of linear functions, universal consistency requires that the optimal decision function f*_w is linear, i.e., f*_w ∈ F. This requirement can be made more plausible by considering an enlarged covariate space Z̃ that includes polynomial terms and/or two-way interaction terms between the original covariates Z. If f*_w ∉ F, V_w(d̂_{n,w}) is expected to converge to a value that is lower than the optimal value V_w(d*_w). Nevertheless, the limit of V_w(d̂_{n,w}) can be seen as an approximation to the optimal value V_w(d*_w), by the first statement of Theorem 2 and the fact that R_w(f*_w) ≤ R_w(f) ≤ R_{φ,w}(f) for any f ∈ F, since the hinge loss satisfies φ(x) ≥ I(x < 0) for all x ∈ R. Interestingly, when F is the RKHS with the Gaussian kernel and the marginal distribution of Z is regular, the estimated ITR is always universally consistent. However, the so-called no-free-lunch theorem (Steinwart and Christmann, 2008) implies that the corresponding rate of convergence can be extremely slow for at least some distributions of the data. This means that an extremely large sample size may be required in practice in order to obtain an ITR d̂_{n,w} with a value reasonably close to the optimal value. For this reason, we restrict our attention to the case where F is the space of linear functions in the remainder of this paper.
The next theorem characterizes the asymptotic distribution of the estimated value function of a fixed decision function f.

Theorem 3. Under assumptions A1–A3 and conditions C1, C2, and C4 in Appendix A, we have that G_{n,w}(f) := √n{V̂_{n,w}(sgn(f)) − V_w(sgn(f))} = n^{−1/2} Σ_{i=1}^n ψ_{i,w}(f) + o_p(1), where ψ_{i,w}(f) is the influence function of the ith observation.

The proof of Theorem 3 is given in Appendix A.3. This theorem implies that, for any measurable decision function f, G_{n,w}(f) is asymptotically normal with mean zero and variance σ²_w(f) = Eψ²_{1,w}(f). This asymptotic variance can be consistently (in probability) estimated by σ̂²_w(f) = n^{−1} Σ_{i=1}^n ψ̂²_{i,w}(f), where ψ̂_{i,w}(f), i = 1, . . . , n, are the empirical versions of the influence functions. The latter are obtained by replacing expectations by sample averages and unknown parameters by their consistent estimates in ψ_{i,w}(f). The formula for the empirical influence function is provided in Appendix B.
The final theorem characterizes the asymptotic distribution of the estimated value function ofd n,w = sgn(f n,w ) and provides the basis for the calculation of simultaneous confidence intervals over the preference weight set W = {w 1 , . . . , w M }.
Theorem 4. Suppose that F is the space of linear functions. Then, under assumptions A1–A3, conditions C1–C5 in Appendix A, the additional assumption that the optimal linear decision function f̄_w(z) satisfies P(f̄_w(Z) = 0) = 0, w ∈ W, and for λ_n > 0 with λ_n → 0 and nλ_n → ∞, we have that

G_{n,w}(f̂_{n,w}) = G_{n,w}(f̄_w) + o_p(1).

The proof of Theorem 4 is given in Appendix A.4. The last theorem and Theorem 3 imply that G_{n,w}(f̂_{n,w}) is asymptotically normal with zero mean and variance σ²_w(f̄_w). This variance can be consistently (in probability) estimated by σ̂²_w(f̂_{n,w}) = n^{−1} Σ_{i=1}^n ψ̂²_{i,w}(f̂_{n,w}). This result can be used for the calculation of (pointwise) confidence intervals and for hypothesis testing regarding the performance V_w(sgn(f̂_{n,w})) of the estimated ITR under the preference weight w. Theorems 3 and 4 can also be used for simultaneous inference. Define the vector G_n := (G_{n,w_1}(f̂_{n,w_1}), . . . , G_{n,w_M}(f̂_{n,w_M}))′ and let ∥x∥_∞ := max_{1≤l≤M} |x_l| denote the maximum norm of a vector x = (x_1, . . . , x_M)′ ∈ R^M. Theorems 3 and 4, the Cramér–Wold theorem, and the continuous mapping theorem imply that, for any fixed matrix Q (of appropriate dimension), QG_n converges in distribution to QG, where G ∼ N(0, Ω). The covariance matrix Ω can be consistently (in probability) estimated by the matrix Ω̂_n with elements ω̂_{l,l′} = n^{−1} Σ_{i=1}^n ψ̂_{i,w_l}(f̂_{n,w_l}) ψ̂_{i,w_{l′}}(f̂_{n,w_{l′}}). The last result implies that 1 − α simultaneous confidence intervals over W, which account for the multiplicity due to considering multiple preference weights, can be calculated based on c_α, the 1 − α percentile of the distribution of ∥QG∥_∞. This percentile can be easily estimated using the following simulation procedure. First, choose a large number B (say B = 1000) and simulate vectors G_b ∼ N(0, Ω̂_n), for b = 1, . . . , B. Then, c_α can be estimated as the empirical 1 − α percentile of the sample ∥QG_1∥_∞, . . . , ∥QG_B∥_∞. Simultaneous confidence intervals that also involve V_w(1) and V_w(−1), the value functions for the fixed rules d(z) := 1 and d(z) := −1, can be calculated similarly by expanding the vector G_n to include G_{n,w}(1) and G_{n,w}(−1), w ∈ W, and using an appropriate matrix Q. This is illustrated in the data application presented in Section 5. We argue that the same inference procedures can also be used for the jackknife value estimator V̂^{jk}_{n,w}(sgn(f̂_{n,w})), w ∈ W. This statement is justified numerically in the simulation studies.
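The simulation step for estimating this critical value can be sketched as follows; the covariance estimate `omega_hat`, contrast matrix `Q`, and seed are inputs, and the function name is ours.

```python
import numpy as np

def sup_norm_critical_value(omega_hat, Q, alpha=0.05, B=1000, seed=0):
    """Monte Carlo estimate of the 1 - alpha percentile of ||Q G||_inf,
    with G ~ N(0, omega_hat), as used for simultaneous confidence
    intervals over the preference weight set.  Sketch only."""
    rng = np.random.default_rng(seed)
    M = omega_hat.shape[0]
    # draw B multivariate normal vectors and take the sup-norm of Q G
    G = rng.multivariate_normal(np.zeros(M), omega_hat, size=B)
    stats = np.max(np.abs(G @ Q.T), axis=1)
    return float(np.quantile(stats, 1.0 - alpha))
```

With a single weight (M = 1, Q the identity, unit variance) this reduces to the 95th percentile of |N(0, 1)|, about 1.96, recovering the pointwise interval as a special case.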

Simulation Studies
The finite sample performance of the proposed methods was evaluated in a series of simulation experiments. Specifically, we assessed the performance of (i) the proposed ITR estimator d̂_{n,w} and (ii) the proposed inference methods for the (true) value of the estimated ITR V_w(d̂_{n,w}). We considered a binary treatment variable A ∈ {−1, 1}, a two-dimensional covariate vector Z = (Z_1, Z_2)′, and a multistate process {X(t) : t ∈ [0, τ]} under a progressive illness-death model with state space S = {1, 2, 3}. State 1 represented the initial disease state, state 2 the tumor response state, and state 3 the disease progression or death state, in a hypothetical oncology trial. We considered the preference weights w_1 = (0, 1, 0)′ and w_2 = (1, 1, 0)′, which correspond to the (restricted) mean duration of tumor response and the PFS time, respectively. Treatment was simulated with P(A = 1) = 0.5, while the covariates were simulated from the uniform distribution U(−1, 1). The multistate process was simulated, conditionally on A and Z, based on transition intensities depending on A and Z. In total, we considered four scenarios according to the form of the optimal decision function. The right censoring time was simulated independently of the multistate process from the exponential distribution Exp(θ), with θ ∈ {−1.6, −1, −0.4}, and the total duration of the study was set to τ = 3. These choices for θ and τ led on average to 28.4%, 42.8%, and 59.5% right-censored observations, respectively.
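A generic way to simulate one path of such a progressive illness-death process is via competing exponential clocks. The constant intensities below are hypothetical stand-ins: in the actual simulation design the intensities depend on A and Z.

```python
import numpy as np

def simulate_illness_death(lam12, lam13, lam23, tau, rng):
    """One path of states 1 (initial) -> 2 (response) and/or
    3 (progression/death, absorbing) with constant transition
    intensities, observed on [0, tau].  Returns entry times and states."""
    t, state = 0.0, 1
    jumps, states = [0.0], [1]
    while state != 3 and t < tau:
        if state == 1:
            total = lam12 + lam13
            t += rng.exponential(1.0 / total)
            # next state chosen proportionally to the competing intensities
            state = 2 if rng.random() < lam12 / total else 3
        else:  # from state 2 the only transition is to state 3
            t += rng.exponential(1.0 / lam23)
            state = 3
        if t < tau:
            jumps.append(t)
            states.append(state)
    return jumps, states
```

Holding times are exponential here because the intensities are constant; covariate-dependent or time-varying intensities would require a more general sampler.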
For each scenario and censoring rate θ, we considered the training sample sizes n ∈ {100, 200, 300, 400} and repeated the simulation 1000 times. In each training data set we applied the proposed method with the search space F being the space of linear functions and τ = 3. Note that f*_w ∉ F in Scenarios 3 and 4. The tuning parameter was set to λ_n = n^{−1/2}, which satisfies the requirements of Theorems 2 and 4. To evaluate the performance of the estimated ITR d̂_{n,w}, we considered two metrics: (i) the estimated ITR value ratio V_w(d̂_{n,w})/V_w(d*_w), where V_w(d*_w) is the maximum possible value, and (ii) the misclassification rate, defined as the proportion of patients who were assigned by d̂_{n,w} to the wrong treatment. An estimated ITR value ratio close to 1 indicates that the performance of d̂_{n,w} is close to optimal. For each simulated training data set, the true value of the estimated ITR V_w(d̂_{n,w}) and the misclassification rate were calculated based on an independently simulated large testing data set of size 10000. To evaluate the validity of the proposed inference methods for V_w(d̂_{n,w}) we considered: (i) the average percent errors of the value function estimators V̂_{n,w}(d̂_{n,w}) and V̂^{jk}_{n,w}(d̂_{n,w}), defined as the averages over the 1000 replications of the percent differences between the estimated and the true values, where V̂_{n,w,b}, V̂^{jk}_{n,w,b}, and d̂_{n,w,b} are estimates from the bth simulated training data set, (ii) the average of the proposed standard error estimates relative to the Monte Carlo standard deviation of the value estimates, and (iii) the coverage probability of the 95% confidence intervals calculated using the proposed standard error estimator under asymptotic normality.
The simulation results regarding the performance of the estimated ITR d̂_{n,w} for the time in response (i.e., w = (0, 1, 0)′) are depicted in Figure 1. The estimated ITR value ratio was above 0.9 in all cases. Thus, even in Scenarios 3 and 4, where f*_w is not linear, the performance of the estimated rule was close to optimal. The maximum fixed rule value ratio max{V_w(1), V_w(−1)}/V_w(d*_w), where V_w(1) and V_w(−1) are the values for the fixed rules d(z) := 1 and d(z) := −1, was close to 0.8 in all cases. This indicates that the estimated ITR d̂_{n,w} can lead to substantially better health outcomes on average compared to fixed, one-size-fits-all, rules. As expected, the estimated ITR value ratio was higher with larger training sample sizes and lower censoring rates. A similar pattern was observed for the misclassification rate of d̂_{n,w}, which was lower for larger training samples and lower censoring rates. The simulation results regarding the validity of the proposed inference methods for V_w(d̂_{n,w}) are summarized in Tables 4 and 5. In all cases, the value estimator V̂_{n,w}(d̂_{n,w}) provided slightly optimistic estimates of the true value V_w(d̂_{n,w}). The percent error was over 4% only in a few cases with n = 100 and a 60% censoring rate. As expected, the percent error of the jackknife value estimator V̂^{jk}_{n,w}(d̂_{n,w}) was lower than that of V̂_{n,w}(d̂_{n,w}). However, the difference between the two value estimators was smaller for larger training samples and lower censoring rates. The average standard error estimates were close to the corresponding Monte Carlo standard deviations of the estimates, and the coverage probabilities were close to the nominal level in all cases. These results indicate the consistency of the proposed standard error estimator and support the asymptotic normality result from Theorems 3 and 4.
Additional simulation results evaluating the effect of selecting different values of τ for the analysis are presented in Figures 1-3 in Appendix C.1. In these simulation studies we considered the values τ ∈ {1, 2, 3}, with 3 being equal to the length of the study. Larger values of τ led to better performance in terms of both the median value function and the variability. This is to be expected, as more information is incorporated in the proposed method with a larger value of τ. However, the differences between the choices τ = 2 and τ = 3 were not pronounced in general. Further simulation results on the performance of the proposed ITR estimator when F is the RKHS with the Gaussian kernel with σ = 1 (less flexible kernel) and σ = 5 (more flexible kernel), for the duration of tumor response, are presented in Figures 4-6 in Appendix C.1. A larger training sample size n led to better performance in all cases, which reflects the consistency of the proposed ITR estimator. Using a more flexible kernel led to inferior performance with smaller training sample sizes. In most cases with n = 200 or n = 400, the less flexible kernel performed best, while the linear decision function generally performed better than the two kernel choices when n = 100. However, in many cases, the differences between the less flexible kernel and the linear decision function were not pronounced.
Simulation results regarding the performance of the estimated ITRd n,w for the PFS time (i.e., w = (1, 1, 0) ′ ) are depicted in Figures 7-9 in Appendix C.2. For comparison, these plots also illustrate the performance of the ICO and DR methods (Zhao et al., 2015).
To apply the DR approach, we estimated E(T | T > t, A, Z) based on the semiparametric Cox model for T with A, Z, and the interactions between A and Z as covariates, according to Zhao et al. (2015). This model is misspecified due to the complexity of the multistate process.

Figure 1: Simulation study: performance of d̂_{n,w} for the duration of tumor response (i.e., w = (0, 1, 0)′), in terms of the average estimated ITR value ratio V_w(d̂_{n,w})/V_w(d*_w) (top row) and misclassification rate (bottom row). The solid grey horizontal lines (top row) correspond to optimal performance, while the dashed grey horizontal lines (top row) correspond to the maximum fixed rule value ratio max{V_w(1), V_w(−1)}/V_w(d*_w). Columns correspond to the simulation scenarios and the horizontal axis to the training sample size n.

SPECTRUM Trial Data Analysis
The proposed methodology was applied to data from the SPECTRUM trial (Vermorken et al., 2013). In this analysis, the value τ was set to 18 months (the 90th percentile of the follow-up times; there were not many transitions to or from the tumor response state after this timepoint) and the preference weight set was W = {w_1, w_2, w_3} = {(0, 1, 0)′, (0.5, 1, 0)′, (1, 1, 0)′}. As mentioned in Section 2, in this multistate process setup, the utility under the preference weight w_1 corresponds to the restricted mean duration of tumor response, while the choices w_2 and w_3 provide a restricted quality-adjusted lifetime (where the time spent in the initial state is reduced by 50%) and the restricted PFS time, respectively. The covariates considered in this analysis were centered age (in years) at randomization (Z_age.c), an indicator that the primary tumor site is hypopharynx (Z_hyp), an indicator of history of prior treatment for squamous-cell carcinoma of the head and neck (Z_trt.hist), and an indicator that the ECOG performance status at baseline indicates symptoms but the patient is ambulatory (vs fully active; Z_ECOG). The tuning parameters λ_{n,w}, w ∈ W, were selected from the candidate set {0.001, 0.005, 0.01, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 50, 100} × n^{−1/2}, where n = 520, using leave-one-out cross-validation. All these potential choices for λ_{n,w} satisfy the requirements of the theorems in Section 3. The estimated optimal decision functions were f̂_{n,w_1}(z) = −1.32 − 0.54z_age.c + 0.90z_hyp + 1.17z_trt.hist + 0.83z_ECOG, f̂_{n,w_2}(z) ≈ −1.40 + 2.00z_trt.hist, and f̂_{n,w_3}(z) ≈ −1.47 + 2.00z_trt.hist, where the absolute values of the estimated coefficients of z_age.c, z_hyp, and z_ECOG in f̂_{n,w_2}(z) and f̂_{n,w_3}(z) were all less than 10^{−6}. The more complicated form of f̂_{n,w_1} may reflect higher heterogeneity in the treatment effect on tumor response compared to the treatment effect on quality-adjusted lifetime and PFS time.
Intuitively, one would expect this higher heterogeneity with respect to tumor response to carry over to the other outcomes.
However, this may not be the case because response typically has a short duration for this tumor and might not have a substantial impact on disease progression and/or death. The class of estimated ITRs is {d̂_{n,w} = sgn(f̂_{n,w}) : w ∈ {w_1, w_2, w_3}}. Clearly, the estimated ITRs d̂_{n,w_2} and d̂_{n,w_3} are equivalent and assign chemotherapy+panitumumab (treatment 1) to patients with a history of prior treatment, and chemotherapy alone (treatment −1) to those without prior treatment. In contrast, the rule d̂_{n,w_1} is more complicated and accounts for more covariates. Next, we estimated the performances (i.e., value functions) of the estimated optimal ITRs and compared them to those of the fixed, one-size-fits-all, rules d(z) := 1 (everyone is assigned to chemotherapy+panitumumab) and d(z) := −1 (everyone is assigned to chemotherapy alone). To account for the multiplicity due to conducting inference about nine parameters in total, we calculated 95% simultaneous confidence intervals using the approach described in Section 3. The percentile c_0.05 was estimated based on B = 1000 simulation replications and the corresponding estimate was ĉ_0.05 = 2.59. The results from this analysis are summarized in the accompanying table. Also, no significant differences were observed between the optimal rule and d(z) := 1. This might be due to a potentially small difference between the effects of chemotherapy alone and chemotherapy+panitumumab among the patients to whom the two rules assigned different treatments.
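The simultaneous confidence intervals rest on a percentile of the maximum absolute standardized deviation across the nine value parameters, estimated by simulation. A generic Monte Carlo sketch of that type of calculation (the 9×9 covariance matrix here is an illustrative placeholder, not the paper's estimated covariance, so the resulting percentile differs from ĉ_0.05 = 2.59):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_stat_percentile(Sigma, B=1000, alpha=0.05):
    """Estimate the (1 - alpha) percentile of max_j |Z_j|, where
    Z ~ N(0, R) and R is the correlation matrix implied by Sigma."""
    sd = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(sd, sd)  # convert covariance to correlation
    draws = rng.multivariate_normal(np.zeros(len(sd)), R, size=B)
    return float(np.quantile(np.abs(draws).max(axis=1), 1 - alpha))

# Illustrative 9x9 covariance (identity => independent parameters).
c = max_stat_percentile(np.eye(9))
# Simultaneous 95% CIs: estimate_j +/- c * se_j, for j = 1, ..., 9.
```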

Discussion
This article addressed the issue of optimal ITR estimation in randomized clinical trials with right-censored multistate processes. To achieve this, we devised a novel objective function that can handle general nonhomogeneous multistate processes and can easily incorporate patient preferences. A key feature of the proposed methodology is that it utilizes information from the entire observed multistate process.

There are two important practical considerations when applying the proposed approach.
First, one needs to specify the preference weight set W. This can be achieved by consulting subject matter experts (e.g., clinicians) or via a survey in a sample of patients. For a given W, the most appropriate rule within the class of estimated ITRs {d̂_{n,w} : w ∈ W} for a patient with preference weight w_0 is d̂_{n,ŵ}, with ŵ = arg min_{w∈W} ||w − w_0|| (i.e., the closest preference weight in the set W). Second, the class of decision functions F needs to be chosen. Even though the proposed ITR estimator is universally consistent when F is the RKHS with the Gaussian kernel, the rate of convergence under this choice may be extremely slow (Steinwart and Christmann, 2008). This means that an extremely large sample size may be needed in practice to obtain an ITR d̂_{n,w} whose value V_w(d̂_{n,w}) is reasonably close to the optimal value V_w(d*_w). For this reason, we focused on the class of linear decision functions in most of this article. Under this choice, the proposed estimator is universally consistent only if the true optimal decision function f*_w is linear. However, even if f*_w is not linear, the value V_w(d̂_{n,w}) of the estimated ITR can be close to the optimal value, as illustrated in the simulation studies presented in Section 4. In practice, one can further improve the performance of the estimated ITR when F is the class of linear functions by considering two-way or three-way covariate interactions (Zhou et al., 2017).
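The closest-preference-weight selection described above can be sketched as follows (the weight set matches the one used in the data analysis; the patient weight w0 is hypothetical):

```python
import numpy as np

# Preference weight set W from the SPECTRUM analysis.
W = [np.array([0.0, 1.0, 0.0]),
     np.array([0.5, 1.0, 0.0]),
     np.array([1.0, 1.0, 0.0])]

def closest_weight(w0, weights):
    """Return the element of the weight set nearest to w0 in Euclidean norm."""
    dists = [np.linalg.norm(w - w0) for w in weights]
    return weights[int(np.argmin(dists))]

# Hypothetical patient who discounts time in the initial state by 30%:
w0 = np.array([0.7, 1.0, 0.0])
w_hat = closest_weight(w0, W)  # nearest available weight in W
```

The patient would then be assigned treatment according to the estimated rule d̂_{n,ŵ} indexed by the selected weight.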
A key assumption of the proposed approach is independent censoring. This assumption is realistic in many clinical trials where accrual time is not associated with patient characteristics and censoring is mainly due to administrative reasons (Zhao et al., 2011; Goldberg and Kosorok, 2017). A plausible relaxation of this assumption is to allow censoring to depend on treatment A, since the censoring rate will likely be higher among those receiving the treatment with the greater toxicity (Templeton et al., 2020). A further relaxation of the independent censoring assumption is to allow censoring to depend on both A and Z by imposing a semiparametric Cox model (more details on both relaxations are provided in Appendix E). It must be noted that, for the case of censored failure times, the DR method (Zhao et al., 2015) allows the censoring model to be misspecified (unlike the proposed approach) provided that the failure time model is correctly specified. However, the true censoring model may be of a less complicated form than the true failure time model in clinical trial applications (Goldberg and Kosorok, 2017) and, thus, is more likely to be correctly specified.
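The treatment-dependent relaxation amounts to estimating the censoring cumulative hazard with the Nelson-Aalen estimator separately within each treatment arm, treating censoring as the event of interest. A minimal numpy sketch (function and variable names are ours):

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard estimator.

    times:  observed times (event or censoring)
    events: 1 if the event of interest occurred (here: censoring)
    Returns (distinct event times, cumulative hazard values).
    """
    t = np.asarray(times, dtype=float)
    d = np.asarray(events, dtype=int)
    uniq = np.unique(t[d == 1])
    H, h = [], 0.0
    for u in uniq:
        at_risk = np.sum(t >= u)                     # risk set size at u
        h += np.sum((t == u) & (d == 1)) / at_risk   # hazard increment
        H.append(h)
    return uniq, np.array(H)

def censoring_hazard_by_arm(times, delta, arm):
    """Censoring cumulative hazard per treatment arm (arm in {-1, 1}).
    delta = 1 means the failure/transition was observed, so the
    censoring event indicator is 1 - delta."""
    out = {}
    for a in (-1, 1):
        idx = np.asarray(arm) == a
        out[a] = nelson_aalen(np.asarray(times, dtype=float)[idx],
                              1 - np.asarray(delta)[idx])
    return out

res = censoring_hazard_by_arm([1, 2, 3, 4, 5], [1, 0, 1, 1, 0], [1, 1, 1, -1, -1])
```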
An interesting extension of this work is to allow for interval censoring. This could be achieved by utilizing an estimator of the state occupation probabilities with interval-censored data, and deriving an appropriate objective function using, potentially, calculations similar to those in Section 2. Also, extending the proposed approach from the single-decision problem to the multi-decision setting, such as a sequential multiple assignment randomized trial, is an important direction for future research.

Appendix A: Proofs
In this Appendix we provide the proofs of the theorems stated in Section 3 of the main text. The proposed methodology assumes the following regularity conditions.

C1. The right censoring time C is independent in the sense that {Y*_w(·; 1), Y*_w(·; −1), A, Z} ⊥⊥ C.
C2. The benefit process has a square-integrable total variation, i.e. E{ …

C5. For a linear decision function f̄_w(·) = β̄_{0,w} + ⟨β̄_{1,w}, ·⟩ that minimizes R_{φ,w}(f) over the space of linear functions, (β̄_{0,w}, β̄′_{1,w})′ ∈ B ⊂ R^{p+1}, where B is a compact and convex set. Moreover, letting …

Conditions C1, C2, and C4 are standard in the literature of nonparametric methods for survival and multistate process data. A plausible relaxation of condition C1 is to allow censoring to depend on treatment A, since censoring will likely be higher among those receiving the treatment with the greater toxicity (Templeton et al., 2020). This can be trivially incorporated into the proposed methodology by simply estimating the cumulative hazard of censoring nonparametrically and separately for the two treatment groups. A further relaxation of the independent censoring assumption is to allow censoring to depend on both A and Z. In this case, one can impose a semiparametric Cox model of the form Λ(t; A, Z) = Λ_0(t) exp{θ′(A, Z′)′} for the right censoring time, and use the estimated conditional hazard in the proposed objective function and value function estimators. Provided that this model is correctly specified, the theoretical properties of the proposed method still hold, with the exception that γ_i(t) in ψ_{i,w}(f) (see Appendix B) is replaced by the influence function of √n{Λ̂_n(t) exp(θ̂′_n(a, z′)′) − Λ(t; a, z)} under partial likelihood estimation. Condition C3 is another common condition that ensures that the covariates are bounded.
Condition C5 guarantees the uniqueness of the optimal linear decision function f̄_w. A similar condition has been previously used in the literature on (unweighted) support vector machines (Jiang et al., 2008).
For notational simplicity, we omit the subscript w, which corresponds to the preference weight, and use the more compact notations Y(t), Y*(t; 1), Y*(t; −1), V(d), V̂_n(d), f̂_n, d̂_n, and ψ(f), for the remainder of Appendix A. The proofs of Theorems 2-4 rely heavily on empirical process theory (van der Vaart and Wellner, 1996; Kosorok, 2008). For these proofs, we use the standard empirical process theory notation for any measurable function of the data f : D → R, where D is the sample space and P is the true probability measure on the Borel σ-algebra on D. … for some δ > 0, and the data-dependent function … Before providing the proofs of the theorems listed in the main text, we state and prove a useful lemma. … Kosorok (2008). Also, the class {I(C ≥ T ∧ t) : t ∈ [0, τ]} is P-Donsker by Lemma 4.1 in Kosorok (2008). Next, see that the class L_δ is uniformly bounded by Λ_0(τ) + δ, where Λ_0(τ) < ∞ by condition C4, and consider the data-dependent function … where T̃ ∈ [0, τ] and Λ ∈ L_δ. Now, the class {f_Λ : Λ ∈ L_δ} is P-Donsker as a consequence of Lemma 9.11 in Kosorok (2008), as is (trivially) the fixed class {Λ(t) : Λ ∈ L_δ, t ∈ [0, τ]}.

A.1 Proof of Theorem 1 (Fisher consistency)
Using arguments similar to those used in Tsiatis et al. (2019) for the case of a general ITR, it can easily be shown that the optimal ITR in our case satisfies … for all z ∈ Z. Now, by Tonelli's theorem (Athreya and Lahiri, 2006) and given that …, as argued in the main text, the surrogate risk R_φ can be expressed as … Given that condition C1 implies that …, by a second application of Tonelli's theorem and assumptions A1-A3 we have … The function f̃ that minimizes this expression also minimizes R_φ. The expression being minimized is a (continuous) piecewise linear function which decreases strictly on (−∞, −1] and increases strictly on [1, ∞), almost surely. Therefore, the minimizer must lie in [−1, 1], almost surely. Consequently, for any z ∈ Z, if …, then f̃(z) must be negative. Consequently, by (4), d*(z) = sgn{f̃(z)} for all z ∈ Z.
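The pointwise minimization step invoked above can be written out explicitly under the standard outcome weighted learning argument; the following display uses schematic notation (a(z) and b(z) denote the nonnegative weighted-outcome terms attached to treatments 1 and −1, which are not named this way in the paper):

```latex
% Schematic form of the conditional surrogate risk at a fixed z:
\[
  r(u) = a(z)\,\phi(u) + b(z)\,\phi(-u), \qquad
  \phi(u) = (1-u)_{+}, \quad a(z),\, b(z) \ge 0 .
\]
% r is continuous and piecewise linear, strictly decreasing on
% (-\infty,-1] and strictly increasing on [1,\infty), so any minimizer
% lies in [-1,1]. On that interval,
\[
  r(u) = a(z)(1-u) + b(z)(1+u)
       = \{a(z)+b(z)\} + u\,\{b(z)-a(z)\},
\]
% so the minimizer is negative when a(z) < b(z) and positive when
% a(z) > b(z); the sign of the minimizer therefore recovers the
% optimal rule d^*(z).
```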

A.2 Proof of Theorem 2
First, define g_π(D) = Aπ + (1 − A)/2 and, for any f ∈ F, … (where ξ_Λ(D) was defined in the beginning of Appendix A), and note that R_φ(f) = P L_{f,Λ_0,π_0} and R̂_φ(f) = P_n L_{f,Λ̂_n,π̂_n}. Now, we have that … for any t ∈ [0, τ] and all n ≥ 1. This fact, along with the uniform (outer) almost sure consistency of the Nelson-Aalen estimator (guaranteed by conditions C1 and C4), the facts that ∫_0^τ exp{Λ̂_n(τ)} dm(t) = τ exp{Λ̂_n(τ)} < ∞ for all n ≥ 1 and ∫_0^τ exp{Λ_0(τ)} dm(t) = τ exp{Λ_0(τ)} < ∞, and the extended dominated convergence theorem (Athreya and Lahiri, 2006), lead to the conclusion that ξ_{Λ̂_n}(D) →_as ξ_{Λ_0}(D). This result, the almost sure consistency of π̂_n, and the continuous mapping theorem imply that … Consequently, … Next, define f̄ … By the positivity of λ_n and the definition of f̂_n, we have that

P_n L_{f̂_n,Λ̂_n,π̂_n} ≤ P_n L_{f̂_n,Λ̂_n,π̂_n} + λ_n ||f̂_n||^2 ≤ P_n L_{f̄,Λ̂_n,π̂_n} + λ_n ||f̄||^2 = P_n L_{f̄,Λ_0,π_0} + λ_n ||f̄||^2 + o_as(1).

Taking limit superiors on both sides we get lim sup_{n→∞} P_n L_{f̂_n,Λ̂_n,π̂_n} ≤ P L_{f̄,Λ_0,π_0} = R_φ(f̄), almost surely, by the strong law of large numbers and the fact that λ_n → 0. This implies that, for all n sufficiently large, … almost surely, by the definition of f̄. Therefore, for all n sufficiently large, we have

… ≤ (P_n − P) L_{f̂_n,Λ_0,π_0} + P_n (L_{f̂_n,Λ̂_n,π̂_n} − L_{f̂_n,Λ_0,π_0}) ≤ (P_n − P) L_{f̂_n,Λ_0,π_0} + δ,

almost surely, for any δ > 0, by (6). Next, it needs to be shown that (P_n − P) L_{f̂_n,Λ_0,π_0} = o_p(1). By the definition of f̂_n we have P_n L_{f̂_n,Λ̂_n,π̂_n} + λ_n ||f̂_n||^2 ≤ P_n L_{f,Λ̂_n,π̂_n} + λ_n ||f||^2, for any f ∈ F. Selecting f ≡ 0, we have P_n L_{f̂_n,Λ̂_n,π̂_n} + λ_n ||f̂_n||^2 ≤ P_n ξ_{Λ̂_n} g_{π̂_n}, since φ(0) = 1 and ||0|| = 0. By the non-negativity of L_{f̂_n,Λ̂_n,π̂_n}(D), the previous inequality implies that … by assumption A3. Therefore, for all sufficiently large n and any δ′ > 0, we have … almost surely. Now, the class of functions … is Donsker, for any δ′ > 0 small.
This follows from the fact that if F is the class of linear or polynomial functions, then F is Donsker by Lemma 9.6 and Theorem 9.2 in Kosorok (2008), and subsets of a Donsker class are also Donsker. If F is an RKHS, then G_1(δ′) is Donsker by condition C3 and arguments similar to those used in the proof of Lemma A.9 in Hable (2012). Also, the class … is Donsker by Corollary 9.32 in Kosorok (2008), since L_{f,Λ_0,π_0} is Lipschitz continuous in f, because the hinge loss is Lipschitz continuous. This implies that, for all n sufficiently large, √λ_n L_{f̂_n,Λ_0,π_0} belongs to the Donsker class G_2(δ′) and, therefore, √n(P_n − P)√λ_n L_{f̂_n,Λ_0,π_0} = O_p(1). Consequently, … since nλ_n → ∞ by assumption. Substituting this result into (7) we have … Setting δ = δ_n ↓ 0, with √n δ_n → ∞, implies that … which completes the proof of the first statement of Theorem 2. For the second statement, it can be shown, using arguments similar to those used in the proof of Theorem 3.2 in Zhao et al. (2012), that … for any distribution P of the data D and any measurable decision function f : Z → R. Therefore, … Now, if F is the space of linear functions and f* ∈ F, then inf … Thus, consistency follows from the first statement of Theorem 2 and (8). Next, suppose that F is the RKHS with the Gaussian kernel and that the marginal distribution μ of Z is regular. Since the Gaussian kernel is a universal kernel, arguments similar to those used in the proof of Lemma 3.4 in Zhou et al. (2017) lead to … This, (8), and the first statement of Theorem 2 imply universal consistency.

A.3 Proof of Theorem 3
Here we provide the proof of the stronger (uniform) statement of Theorem 3 for the case where F is the space of linear functions. The proof of the weaker (pointwise) statement for any given measurable function f is a simplified version of the proof below and is not discussed further.
First, note that, for any f ∈ F, we have V(sgn(f)) = P ξ_{Λ_0} g_{π_0} u_f and V̂_n(sgn(f)) = P_n ξ_{Λ̂_n} g_{π̂_n} u_f.

Now, straightforward algebra leads to
Next, for any functional h : F → R, define the supremum norm ||h||_F = sup_{f∈F} |h(f)|. For the term B_{n,1}(f) in (9) we have … The last inequality, along with the boundedness of max[(π̂_n π_0)^{−1}, {(1 − π̂_n)(1 − π_0)}^{−1}] for all n sufficiently large (as a result of assumption A3), the fact that √n(π̂_n − π_0) = O_p(1) as a consequence of the central limit theorem, and the fact that … The term B_{n,2}(f) can be expressed as follows: … The class of functions … is a Vapnik-Červonenkis (VC) class by Lemma 9.6 in Kosorok (2008). This, along with Lemma 9.9 and Theorem 9.2 in Kosorok (2008), implies that the class {u_f : f ∈ F} has a bounded uniform entropy integral. In addition, the latter class can easily be argued to be pointwise measurable and, therefore, this class is P-Donsker. By conditions C2 and C4 and Lemma 1, the fact that the class {ξ_Λ : Λ ∈ L_δ} is uniformly bounded by τ exp{Λ_0(τ) + δ}, assumption A3, and the fact that products of uniformly bounded P-Donsker classes are also P-Donsker, it follows that the class … is P-Donsker for some δ > 0. Also, by assumption A3, we have … The last two results, along with the uniform consistency of the Nelson-Aalen estimator (guaranteed by conditions C1 and C4) and arguments similar to those used in the proof of Lemma 3.3.5 in van der Vaart and Wellner (1996), lead to the conclusion … For the second term on the right side of (10), it can be shown that the map Λ → P g^{−1} … This, along with the functional delta method (Theorem 3.9.4 in van der Vaart and Wellner, 1996) and the fact that … lead to the conclusion that … where γ_i is considered fixed under the expectation operator P. Therefore, … For the term B_{n,3}(f), we have … where |π*_n − π_0| ≤ |π̂_n − π_0|.
For the first term on the right side of (11) we have … The first term on the right side of the above inequality is o_as(1) by assumption A3, the fact that |π*_n − π_0| ≤ |π̂_n − π_0|, the strong law of large numbers, the continuous mapping theorem, and condition C4, which guarantees the finiteness of Λ_0(τ). The second term on the right side of the last inequality is o_as*(1) as a consequence of the P-Donsker property of the class … which follows from arguments similar to those used in the analysis of the term B_{n,2}(f) above, since this property implies that the latter class is also P-Glivenko-Cantelli. Therefore, using the last inequality gives … Consequently, given that √n(π̂_n − π_0) = O_p(1) by the central limit theorem, it follows that … Putting all the pieces together, it follows from (9) that … where ǫ_n(f) = B_{n,1}(f) + ǫ_{n,1}(f) + ǫ_{n,2}(f) + o_p(1), with ||ǫ_n||_F = o_p(1). Finally, the class of functions {ψ(f) : f ∈ F} is P-Donsker as a consequence of the P-Donsker property of the class {u_f : f ∈ F} (as argued above), Lemma 15.10 in Kosorok (2008), and the fact that sums of P-Donsker classes multiplied by random variables with finite second moments are also P-Donsker.
Next, by Theorem 3 and arguments similar to those used in the proof of Lemma 3.3.5 in van der Vaart and Wellner (1996), it follows that … for any preference weight w that satisfies the requirements of Section 2 in the main text.
Finally, the conclusion of Theorem 4 follows from the last result and the fact that the set W of preference weights is finite.

Appendix B: Influence Functions

… where … for h in the space D[0, τ] of right-continuous functions on [0, τ] with left-hand limits, and … The empirical versions of the influence functions are … for h ∈ D[0, τ], and …

Appendix C. Additional Simulation Results
In this Appendix we provide additional simulation results.

Scenario 4
[Table: Scenario 4 simulation results — value function estimates for training sample sizes n = 100, 200, and 400.]