Utility based approach in individualized optimal dose selection using machine learning methods

Abstract The goal in personalized medicine is to individualize treatment using patient characteristics and improve health outcomes. Selection of an optimal dose must balance the effect of dose on both treatment efficacy and toxicity outcomes. We consider a setting with one binary efficacy and one binary toxicity outcome. The goal is to find the optimal dose for each patient using clinical features and biomarkers from an available dataset. We propose to use flexible machine learning methods such as random forest and Gaussian process models to build models for efficacy and toxicity depending on dose and biomarkers. A copula is used to model the joint distribution of the two outcomes and the estimates are constrained to have non-decreasing dose-efficacy and dose-toxicity relationships. Numerical utilities are elicited from clinicians for each potential bivariate outcome. For each patient, the optimal dose is chosen to maximize the posterior mean of the utility function. We also propose alternative approaches to optimal dose selection by adding additional toxicity-based constraints and an approach taking into account the uncertainty in the estimation of the utility function. The proposed methods are evaluated in a simulation study to compare expected utility outcomes under various estimated optimal dose rules. Gaussian process models tended to have better performance than random forest. Enforcing monotonicity during modeling provided small benefits. Whether and how correlation between efficacy and toxicity was modeled had little effect on performance. The proposed methods are illustrated with a study of patients with liver cancer treated with stereotactic body radiation therapy.

population based treatment. In other settings, the treatment choice is not binary and instead consists of choosing the value of a continuous variable, such as the dose of radiation or of a drug. The goal in this setting is to find the optimal dose for each patient to maximize the benefit of treatment. With the development of biomarkers, this goal can be achieved by evaluating how the biomarkers moderate the treatment effect on the outcome or outcomes. In statistical terms, moderation of the treatment effect can be formulated as an interaction between the biomarkers and dose in a model.
In this article we consider the setting where there is an existing dataset from previously treated patients that contains treatment dose, observed efficacy and toxicity outcomes, and covariates, possibly including biomarkers. The available dataset could be from several clinical trials, cohort studies, or real-world evidence studies for which the sample size is relatively large and there is variation in the dose. The statistical goal is to analyze the data to learn an individualized dosing rule giving an optimal dose as a function of patient-level covariates.
Supervised learning in the form of regression (for continuous outputs) and classification (for discrete outputs) is an important constituent of statistics and machine learning. Widely used parametric models such as linear regression and logistic regression are simple, but can suffer from model mis-specification. Concerns about mis-specification can be reduced by including extra features, such as interactions and splines, or many non-parametric methods can be used instead. Because of their greater flexibility, there is less concern about model mis-specification, but also increased potential for overfitting. Decision trees such as Classification and Regression Trees (CART), and random forest, built upon CART, are easy to understand and have been used in determining the optimal treatment. 1 Their tree structure handles interactions well but also increases the potential for overfitting. Methods based on kernel machines are also flexible and have good performance with high-dimensional data. Examples of these are support vector machines, as used in the outcome weighted learning (OWL) 2 method, and Gaussian processes. 3 Modern Bayesian semiparametric and non-parametric models also provide posterior uncertainty quantification, which can be used to provide uncertainty estimates of patient-specific treatment decisions. An example is using Bayesian Additive Regression Trees (BART) to provide individualized treatment rules (ITR); this approach provides the uncertainty of the outcome associated with the optimal ITR. 4

In early phase studies in oncology, it is typical to describe the patient outcome in terms of both toxicity (T), such as adverse events, and efficacy (E), such as overall response or progression-free status during a specified timeframe. The benefit for each patient can be defined by combining these two binary outcomes. By maximizing the potential benefit for each patient, the individualized optimal treatment or dose will be selected.
Several strategies have been proposed for identifying an optimal treatment or dose based on the trade-off between efficacy and toxicity. To achieve maximum efficacy with tolerable toxicity, a utility function can be used as a weighted difference of the probabilities of efficacy and toxicity. 5 More generally, a utility matrix is elicited from clinicians by assigning numerical utilities to each possible efficacy and toxicity outcome pair, which allows different preferences for different outcomes; the utility function, defined as the expected value of the utility, is then compared at different dose levels. 6 Contours characterizing the trade-off between E and T are an alternative and flexible approach. For this, the pairs of the probability of E and the probability of T on the same contour are equally desirable, and the dose that maximizes the desirability is selected. 7

In oncology and other disease settings, it is frequently reasonable to assume that increasing dose leads to increased toxicity and efficacy. However, the increase may not be strictly monotone over the whole range of possible doses. For example, FDA guidance on cellular and gene therapy products mentions that indicators of potential benefit appear to plateau above a certain dose. 8 Thus, we will consider constraining the estimated dose-efficacy and dose-toxicity relationships to be non-decreasing for all patients. The effect of this should be to reduce estimation noise, improve efficiency, and hence yield more reliable results.
In previous work we built separate models for E and T and did not explicitly consider correlation between E and T. 5 In this article we relax the assumption of independence and use flexible machine learning methods to build the models for E and T using dose and biomarkers, which differs from the previously developed parametric models. Furthermore, in this article we define the optimal dose as the dose value that maximizes the expected utility, where utility values are specified in matrix form for the 4 possible bivariate binary outcomes instead of as utility functions, which are equivalent to restricted utility matrices with only a single degree of freedom (ie, one value to specify). In Section 2, we consider alternative uses of the utility function to select an optimal dose, including the addition of constraints with clinical motivation as well as the uncertainty of the utility function. In Section 3, we propose the use of random forest and Gaussian process models to build marginal models and link them with a Gaussian copula. We also build models on the joint probabilities and use the pool-adjacent-violators algorithm (PAVA) isotonic transformation to give non-decreasing dose-efficacy and dose-toxicity relationships. A simulation study to compare different model building and optimal dose finding methods under several scenarios is summarized in Section 4. In Section 5, we illustrate the proposed methods on a dataset of patients with liver cancer treated with stereotactic body radiation therapy (SBRT). We close with discussion in Section 6.

Utility function and matrix
Denote the marginal probabilities for the efficacy and toxicity outcomes, given dose d and covariates x, by p_E(d, x) and p_T(d, x). One way to combine the efficacy and toxicity outcomes is through a utility function such as p_E(d, x) − λ p_T(d, x), with tradeoff parameter λ > 0 either pre-specified by physicians or calculated as a tuning parameter to meet a pre-specified level of toxicity in the population. 5 Alternatively, a unique utility value can also be specified for each possible bivariate patient outcome (E, T) in a utility matrix, 6 as shown in Table 1. We assign U_10, the highest value, to the best possible outcome (efficacy and no toxicity), and U_01, the lowest value, to the worst possible outcome (toxicity and no efficacy). U_00 and U_11 have values in between and can be larger than, smaller than, or equal to each other. Denote the joint probability of E and T given d and x as p_ET(d, x) for E, T in {0, 1}. We define a utility function as the expectation of this utility matrix with respect to the joint probability of E and T at dose d and covariates x, Ū(p(d, x)) = Σ_{E,T ∈ {0,1}} U_ET p_ET(d, x). The goal is then to maximize the utility function, or the expectation of the utility matrix, for fixed x over a grid of possible dose values.
We define the utility function Ū(p(d, x), λ) as the expectation of the utility matrix in Table 2. When λ_1 + λ_2 = 1, the gain in utility when moving from E = 0 to E = 1 is the same for both levels of T. Written in terms of the marginals and the joint probability, Ū(p(d, x), λ) = (1 − λ_1) p_E(d, x) − λ_1 p_T(d, x) + λ_1 + (λ_1 + λ_2 − 1) p_11(d, x), so the joint probability enters only through a term whose coefficient vanishes exactly in this special case. The models for E and T can then be built separately, 5 and no assumption concerning independence or dependence of E and T is needed. However, when λ_1 + λ_2 ≠ 1, this is not the case and the utility function does depend on the correlation between efficacy and toxicity.
For a subject with covariates x, we can directly calculate the optimal dose when the true probabilities p(d, x) are known, by maximizing the utility function across possible dose values. That is,

d_opt(x) = argmax_d Ū(p(d, x), λ). (2)
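Selecting the optimal dose in Equation (2) amounts to evaluating the expected utility of the 2 × 2 utility matrix on a grid of candidate doses and taking the maximizer. A minimal sketch (the array layout and function names here are our own, not taken from the paper's code):

```python
import numpy as np

def expected_utility(p_joint, U):
    """Expected utility of a 2x2 utility matrix U under joint probabilities.

    p_joint: array of shape (..., 2, 2) with p_joint[..., e, t] = P(E=e, T=t).
    U: 2x2 array with U[e, t] = utility assigned to outcome (E=e, T=t).
    """
    return np.einsum('...et,et->...', p_joint, U)

def optimal_dose(dose_grid, p_joint_by_dose, U):
    """Equation (2): the grid dose maximizing expected utility for one patient."""
    utilities = expected_utility(p_joint_by_dose, U)  # one value per grid dose
    return dose_grid[int(np.argmax(utilities))]
```

Here `p_joint_by_dose` would come from whatever joint model (random forest, Gaussian process, copula) has been fitted, evaluated at each grid dose for the patient's covariates.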

Modified utility functions
In many settings including oncology, both efficacy and toxicity increase with increasing dose, and clinicians may be unwilling to give treatments associated with high rates of toxicity, even if they are also highly efficacious. To address this concern in dose selection, we can modify the utility function by increasing the penalty on higher values of p_T(d, x). As an illustration, we will modify the utility function such that if p_T(d, x) ≥ 0.3, we subtract 2λ_1 p_T(d, x) from the utility function, which corresponds to a tripling of the weight for toxicity higher than 0.3 and gives a utility function of

Ū_1(p(d, x), λ) = Ū(p(d, x), λ) − 2λ_1 p_T(d, x) I{p_T(d, x) ≥ 0.3}.

In some settings, it may be desirable to limit the between-patient variance in selected optimal dose and/or to avoid extreme doses. This can be accommodated by adding a penalty for extreme doses to the utility function. Let d_fix be a population fixed dose that could be pre-specified as a standard value, or estimated by fitting a dose-only model without covariates x in it. Then the modified utility function is

Ū_2(p(d, x), λ) = Ū(p(d, x), λ) − γ(d − d_fix)^2,

where γ is the penalty parameter.

Figure 1 shows the trade-off contours over the two-dimensional space with different utility functions. The pairs of (p_E, p_T) on each contour are considered to be equally desirable (ie, have the same utility function value), as denoted by the solid lines. For a given patient with covariates x, the path of (p_E(d, x), p_T(d, x)) over the possible dose range is shown as a dashed line, and the dose that maximizes the utility function is selected, as shown by the solid dot. When λ_1 + λ_2 = 1, the utility function is a weighted difference of p_E and p_T and results in straight line contours. The other utility functions have curved or piecewise linear contours. For Ū_1, the utility function decreases rapidly when p_T(d, x) is above the pre-specified toxicity limit of 0.3. Ū_2 has an l_2-penalty for dose and, when the dose is too high, the contours could have negative slopes.
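Both modifications are small adjustments to the base utility. A minimal sketch, assuming the toxicity modification subtracts 2λ_1 p_T(d, x) once p_T(d, x) ≥ 0.3 and the dose modification subtracts a quadratic penalty γ(d − d_fix)^2 (the function names and exact forms here are illustrative reconstructions, not the paper's code):

```python
def utility_tox_penalty(u_base, p_tox, lam1, limit=0.3):
    """U1: subtract 2*lam1*p_T once p_T >= limit, tripling the toxicity weight."""
    return u_base - 2.0 * lam1 * p_tox * (p_tox >= limit)

def utility_dose_penalty(u_base, dose, dose_fix, gamma):
    """U2: l2-penalize doses that are far from the population fixed dose."""
    return u_base - gamma * (dose - dose_fix) ** 2
```

Either wrapper is applied to the base expected utility at each grid dose before taking the argmax, so the dose search itself is unchanged.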
When λ_1 + λ_2 ≠ 1, the four plots (d) to (g) indicate the influence of different λ_1, λ_2 values and of the correlation between E and T. The contours are curved through their dependence on p_11(d, x), and the joint distribution of E and T plays a role. These plots show different choices of optimal dose as the utility function changes. For (e) to (g), the correlations are 0, 0.8, and −0.8 with the same set of λ_1, λ_2 and marginal probabilities for a given patient; the utilities differ substantially, but the optimal doses are similar.

Posterior distribution of utility function
Given the data D_N, the point estimates of the probabilities for the possible doses, denoted by p̂(d, x), can be plugged into Equation (2). Then the individual optimal dose for a subject with covariates x is the dose that maximizes the weighted sum of the utility values using the joint probability estimates. When using Bayesian estimation, we obtain the posterior distribution of the joint probabilities p(d, x) for the possible doses, and thus the posterior distribution of Ū(p(d, x), λ). The optimal dose for a subject with covariates x is selected as the dose that maximizes the posterior mean of Ū(p(d, x), λ) given the data D_N.
To assess the uncertainty of the estimation, we can consider the posterior distribution of the utility function. An alternative way to define the optimal dose for a subject with covariates x is then as the dose that maximizes the posterior probability that the utility function is higher than the utility function at some fixed dose that is the same for everyone, that is, argmax_d Pr(Ū(p(d, x), λ) > Ū(p(d_fix, x), λ) | D_N). With this approach, doses associated with small increases in expected utility will not be selected if they also greatly increase the variance of the expected utility.
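Given posterior draws of the utility function at each grid dose (e.g., from MCMC output), both decision rules are simple array operations. A sketch, assuming a draws-by-doses array for one patient (names are illustrative):

```python
import numpy as np

def dose_max_posterior_mean(dose_grid, utility_draws):
    """Pick the dose maximizing the posterior mean utility.

    utility_draws: array of shape (n_draws, n_doses), posterior draws of the
    utility function evaluated at each grid dose for one patient.
    """
    return dose_grid[int(np.argmax(utility_draws.mean(axis=0)))]

def dose_max_prob_better_than_fixed(dose_grid, utility_draws, fix_idx):
    """Pick the dose maximizing P(utility(d) > utility(d_fix) | data)."""
    prob_better = (utility_draws > utility_draws[:, [fix_idx]]).mean(axis=0)
    return dose_grid[int(np.argmax(prob_better))]
```

The second rule compares each dose's utility draws to those at the fixed dose draw by draw, so a dose with a slightly higher mean but much larger variance earns a low exceedance probability and is not selected.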

MODEL BUILDING
There are various approaches to modeling the joint distribution of E and T, of which we consider three: modeling the marginal distributions for E and T and assuming independence, using a copula to link the marginals into a joint distribution, or directly modeling the 4-level categorical outcome (E, T) on d and x. Instead of using parametric models, we will consider non-parametric approaches (eg, random forest) and kernel machines (eg, Gaussian process) to model the joint distribution of E and T.

Random forest
Random forest (RF) for classification constructs multiple trees using bootstrapped samples from the original data. A subject left out of the construction of the kth tree is passed down the kth tree to get a classification, which is the out-of-bag (oob) classification. The proportion of classifications of each type across all the oob classifications for this subject is then used to calculate the predicted probability. First, we use RF to build models for E and T separately to obtain estimates p̂_E(d, x) and p̂_T(d, x) for each subject. After obtaining estimated marginal probabilities, if we assume conditional independence of efficacy and toxicity, then the joint distribution of (E, T) is given by the product of the marginals. Alternatively, we can use a copula to link the marginals, as described in Appendix A. We can plug the p̂_E(d, x) and p̂_T(d, x) for each subject into the copula function and maximize the log-likelihood across the dataset D_N to estimate ρ̂, the correlation parameter in the copula. Rborist by Seligman was used to implement the random forest due to its computational efficiency. 10

In random forest, the predicted probabilities of E or T may decrease over some range of dose for some patients. To prevent this, we can apply constraints during the estimation of the random forest. Specifically, whenever a split on dose is selected, we require the left node (lower dose) to be associated with lower probabilities of E or T than the right node (higher dose). 10

Instead of modeling the marginals and linking them to obtain the joint distribution, we can model the joint probabilities directly. When doing this using methods such as random forests, we cannot impose monotonicity directly in the model, since the desired monotonicity is in terms of each marginal and not in the four joint probabilities p_00, p_10, p_01, p_11.
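For binary outcomes, the Gaussian copula step reduces to evaluating a bivariate normal CDF at the probit-transformed marginals, and the correlation parameter can then be estimated by maximizing the resulting pseudo-log-likelihood over the dataset (e.g., with a one-dimensional optimizer). The sketch below is an illustrative reimplementation under that standard construction, not the paper's code; Appendix A of the paper gives the exact form:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_from_copula(p_e, p_t, rho):
    """Joint cell probabilities of (E, T) from marginals via a Gaussian copula."""
    p11 = multivariate_normal(mean=[0.0, 0.0],
                              cov=[[1.0, rho], [rho, 1.0]]
                              ).cdf([norm.ppf(p_e), norm.ppf(p_t)])
    p10, p01 = p_e - p11, p_t - p11
    p00 = 1.0 - p_e - p_t + p11
    return p00, p10, p01, p11

def copula_loglik(rho, p_e, p_t, y_e, y_t):
    """Pseudo-log-likelihood of observed binary (E, T) pairs, marginals fixed."""
    ll = 0.0
    for pe, pt, e, t in zip(p_e, p_t, y_e, y_t):
        p00, p01, p10, p11 = joint_from_copula(pe, pt, rho)[0], \
            joint_from_copula(pe, pt, rho)[2], \
            joint_from_copula(pe, pt, rho)[1], \
            joint_from_copula(pe, pt, rho)[3]
        ll += np.log([p00, p01, p10, p11][2 * e + t])
    return ll
```

With ρ = 0 the copula reduces to independence (p_11 = p_E p_T), so the independence model is nested as a special case.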
Thus in this setting we use a post-estimation procedure in which we adjust p̂_00, p̂_10, p̂_01, p̂_11 to obtain p̂*_00, p̂*_10, p̂*_01, p̂*_11 so that the corresponding p_E and p_T are monotone in d for all x. Specifically, we use the Pool-Adjacent-Violators Algorithm (PAVA) isotonic transformation for the marginal distributions, 11 and then find a set (p̂*_00, p̂*_10, p̂*_01, p̂*_11) which is close to (p̂_00, p̂_10, p̂_01, p̂_11) and also satisfies the monotonicity constraint.
For every patient, using the model of categorical outcomes built on the dataset, we have predictions of (p̂_00, p̂_10, p̂_01, p̂_11) for all the dose levels, and thus p̂_E and p̂_T. We then apply PAVA to p̂_E and p̂_T to get p̂*_E and p̂*_T, which are non-decreasing in dose across the feasible dose range. At each dose, the joint probability p̂*_11 has a range of [max(0, p̂*_E + p̂*_T − 1), min(p̂*_E, p̂*_T)]. One way to perform the adjustment is to minimize |p̂*_11 − p̂_11| within the allowed range to get p̂*_11; then p̂*_10, p̂*_01, p̂*_00 can be calculated sequentially from the known p̂*_11, p̂*_E, p̂*_T. Another way is to minimize |p̂*_11 − p̂_11| + |p̂*_10 − p̂_10| + |p̂*_01 − p̂_01| + |p̂*_00 − p̂_00|, which involves all the joint probabilities.
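The first adjustment strategy has two steps: PAVA on each estimated marginal, then projecting p̂_11 into the Fréchet bounds implied by the adjusted marginals (minimizing |p̂*_11 − p̂_11| over an interval is just clipping). A sketch, assuming predictions are ordered by increasing dose (the helper names are ours):

```python
import numpy as np

def pava(y):
    """Pool-Adjacent-Violators: least-squares non-decreasing fit to y."""
    means, sizes = [], []
    for v in y:
        means.append(float(v)); sizes.append(1)
        # pool adjacent blocks while they violate monotonicity
        while len(means) > 1 and means[-2] > means[-1]:
            m, s = means.pop(), sizes.pop()
            means[-1] = (means[-1] * sizes[-1] + m * s) / (sizes[-1] + s)
            sizes[-1] += s
    return np.repeat(means, sizes)

def adjust_joint(p_e, p_t, p11):
    """Monotone marginals via PAVA, then clip p11 into the Frechet bounds."""
    p_e_star, p_t_star = pava(p_e), pava(p_t)
    lo = np.maximum(0.0, p_e_star + p_t_star - 1.0)
    hi = np.minimum(p_e_star, p_t_star)
    p11_star = np.clip(p11, lo, hi)          # closest point in the allowed range
    p10_star = p_e_star - p11_star
    p01_star = p_t_star - p11_star
    p00_star = 1.0 - p_e_star - p_t_star + p11_star
    return p00_star, p10_star, p01_star, p11_star
```

The Fréchet bounds guarantee that the four adjusted cell probabilities are non-negative and sum to one at every dose.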

Gaussian process
As a second approach to estimating the marginal distributions, we use a Gaussian process (GP) approach. Specifically, we model the outcome y as a distorted version of a process f, where f is a non-linear latent function with a Gaussian process prior with mean 0 and covariance function K. For example, a continuous outcome might use a normal distribution around f, while a binary outcome would use a logistic or probit link with f. With the Gaussian prior, the posterior of f is given by p(f | y) ∝ p(y | f) p(f), where K is a kernel function that represents the distance between subjects in the covariate space denoted by z = (d, x).
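One common form for such a covariance function is a squared-exponential (ARD) kernel with per-dimension relevance weights, K_ij = σ^2 exp(−Σ_q ρ_q (z_iq − z_jq)^2). The parameterization below is an illustrative sketch and may differ from the paper's exact form:

```python
import numpy as np

def ard_kernel(Z, sigma2, rho):
    """Squared-exponential kernel with per-dimension relevance weights.

    Z: (N, Q+1) array of inputs z = (d, x); sigma2: signal variance;
    rho: length-(Q+1) non-negative weights (rho[q] near 0 removes input q).
    """
    diff = Z[:, None, :] - Z[None, :, :]          # (N, N, Q+1) pairwise differences
    return sigma2 * np.exp(-np.einsum('ijq,q->ij', diff ** 2, np.asarray(rho)))
```

With `rho[1] = 0` the second input is ignored entirely, which is the mechanism that automatic relevance determination exploits: shrinking a weight toward zero effectively removes that covariate from the inference.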
Multiple kernel functions could be used; we use a Gaussian kernel with each element of K defined by

K_ij = σ^2 exp(−Σ_{q=1}^{Q+1} ρ_q (z_iq − z_jq)^2),

for subjects i and j = 1, … , N and dimension q = 1, … , Q + 1. σ^2 is called the signal variance or the magnitude, and the ρ_q govern the characteristic length-scales of the input space for each dimension of z. This covariance function implements automatic relevance determination (ARD), 12 since ρ_q determines how relevant an input is: if ρ_q is small, the covariance becomes almost independent of that input, effectively removing it from the inference. The ARD can be used to select the dimensions of z related to the outcome, especially with high-dimensional or sparse data. We assign priors to σ and the ρ_q and estimate their posterior distributions from the data.

For each subject, the marginal probabilities are p_E(d, x) = Φ(a + f_E(z)) and p_T(d, x) = Φ(b + f_T(z)); here we use a probit link to connect f to the outcome. Because the f's are centered around 0, intercepts a and b are used for the unbalanced outcomes. The joint probability p_11 can then be obtained from the Gaussian copula applied to p_E and p_T or, alternatively, when assuming independence of efficacy and toxicity, as p_11 = p_E p_T; the remaining joint probabilities are p_10 = p_E − p_11, p_01 = p_T − p_11, and p_00 = 1 − p_E − p_T + p_11, respectively. The outcome (E, T) for each subject follows Multinomial(p_00, p_10, p_01, p_11), from which the posterior is obtained. We use a Bayesian approach to estimate (σ_E, ρ_qE, σ_T, ρ_qT, a, b, ρ; q = 1, … , Q + 1) jointly with prior distributions. rstan was used to implement the Bayesian analysis of the Gaussian process since it is more efficient and explores complicated posterior distributions better compared to rjags. 13

For the Gaussian process, there are no methods to ensure monotonicity of predictions with respect to dose across the entire covariate space. Rather, virtual data points can be included and the derivative of the Gaussian process (with respect to dose) at those points can be encouraged to be non-negative during estimation. 14

SIMULATION STUDIES

Settings and scenarios
To evaluate the performance of the different methods, we conducted a simulation study. We consider 11 scenarios by varying the true parameter coefficients, the functional form (eg, exponential, binary, or linear), the number of biomarkers, and the sample size. For each scenario and each method of analysis and definition of optimal dose, we fit the models using the training dataset and evaluated their performance under the true models, which are known only in this simulation setting.
In scenario 1, we generated N = 200 observations in a training dataset by simulating 5 i.i.d. covariates x_1, … , x_5 from a standard normal distribution, d from Uniform(−1, 1), and binary outcomes E and T from the regression models described below, with a copula to link them. The marginal probabilities are defined by models with covariates x, dose d, and dose-covariate interactions dx. In generating the x's, we also applied the constraints that x must satisfy 1 + 0.23x_1 + 0.61x_2 + 1.69x_4 + 0.5x_5 > 0 and 1 + 0.03x_1 + 0.6x_2 − 0.42x_4 + 1.04x_5 > 0, to reflect the non-decreasing dose-efficacy and dose-toxicity curves. This excludes up to 40% of initially simulated observations. The other scenarios considered are shown in Table 3. In all scenarios the true model coefficients were chosen so that the proportion of observations with E = 1 was about 50%-70% and with T = 1 was about 10%-30%.
The simulation consists of the following steps, iterated 1000 times. (1) Generate a training dataset of the specified size (N = 200 or 400). (2) Fit models for efficacy and toxicity on the training data. (3) Generate a validation dataset following the same approach as for the training data. (4) Calculate the estimated optimal dose for each patient in the validation data using the models estimated from the training data and a particular method for selecting the optimal dose based on the models. Also calculate the true efficacy and toxicity probabilities for each patient, conditional on the estimated optimal dose, using the true models. (5) Calculate the average and SD of the estimated optimal dose values, the average probability of efficacy, the average probability of toxicity, the average value of the true utility function, and the average improvement in utility, defined as the fraction of the difference between the expectation of the utility from the true model and the expectation of the utility from using a fixed dose. The metrics in step (5) are then averaged over the 1000 simulations to provide a comparison of the various methods for model building and optimal dose selection across the scenarios.

We consider several variations of random forest modeling. To allow for correlated efficacy and toxicity outcomes we take three approaches: using a copula to link the marginal models (M2), assuming conditional independence (M3), and directly modeling the 4-level bivariate outcome (M8). We also implement a version of random forest in which the dose effects on efficacy and toxicity are constrained to be non-negative for all patients and compare these results to unconstrained estimation (M4, M5). When using random forest to model the 4-level categorical outcome, we also assess the impact of post-estimation use of the PAVA algorithm to enforce monotonicity (M9).
For the Gaussian process, we compared different priors for the ρ_q, including an inverse gamma prior, a Laplace prior, and a horseshoe prior. They have similar performance with a low dimension of covariates, but the horseshoe prior is better in high dimensions as it allows variable selection. Specifically, ρ_q ∼ N(0, τ^2 ω_q^2) for q = 1, … , Q, where ω_q ∼ Cauchy+(0, 1) and τ ∼ Cauchy+(0, 1). The global parameter τ pulls all the weights globally towards zero, while the thick half-Cauchy tails of the local scales ω_q allow some of the weights to escape the shrinkage. With large τ, all the ρ_q have diffuse priors with very little shrinkage toward 0, while small τ shrinks all the ρ_q to 0. All the other parameters have standard normal priors. For the Gaussian process, we tried both using a copula and assuming independence to link the two marginals. The results were very similar, so we only include those corresponding to the independence assumption (M10, M11). Although we did investigate imposing monotonicity constraints in Gaussian process modeling through the use of virtual data points, we did not include these methods in our simulation because they are time consuming and in limited simulations appeared to have little effect on the results.
For comparison, a multinomial logistic regression with 4-level categorical outcome is built on dose only (M12), and the selected optimal dose is adopted as d fix in M7 and M10. All the methods considered are summarized in Table 4.
The utility matrix plays an important role in optimal dose selection, so for each scenario we consider three utility matrices. The first places higher utility on the outcome of both efficacy and toxicity compared to neither efficacy nor toxicity (λ_1 = 0.3, λ_2 = 0.5). The second utility matrix weights these two possible outcomes equally (λ_1 = λ_2 = 0.5); in this setting the correlation plays no role, as it cancels out of the utility function, which reduces to a weighted difference of p_E and p_T. In the third utility matrix, the outcome of no efficacy and no toxicity is preferred to the outcome of both (λ_1 = 0.5, λ_2 = 0.3).

[Table 4 summarizes all the methods considered (M1-M12). Note: RF denotes random forest, GP denotes Gaussian process.]

Figure 2 shows the distributions of the selected optimal doses; the size of the point corresponds to the number of patients near the optimal fixed dose. There is little difference between methods using copulas to allow for correlation and assuming independence (M2 vs M3, M4 vs M5), suggesting that use of the copula may affect the estimation of utilities, as shown in Figure 1, but does not have a large impact on the distribution of selected doses. As expected, method M6 tends to select lower doses than method M4 due to the individual toxicity limit in the utility function. For method M7, the majority of the patients have optimal doses around the fixed dose because of the dose penalty in the utility function. For the random forest with categorical outcomes (M8, M9), adding PAVA to the predicted probabilities affects the selection of optimal dose. For method M10, which uses a Gaussian process and maximizes the probability that the utility is higher than at the fixed dose, the majority of the patients have optimal doses around the fixed dose. For method M11, which uses a Gaussian process and maximizes the posterior mean of the utility function, the distribution of the optimal dose is most similar to the distribution from the true model.

Table 5 provides mean values for dose, efficacy, and toxicity across methods. It also provides the average (across simulated datasets) SD of dose and mean values (across simulated datasets) of the average utility (across patients within a dataset) under scenario 1. To compare the various methods on a common scale, we calculate the percent of the possible increase in expected utility relative to fixed dose, denoted by %IPŪ. Thus by definition the fixed dose approach has 0% improvement and the true models have 100% improvement, with other methods generally falling in between these values.
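The %IPŪ metric is simply the achieved fraction of the gap between the expected utility under fixed dosing and under the true-model optimum; as a minimal sketch (function name ours):

```python
def pct_improvement(u_method, u_fixed, u_true):
    """%IP-U: percent of the possible utility gain over fixed dosing achieved.

    By construction fixed dosing scores 0% and the true model scores 100%;
    a method can score below 0% if it does worse than fixed dosing.
    """
    return 100.0 * (u_method - u_fixed) / (u_true - u_fixed)
```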
When using utility 1, λ_1 < λ_2, so on average the optimal dose is higher and, as a result, average efficacy and average toxicity are both higher. Under utility 3, λ_1 > λ_2, the average optimal dose is lower, and average efficacy and average toxicity are lower. Under this scenario, within each utility function, use of the true models results in the highest average expected utility, and the fixed dose results in the lowest average expected utility. The Gaussian process methods (M10, M11) have the highest mean utility values, with the other methods having generally similar performance varying somewhat across utility matrices.

In practice, it is possible that none of the considered covariates (eg, a biomarker panel) have any true association with the patient outcomes. To mimic this setting, we also consider a null scenario (S0) in which the true E and T models have only d, without covariates x, in them. Thus the dose effects are the same across all patients. As shown in Figure 3, the dose-only model achieves utility values nearly as high as is possible using the true model. All other modeling approaches, which consider other covariates, generally have poorer utility values than fixed dosing in this setting.
The comparison of expected utility under all scenarios is shown in Figure 3 and in the appendix, Figures B1 and B2. Table 6 shows the comparison of the percentage of expected utility improvement under all scenarios with utility 1. In scenario 2, when there are only dose main effects and no dose-covariate interactions, the main effects of the covariates still play a role in optimal dose selection, and the Gaussian process which maximizes the posterior mean of the utility function (M11) is better than the other methods. In scenario 3, when covariates are correlated, the Gaussian process methods (M11, M10) outperform the other methods. In scenarios 4 and 5, with an increased number of noise covariates, the ARD feature of the Gaussian process and the horseshoe priors help in model building, and the Gaussian process methods have better performance. As in Figure 2, the optimal dose distribution of the Gaussian process which maximizes the posterior mean of the utility function (M11) is most similar to the true model (M1), and it also has the highest percent improvement in utility. Using Gaussian process modeling and selecting the dose to maximize the posterior probability of higher utility than associated with fixed dosing (M10) tends to put many patients near the fixed dose, but still improves the mean utility compared to fixed dosing. Besides the above scenarios, we also considered smaller sample sizes and a lower efficacy rate of about 20%-30%, and the simulations show that the GP maximizing the expectation of utility (M11) is still the best (results not shown).

Parametric vs non-parametric models
In the above simulation, we compared different non-parametric methods and their performance in optimal dose selection.
In this section, we compare a subset of the non-parametric methods with parametric models such as logistic regression (M13), LASSO (M14), and constrained LASSO (CLASSO, M15) proposed by Li et al. All the parametric models are built on dose d, covariates x, and dose-covariate interactions dx for the marginals, and we use a copula to link them. We also explore multinomial logistic regression on the 4 categorical outcomes with dose, covariates, and interactions (M16), and add PAVA for the non-decreasing relationship (M17). The results are shown in Table 7. When the parametric models are correctly specified, they are more efficient and give higher utility values than the non-parametric models, as shown by S1, S4, and S7. Furthermore, we considered situations where the complexity of the true model increases and the estimation model is mis-specified. The true efficacy and toxicity models can have more interactions, such as covariate-covariate interactions x_i x_j and dose-covariate-covariate interactions d x_i x_j, and the dose effect can be g(d), where g(.) is a non-linear function. The efficacy model has log(d + 1) instead of d, so the efficacy increases faster at lower doses and tends to level off. Similarly, the toxicity model has exp(d) instead of d, so the toxicity increases faster at high doses. In the limited mis-specified scenarios, described in detail in Table B1, parametric models are quite robust against unspecified interactions, as under S8 and MS1, and non-parametric models have comparable performance. On the other hand, with non-linear dose effects, non-parametric models outperform the parametric models, as shown by MS2-MS4.

APPLICATION
In this section, we applied the proposed methods to a dataset of liver cancer patients treated with adaptive stereotactic body radiotherapy (SBRT). Patients in this study received tumor doses ranging from 35.7 to 180 Gy, with variation partially due to the clinical trial on which they were enrolled and varying preferences of the treating clinician, as well as patient factors such as cancer stage, tumor location, and performance status. In radiation oncology, the dose to the tumor site (the most relevant dose for predicting efficacy) is different from the dose received by the normal liver tissue (most relevant for predicting liver toxicity). The ratio of these two doses can be assumed to be constant for an individual patient but varies between patients due to tumor volume and location within the liver. We apply our methods to estimate the optimal tumor dose and calculate the implied normal liver dose for each patient using this fixed ratio. Two patients are missing liver dose and 6 are missing pre-treatment AFP. Single imputation was used to fill in these missing values using the other covariates and outcomes.
In this setting, achieving efficacy is less important than avoiding toxicity, so in the utility matrix we set the utility of the (E = 0, T = 0) outcome to 0.6 and the utility of the (E = 1, T = 1) outcome to 0.4, making an outcome of neither efficacy nor toxicity preferable to an outcome of both efficacy and toxicity. For the random forest on the marginal distributions, we include tumor dose, tumor size GTV, and pre-treatment AFP in the efficacy model, and liver dose, pre-treatment ALBI, tumor size GTV, and prior liver-directed therapy in the toxicity model. For the random forest on categorical outcomes and the Gaussian process, we include tumor dose and the ratio of liver dose to tumor dose as well as all clinical features. For the fixed dosing method, we build a multinomial logistic regression model on tumor dose and the ratio of liver dose to tumor dose.
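To make the utility calculation concrete, here is a minimal sketch of expected-utility dose selection. Only the 0.6 and 0.4 entries come from the text; the corner values 1.0 and 0.0, the toy dose grid, the marginal probabilities, and the independence assumption for the joint cells are illustrative assumptions.

```python
import numpy as np

# Hypothetical utility matrix U[e, t] for the four bivariate outcomes:
# efficacy without toxicity is best (1.0), toxicity without efficacy is
# worst (0.0); the 0.6 and 0.4 entries mirror the choices above.
U = np.array([[0.6, 0.0],    # E=0: (T=0), (T=1)
              [1.0, 0.4]])   # E=1: (T=0), (T=1)

def expected_utility(p00, p01, p10, p11):
    """Expected utility given the four joint outcome probabilities."""
    return U[0, 0]*p00 + U[0, 1]*p01 + U[1, 0]*p10 + U[1, 1]*p11

# Toy marginal predictions on a 3-point dose grid; independence is
# assumed here purely for illustration (a copula could supply the cells).
dose_grid = np.array([1.0, 2.0, 3.0])
p_e = np.array([0.2, 0.6, 0.8])
p_t = np.array([0.05, 0.1, 0.5])
eu = expected_utility((1-p_e)*(1-p_t), (1-p_e)*p_t, p_e*(1-p_t), p_e*p_t)
d_opt = dose_grid[np.argmax(eu)]
```

In this toy example the middle dose wins: the highest dose gains efficacy but loses more utility to toxicity, while the lowest dose forgoes too much efficacy.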
Due to variation in the ratio of liver dose to tumor dose, the dose-only model (M12) estimates different optimal doses for each patient. It also selects, on average, the highest individualized optimal tumor dose, as shown in Figure 4. Similarly, M7 incorporates the dose penalty towards the median dose selected by M12 and exhibits the least variability in selected dose values, with a high dose selected for the majority of patients. There is little difference between assuming independence and using the copula to link the two marginals, as shown by M2 vs M3 and M4 vs M5. As expected, adding the individual-level toxicity penalty (M6) lowers the recommended doses. Use of the PAVA monotonicity algorithm on predictions from the random forest on the 4-category outcomes has some impact on the selected optimal dose (e.g., see medians for M8 vs M9). The Gaussian process methods recommend a wide range of doses to different patients, as shown by M10 and M11.
To illustrate how the proposed utility calculations work, we plot in Figure 5 the estimated probabilities of efficacy and toxicity and the corresponding utility functions vs dose for 3 selected patients. The top row (a) shows the predictions from random forest models for efficacy and toxicity linked with a copula (M2), and the bottom row (b) shows the same random forest models constrained to be monotone in dose and linked with a copula (M4). Without forcing monotonicity, each of these 3 patients has regions of dose where, as dose increases, the estimated probability of efficacy decreases, which is not biologically plausible. Imposing monotonicity results in believable estimates and also smooths the dose-efficacy curves somewhat. The bottom panel (b) of the figure shows optimal doses for methods M4, M6, and M7, which all utilize the same random forest models but differ in how they define the optimal dose from these models. For patients 1 and 2, imposing monotonicity on the dose-efficacy curve results in a similar optimal dose. Patient 3 represents a patient with a high baseline risk of toxicity, exceeding what would typically be expected from dose-only toxicity models. Because the probability of toxicity is higher than 0.3 even at the lowest dose, method M6 with the individual toxicity penalty recommends the minimal dose. A higher dose is recommended by M7, however, as efficacy increases while toxicity plateaus. Plots showing the other methods, including M8, M9, M10, and M11, for these three patients are included in Appendix Figure B3. The random forest for the categorical outcome results in non-monotone dose-efficacy and dose-toxicity relationships, and the PAVA adjustment of the curves results in a plateau. The Gaussian process has smoother dose-efficacy and dose-toxicity curves. Due to the limited sample size, the credible intervals of the estimates are wide, as shown by the shaded band around the posterior mean of the utility.
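The monotonicity constraint in (b) can be applied post hoc with the pool-adjacent-violators algorithm (PAVA); a minimal sketch using scikit-learn's IsotonicRegression, where the dose grid and the raw random-forest predictions are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

doses = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
# Hypothetical raw random-forest efficacy predictions with two
# biologically implausible dips as dose increases:
p_eff_raw = np.array([0.20, 0.35, 0.30, 0.50, 0.48, 0.60])

# IsotonicRegression runs PAVA: violating neighbors are pooled and
# replaced by their (weighted) average, yielding a non-decreasing curve.
iso = IsotonicRegression(increasing=True)
p_eff_mono = iso.fit_transform(doses, p_eff_raw)
```

Here the pair (0.35, 0.30) is pooled to 0.325 and (0.50, 0.48) to 0.49, producing the flat plateaus mentioned above for the PAVA-adjusted curves.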
The random forest method discretizes continuous variables, so its prediction curves are not as smooth as those from standard regression models. The trade-off, however, is that in contrast to a parametric model, it can approximate any smooth function of arbitrary shape given sufficient data and number of splits.

DISCUSSION
We have developed statistical methods to estimate individualized optimal dosing rules using utilities to make the trade-off between efficacy and toxicity quantitative and explicit via input from clinicians. In addition to quantifying the utility of each possible bivariate outcome, this approach requires models for the probability of each possible bivariate outcome as it depends on patient covariates and treatment dose. We have proposed the use of flexible machine learning methods such as random forest and Gaussian process to build these models for efficacy and toxicity. In our simulation studies, whether and how correlation between efficacy and toxicity was modeled had little impact on the selected dose or on patient outcomes. Constraining the estimated probabilities to be monotone in dose did provide modest improvements in outcome. Equally important, enforcing monotonicity yields believable results, which is important for developing and maintaining trust with non-statistical collaborators. Monotonicity in a single covariate can be encouraged by using virtual points in the Gaussian process approach, 14 or as a splitting criterion in random forest and gradient boosting trees. For BART, however, current software implementations only allow for monotonicity in all covariates (or none), 15 which is not appropriate for our setting.
In practice, the utility matrix can be pre-specified by clinicians either directly or indirectly through a series of questions. Simulations or hypothetical patient data may also be useful to illustrate estimation of the optimal dose, which may motivate modification of the intermediate utility values in further discussion with clinicians. Methods to obtain consensus utilities by summarizing questionnaires from a group of clinicians, such as the Delphi method, can also be used. 16 In future work, individual patient preferences for the efficacy and toxicity outcomes could also be incorporated to build an individualized utility matrix. One difficulty with such an approach is that it may be hard for patients to quantify a utility for outcomes they have never experienced. It is also possible to add penalties to the utility function to take into account the financial cost of the treatment, which in some situations will depend on dose.
The expected utility Ū(p(d, x), U), where U denotes the utility matrix, has been used to select the optimal dose for each patient. Given the observed data, a frequentist estimate of p(d, x) can be plugged into the utility function for optimization. Alternatively, Bayesian estimation gives the posterior distribution of p(d, x). Using the posterior mean of Ū(p(d, x), U), that is, the utility function averaged over the posterior distribution, is the standard Bayesian approach to optimal dose selection. However, the posterior mean of the utility function ignores the variance of the posterior distribution, which reflects the uncertainty of the estimation. In this article, we also propose to maximize the posterior probability that the utility function is higher than the utility function at a fixed dose. It is possible that the posterior mean of the utility function increases slightly with dose but also has a larger variance, so that there is little gain in the posterior probability of an improvement in the utility function. The posterior probability represents how confident we are about the improvement in the utility function, which can help clinicians in decision making. In the simulations, it is shown that using the posterior probability can help avoid assigning extreme doses to patients, and shrinks the individual optimal dose towards the population fixed dose.
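A minimal sketch of this posterior-probability criterion, using synthetic normal posterior draws of the utility (the dose grid, means, standard deviations, and choice of fixed dose are all hypothetical): the posterior mean rises with dose, but so does the posterior spread, and the probability criterion favors a middle dose rather than the highest one.

```python
import numpy as np

rng = np.random.default_rng(0)
dose_grid = np.linspace(40.0, 80.0, 5)

# Synthetic posterior draws of the utility at each candidate dose
# (rows = MCMC draws, columns = doses): the mean increases slowly
# with dose while the spread grows quickly.
means = np.array([0.60, 0.62, 0.64, 0.65, 0.66])
sds   = np.array([0.02, 0.03, 0.05, 0.09, 0.15])
draws = rng.normal(means, sds, size=(4000, 5))

fixed = 0  # index of the population-level fixed dose
# P(utility at dose d exceeds utility at the fixed dose), per dose:
prob_improve = (draws > draws[:, [fixed]]).mean(axis=0)
d_opt = dose_grid[np.argmax(prob_improve)]
```

Maximizing the posterior mean would pick the top of `means` (the highest dose), while `prob_improve` peaks at an intermediate dose because the large variance at high doses erodes confidence in the improvement.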
We considered a number of different machine learning methods as components of our procedure where predictions were required. Non-parametric machine learning methods are flexible and make only weak assumptions about the underlying functional form, but they generally require a large sample size to estimate the functions and may suffer from overfitting. Thus the sample size, the number of covariates, and the expected complexity of the function should all play a role in determining which method is appropriate in any given setting. With smaller sample sizes, parametric machine learning methods such as neural networks, or classical methods such as logistic regression, could be appropriate. Other tree-based methods could also be considered, such as BART 17 and gradient boosting trees. 18 Using Bayesian estimation, Gaussian process and BART produce smooth predictions and have good performance in terms of AUC and RMSE 19 compared to random forest and gradient boosting. An additional benefit is the ability to quantify uncertainty through posterior distributions. The tree structure of random forest, BART, and gradient boosting trees means these methods should be well suited to handle the interaction of dose and covariates. On the other hand, the Gaussian process uses a kernel function to describe the relationship across dose and covariates and should work better with continuous variables such as dose.
The performance of the Gaussian process method depends critically on the prior distributions. In particular, when there are a large number of covariates, we need to select an appropriate prior for the weight given to each covariate in the kernel function. A Laplace prior, which is a Bayesian version of robust Lasso regression, did not perform well in our simulation study when we had a large number of covariates and sparsity. The spike-and-slab prior is appealing as it places a substantial point-mass on zero, but it can be computationally intensive with a large number of covariates. The horseshoe prior has been shown to have comparable performance to the spike-and-slab prior and is more computationally efficient because of its global parameter and local scales. 20 We adopted the horseshoe prior in our simulations and found it had good performance, even with a low number of covariates.
There are many opportunities for extending the proposed methods. For example, the outcomes defining the utility may be more complex than the single binary efficacy and toxicity outcomes considered in this article. Utility values can be assigned to ordinal categorical efficacy and toxicity outcomes and the 2×2 utility matrix expanded accordingly. Survival outcomes can also be broken down into time intervals and then used to assess the trade-offs between quality and quantity of life, as in quality-adjusted survival analysis. 21 The setting we consider in this article is one where dose can be thought of as continuous and the data have a large number of distinct dose values. The proposed methods have great potential for application in selecting the optimal dose in radiation oncology, and also more broadly in oncology to select an optimal dose of chemotherapy or immunotherapy. A continuous dose scheme minimizes information loss; the approach can also be applied in settings with a sufficiently large number of discrete dose levels and may yield similar results, 22 with the pre-specified dose level that maximizes the utility being selected. In other settings there may be a much smaller number of potential doses, say 2 or 3. For these, the utility framework is still useful, but different models for the probability of efficacy and toxicity may be preferred.
Different copula functions can be used. For this article a Gaussian copula is adopted, in which the correlation parameter links Φ−1(vE) and Φ−1(vT), with vE = 1 − pE and vT = 1 − pT. 23 With C denoting the copula, the joint cell probabilities follow from the margins; for example, p11(d, x) = C(1 − pE, 1 − pT) − (1 − pE) − (1 − pT) + 1. The marginal probabilities pE and pT can be specified by parametric models, and their parameters can be estimated jointly with the copula parameter in the log-likelihood. Two-step estimation methods can also be used, 24 where in the first step estimates are obtained for the parameters of the marginal models and are then held fixed in the second step, which consists of maximizing the likelihood to estimate the association parameter. Joe and Xu showed that the 2-step parameter estimates are consistent and asymptotically normally distributed.
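Assuming the convention in which the Gaussian copula is evaluated at 1 − pE and 1 − pT, so that C gives P(E = 0, T = 0), the four joint cell probabilities can be sketched with scipy as follows:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_cell_probs(p_e, p_t, rho):
    """Four joint (E, T) cell probabilities from marginals p_e, p_t
    under a Gaussian copula with correlation rho, assuming the
    convention P(E=0, T=0) = C(1 - p_e, 1 - p_t)."""
    # Gaussian copula: C(u, v) = Phi_2(Phi^{-1}(u), Phi^{-1}(v); rho)
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    p00 = bvn.cdf([norm.ppf(1.0 - p_e), norm.ppf(1.0 - p_t)])
    p01 = (1.0 - p_e) - p00            # E=0, T=1
    p10 = (1.0 - p_t) - p00            # E=1, T=0
    p11 = p_e + p_t - 1.0 + p00        # E=1, T=1
    return p00, p01, p10, p11
```

With rho = 0 the cells reduce to the independence products, and a positive rho raises the probability of jointly observing efficacy and toxicity, which is the effect the copula is meant to capture.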

TABLE B1 List of scenarios to compare parametric vs non-parametric models

S1: The true E & T models have x, d, dx
S4: S1 with 15 noise covariates x6, …, x20 added to the data
S7: S1 with binary covariates x
S8: The true models have x, xx, d, dx. Of the 5 covariates, coefficients are non-zero for 4 x, 1 xx, 4 dx
MS1: The true models have x, xx, d, dx, dxx. Of the 5 covariates, coefficients are non-zero for 1 x, 1 xx, 1 dx, 2 dxx
MS2: The true models have x, g(d), g(d)x. Of the 10 covariates, coefficients are non-zero for 10 x, 1 g(d)x
MS3: The true models have x, g(d), g(d)x, g(d)xx. Of the 5 covariates, coefficients are non-zero for 1 x, 1 g(d)x, 2 g(d)xx
MS4: The true models have x, xx, g(d), g(d)x, g(d)xx. Of the 5 covariates, coefficients are non-zero for 1 x, 1 xx, 1 g(d)x, 2 g(d)xx