Assessing the performance of different outcomes for tumor growth studies with animal models

Abstract The consistency of reporting results for patient-derived xenograft (PDX) studies is an area of concern. The PDX method commonly starts by implanting a derivative of a human tumor into a mouse, then comparing the tumor growth under different treatment conditions. Currently, a wide array of statistical methods (e.g., t-test, regression, chi-squared test) are used to analyze these data, which ultimately depend on the outcome chosen (e.g., tumor volume, relative growth, categorical growth). In this simulation study, we provide empirical evidence for the outcome selection process by comparing the performance of both commonly used outcomes and novel variations of common outcomes used in PDX studies. Data were simulated to mimic tumor growth under multiple scenarios, then each outcome of interest was evaluated for 10 000 iterations. Comparisons between different outcomes were made with respect to average bias, variance, type-1 error, and power. A total of 18 continuous, categorical, and time-to-event outcomes were evaluated, with ultimately 2 outcomes outperforming the others: final tumor volume and change in tumor volume from baseline. Notably, the novel variations of the tumor growth inhibition index (TGII), a commonly used outcome in PDX studies, were found to perform poorly in several scenarios, with inflated type-1 error rates and a relatively large bias. Finally, all outcomes of interest were applied to a real-world dataset.


| INTRODUCTION
A common way to assess the efficacy of an anticancer treatment is to analyze the tumor growth in patient-derived xenografts (PDX). [1][2][3][4] The PDX approach involves the direct implantation of human tissue samples into mice; the mice are then assigned to treatment groups, and tumor volumes are measured over multiple weeks. This direct implantation into mice may allow for more heterogeneity than laboratory-grown tissue samples, [1,2,5] and therefore the results may be more directly applicable to human patients.
A PDX study typically assigns many mice to each source of human tissue. For example, one study might have 5 different people contributing tumor samples that are then each allocated to 10 mice (all 10 receiving a distinct sample of that patient's tumor), resulting in a total of 50 mice. For the purpose of this study, we will define one person's set of mice as a PDX line. Continuing with the example, there would be 5 PDX lines with 10 mice each. Studies that use more than 1 PDX line have the potential benefit of adding more real-world variability (variability across patients) to the study, which may make the results of an antitumor treatment more generalizable.
When quantifying the effect of antitumor treatments, there are a multitude of approaches that a researcher might take. Although this variety allows the analysis to be tailored to each unique study, it may lead to inconsistent reporting of results across studies. One key focus when designing any study should be the selection of the outcome. In a PDX study, the typical question of interest is "is the treatment more effective at reducing tumor growth than a comparator treatment, and if so, by how much?". There are many potential outcomes that one might choose to answer this, so which outcome is the "optimal" choice? Ideally, the optimal outcome should be chosen primarily on the basis of statistical considerations, especially when there are multiple outcomes that appear to answer the question clinically. As evidenced by the heterogeneity in the approaches reported in PDX studies, [6][7][8][9][10][11][12][13] this field lacks a standard approach to outcome selection, which justifies the need to investigate an array of potential outcomes. The purpose of this study is to evaluate the performance of different continuous, categorical, and time-to-event outcomes with respect to bias, variance, power, and type-1 error, and to ultimately provide empirical reasoning for outcome selection in PDX studies.
From a statistician's standpoint, the most intuitive outcomes for PDX studies would be the raw tumor volume or the change in tumor volume from baseline. In a cross-sectional study, this would correspond to the final volume measured and the final difference in volumes (final − initial), respectively. If noncontinuous outcomes were of interest for a given study design, then a binary or categorical variable could easily be computed if biologically meaningful cutoff values for either the final volume or the final difference exist. Another more detailed noncontinuous outcome would be a time-to-event outcome, where we could compare summaries such as the median time (in days) until the tumors reach a certain threshold in size, if they do in fact reach that threshold.
Beyond the more intuitive outcomes in the prior paragraph, there exists a range of other possibilities that warrant exploration and rigorous empirical evaluation. First, since the final difference is intuitive, then the final ratio (final volume divided by initial volume) may be of utility as well. Second, the area under the curve (AUC) could be computed using the trapezoidal rule. There are 2 options for the AUC calculation: (1) use only the first and last measurements, or (2) use all timepoints since tumor size is generally evaluated on a set schedule over the course of the study. By using all timepoints, we may be able to capture more information about the tumor growth over the course of the study and potentially differentiate treatments that delay growth even if they have similar volumes at the end of the study. The tumor growth inhibition index (TGII) is commonly used as a summary in PDX studies for each PDX line and is usually calculated on the basis of the ratio of mean tumor volume in different treatment groups. However, if TGII were the outcome for a single PDX line, new definitions with respect to individual mice by treatment would need to be defined and evaluated. With this motivation based on use of TGII in PDX studies, novel variations of the TGII were proposed and evaluated for this article. Since TGII is a ratio, it seems statistically intuitive to also investigate this outcome as the relative difference in tumor growth (using subtraction rather than division), though the clinical interpretation may not be as straightforward. The proposed TGII measures are defined in Section 2.1.
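To make the two AUC options concrete, the following minimal Python sketch applies the trapezoidal rule to two tumors. The weekly measurement days and all volumes are hypothetical illustration values, and the function names `auc_all_times` and `auc_basic` are ours rather than notation from this article.

```python
def auc_all_times(days, volumes):
    """Trapezoidal AUC using every measured timepoint."""
    return sum(
        (d1 - d0) * (v0 + v1) / 2.0
        for d0, d1, v0, v1 in zip(days, days[1:], volumes, volumes[1:])
    )


def auc_basic(days, volumes):
    """Trapezoidal AUC using only the first and last measurements."""
    return (days[-1] - days[0]) * (volumes[0] + volumes[-1]) / 2.0


# Hypothetical tumors (mm^3) measured weekly; both start at 200 and end at 760,
# but the second one grows slowly at first.
days = [0, 7, 14, 21, 28]
steady = [200, 340, 480, 620, 760]
delayed = [200, 220, 260, 400, 760]

auc_steady_all = auc_all_times(days, steady)    # 13440.0
auc_delayed_all = auc_all_times(days, delayed)  # 9520.0
auc_steady_basic = auc_basic(days, steady)      # 13440.0
auc_delayed_basic = auc_basic(days, delayed)    # 13440.0
```

In this toy example, the two tumors are indistinguishable under the basic calculation, whereas the all-timepoints calculation separates the delayed-growth tumor from the steadily growing one.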
The remainder of this manuscript focuses on identifying the optimal outcomes for use within a single PDX study. Section 2 details the specific outcomes considered, details the simulation study, and introduces the real-world dataset of a PDX study. Results are summarized in Section 3 for both the simulation study and real-world results. Sections 4 and 5 include a brief discussion and conclusion.

| Outcomes of interest
The performance of 18 outcomes (15 continuous, 2 categorical, 1 time to event) was compared via simulations, asymptotic properties, and application to a real-world PDX study. Conceptually, we classify the 15 continuous outcomes as either "individual" or "relative" outcomes (Table 1). "Individual" continuous outcomes are applied to each tumor, whereas "relative" outcomes are applied to the treated group only because the control group values are already incorporated into the calculations. Figures 1 and 2 present schematics for the calculation of selected outcomes for individual and relative measures, respectively, and are described in greater detail after introducing the measures in the following paragraph.
Two "relative" continuous outcomes were used: (1) TGII, $\Delta^{(t)} / \Delta^{(c)}$, and (2) relative difference, $\Delta^{(t)} - \Delta^{(c)}$, where $\Delta$ is the difference in volume between the final and the initial tumor measurement, $t$ is the treatment group, and $c$ is the control group. We proposed and evaluated 5 novel estimators each for both TGII and relative difference (Table 1). Four of these novel estimators are applied directly to each treatment group tumor, whereas one is a group-level estimator. [...] (Table 1). The common denominator (TGII) or common difference (relative difference) estimator (4) compares each treatment group tumor with the average volume of the control group tumors. [...]

Figure 1C provides the estimated AUC (all times), which uses all timepoints, with an example calculation for one trapezoid between times $D_4$ and $D_5$. Figure 1D demonstrates the AUC (basic) calculation, which is simplified to ignore the data collected between start and end and, in this case, overestimates the AUC. The methods presented in Figure 1B-D would need to be repeated for each case included in the study.

Figure 2 presents how to calculate the relative difference and TGII for a selection of "relative" approaches. Figure 2A introduces the individual-level data to be used in the example, with 3 treatment cases and 3 control cases. Figure 2B demonstrates the calculations for the group-level outcomes, which take the average volume within each group at study start and finish using all 3 observations within each group. Figure 2C [...]

For noncontinuous outcomes, the binary outcome was dichotomized for tumors that grew to more than twice their initial volume. Additionally, a 4-category variable was evaluated following the RECIST criteria. [14] Lastly, the time-to-event outcome uses the number of days until reaching a doubling of tumor volume.
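As a rough illustration of the "relative" outcomes and the binary outcome described above, the sketch below computes the group-level and the common denominator/common difference estimators for 3 treated and 3 control tumors. All volumes are hypothetical. We also assume that "the average of the control group" refers to the average change in control volume; this is our reading of the text, not a formula taken from Table 1.

```python
from statistics import mean

# Hypothetical (initial, final) volumes in mm^3, three tumors per group.
treated = [(200, 480), (210, 520), (190, 500)]
control = [(205, 760), (195, 700), (200, 790)]


def deltas(group):
    """Final minus initial volume for each tumor in a group."""
    return [final - initial for initial, final in group]


d_t, d_c = deltas(treated), deltas(control)

# Group-level estimators: one value per study, built from group means.
tgii_group = mean(d_t) / mean(d_c)
reldiff_group = mean(d_t) - mean(d_c)

# Common denominator (TGII) / common difference (relative difference):
# one value per treated tumor, each compared with the control-group mean change.
tgii_common = [d / mean(d_c) for d in d_t]
reldiff_common = [d - mean(d_c) for d in d_t]

# Binary outcome: did the tumor grow to more than twice its initial volume?
doubled_treated = [final > 2 * initial for initial, final in treated]
doubled_control = [final > 2 * initial for initial, final in control]
```

With these values, the mean of the per-tumor common denominator TGII equals the group-level TGII, which is consistent with the equivalence of the two for estimating the mean. Every tumor in both groups doubles, so the binary outcome cannot separate the groups in this toy dataset.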

| Simulation parameters
The performance of the outcomes was compared using simulated tumor growth over time. Tumor volume was generated from a normal distribution [...] for a trend in tumor growth, $T$, and the starting tumor volume, $S$. The simulations were repeated under each combination of the following scenarios: sample size (small or large), signal strength (null, small, or large mean), and amount of noise (small or large variance). All starting volumes were simulated from a common volume, $S$, of 200 mm³ with a variance, $\sigma^2_S$, of 20 mm³. Small and large sample sizes were set at 10 and 20 tumors per treatment group, respectively. Control group tumor growth was set at 20 mm³ per day (e.g., if the initial volume is 200 mm³, then the final volume after 28 days equals 760 mm³). The small and large mean difference in growth rates between groups, $T^{(t)} - T^{(c)}$, or signal strength, was set at −5 and −10 mm³ per day, respectively (i.e., treatment group tumor growth was set at 15 mm³ and 10 mm³ per day). Variances for the growth trends, $\sigma^2_T$, were set at 1000 and 10 000 for small and large variances, respectively. For the k-nearest neighbor approaches, the closest 5 measures were used.
All simulations assume only one tumor per mouse. Data were simulated in R v3.6.0 (Vienna, Austria) across 10 000 simulated trials for each scenario.

TABLE 1 Summary of outcomes presented with mathematical formula; notation defined in footnote. Columns: Outcome type, Outcome, Notation. Abbreviations: AUC, area under the curve; TGII, tumor growth inhibition index.
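The sketch below shows one way a single simulated trial could be generated with the parameter values stated above. It is written in Python rather than R and is not the authors' code. Because the generating model is only partly described here, the linear trend, the weekly measurement schedule, and the choice to add the noise variance to each post-baseline measurement are all assumptions made for illustration.

```python
import random


def simulate_tumor(rate, noise_var, rng, start_mean=200.0, start_var=20.0,
                   days=(0, 7, 14, 21, 28)):
    """Return one simulated volume trajectory (mm^3) on the given days.

    Assumed model: volume = start + rate * day + Gaussian noise, where the
    noise is added to every measurement after baseline.
    """
    start = rng.gauss(start_mean, start_var ** 0.5)
    noise_sd = noise_var ** 0.5
    return [
        start + rate * d + (rng.gauss(0.0, noise_sd) if d > 0 else 0.0)
        for d in days
    ]


def simulate_group(n, rate, noise_var, rng):
    """Simulate n independent tumors (one tumor per mouse)."""
    return [simulate_tumor(rate, noise_var, rng) for _ in range(n)]


rng = random.Random(2023)
# Small sample size, small signal, small variance scenario.
control = simulate_group(10, 20.0, 1000.0, rng)
treated = simulate_group(10, 15.0, 1000.0, rng)

# With all variability switched off, the control trend reproduces the worked
# example in the text: 200 mm^3 at baseline and 760 mm^3 on day 28.
noise_free = simulate_tumor(20.0, 0.0, random.Random(0), start_var=0.0)
```

Repeating the two `simulate_group` calls 10 000 times per scenario and applying each outcome to every trial would mirror the overall structure of the simulation study.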

| Evaluation of outcome performance
Outcomes are evaluated with respect to their power, type-1 error rate, relative bias of the mean, and relative error of the variance.
Relative bias of the mean and relative error of the variance were [...]

For statistical testing, the continuous outcomes were compared via univariate linear regression. The "individual" continuous outcomes were compared between treatment groups, whereas the "relative" continuous outcomes used an intercept-only model after first subtracting 1 from each "relative" continuous outcome. This shift eases interpretation because the model then tests the null value 0 instead of 1 (for TGII, the null hypothesis is $\Delta^{(t)} / \Delta^{(c)} = 1$). The shift transformation was chosen because the log-transformation is not always mathematically tractable: TGII can be less than zero (i.e., if the tumor size decreases). Both categorical outcomes were tested using either a chi-squared test or a Fisher's exact test, depending on the resulting counts per category. The time-to-event outcome was evaluated with a Cox proportional hazards model (using the coxph function from the survival package in R).
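Testing the intercept of an intercept-only linear regression is equivalent to a one-sample t-test, so the shifted TGII test can be sketched in a few lines. The example below uses Python rather than R, the per-tumor TGII values are hypothetical, and the critical value 2.262 is the two-sided 5% cutoff of Student's t distribution with 9 degrees of freedom.

```python
from statistics import mean, stdev


def shifted_t_statistic(tgii_values):
    """t statistic for H0: mean(TGII) = 1, computed on the shifted values TGII - 1."""
    shifted = [v - 1.0 for v in tgii_values]
    n = len(shifted)
    standard_error = stdev(shifted) / n ** 0.5
    return mean(shifted) / standard_error


# Hypothetical per-tumor TGII values for 10 treated tumors.
tgii = [0.55, 0.62, 0.48, 0.71, 0.66, 0.52, 0.59, 0.80, 0.45, 0.62]

t_stat = shifted_t_statistic(tgii)
T_CRIT_DF9 = 2.262  # two-sided 5% critical value, Student's t with 9 df
reject_null = abs(t_stat) > T_CRIT_DF9
```

Here the shifted values average −0.4 and the statistic is far below −2.262, so the null hypothesis of equal growth in both groups would be rejected for these illustrative data.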

| Application to real-world data
The data used to illustrate and implement these approaches examined the effect of the combination of AZD1775 provided by AstraZeneca [...] or analysis if the initial starting volume was less than 70 mm³ or greater than 500 mm³. Given the small sample size within each treatment group, a simplifying assumption that all tumors were independent of each other was made for the real-world data.
The study was carried out in accordance with the National Institutes of Health (NIH) guidelines for the care and use of laboratory animals, and in a facility accredited by the American Association for Accreditation of Laboratory Animal Care. Approval from the University of Colorado Animal Care and Use Committee was obtained before the initiation of experiments. All mice were female athymic nude mice.

| Outcome evaluation
We highlight results from the simulation studies to identify the optimal outcomes based on each measure of performance, with Supporting Information presenting all results from the 8 simulation scenarios with respect to bias and variance (Table S1), type-1 error (Table S2), and power (Table S3). Further, to help visualize the comparisons of the outcomes, histograms of both the relative bias and the relative error of the variance are shown in Figures S1 and S2 (for a single scenario: small sample size, large mean difference, large variance).
Good performance (assessed across all 8 simulation scenarios) for bias and variance was defined as having less than 3% average relative error, with the 95% confidence interval maintaining a narrow width (<10%) while covering 0%. Good performance for the type-1 error rate was defined as <6%, where ideally the type-1 error should be 5% (the $\alpha$ level). Good performance for power was subjectively labeled for each scenario on the basis of the results observed across all 18 outcomes.
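The thresholds above translate directly into simple decision rules, sketched below in Python. The function names are ours, all quantities are in percent, and we assume that "less than 3% average relative error" refers to the absolute value of the average error.

```python
def good_bias_or_variance(mean_rel_error_pct, ci_low_pct, ci_high_pct):
    """Good if |average relative error| < 3%, CI width < 10%, and the CI covers 0%."""
    small_error = abs(mean_rel_error_pct) < 3.0
    narrow_ci = (ci_high_pct - ci_low_pct) < 10.0
    covers_zero = ci_low_pct <= 0.0 <= ci_high_pct
    return small_error and narrow_ci and covers_zero


def good_type1_error(rate_pct):
    """Good if the observed false-positive rate is below 6%."""
    return rate_pct < 6.0


def good_in_all_scenarios(flags):
    """An outcome is labeled good only if it is good in every scenario."""
    return all(flags)


# Hypothetical summaries for one outcome across 3 scenarios (not study results).
bias_flags = [
    good_bias_or_variance(0.4, -2.1, 2.9),
    good_bias_or_variance(-1.2, -4.0, 1.5),
    good_bias_or_variance(2.0, 0.5, 3.5),   # CI does not cover 0%
]
overall_bias_good = good_in_all_scenarios(bias_flags)
```

In this illustration the third scenario fails because its interval excludes 0%, so the outcome would not be labeled good overall.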
In Table 2, the 5 estimators for each of TGII and relative difference were grouped together, and the categorical outcomes were grouped together, owing to similar results across each set of outcomes.

| Results for bias
All continuous outcomes except for TGII had good performance (Table 2). The largest bias was found in the scenario with a small sample size, small mean difference, and a large variance, though measures of TGII also had large biases in other scenarios (Table S1). In general, the large-variance scenario showed more biased results. All 5 TGII estimators were more biased than the other continuous outcomes, where generally the random pairs estimator and the matched pairs #2 estimator were the worst of the TGII outcomes with respect to bias (Table S1). The TGII outcomes were the only outcomes to produce extreme outliers for the relative bias (Figures S1 and S2).
It should be noted that some estimators are equivalent for the estimation of the mean difference, as expected. Namely, the biases were equivalent for the following pairs: (1) the group-level TGII and the common denominator TGII; (2) the final difference and 3 of the relative difference estimators (group-level, random pairs, and common difference); (3) both matched pairs for relative difference.

| Results for variance
For "individual" continuous outcomes, the final volume, final difference, and AUC (basic) outcomes consistently had the lowest relative error for variance, with less than 1% relative error from the true variance on average for all scenarios. The AUC outcome using all timepoints had an inflated variance across all scenarios (overestimated by 4%-5% on average). The final ratio outcome underestimated the true variance in small-variance scenarios but overestimated the true variance in large-variance scenarios.
The relative difference outcomes that did not use the k-nearest neighbor [...] (Table S1).
Similarly, the group-level TGII estimator was generally unstable for large variances, but only overestimated for small variances by 3%-6% on average (Table S1). The best-performing TGII estimator for variance estimation was the common denominator, where it was also generally unstable for large variances, but only overestimated for small variances by 1%-3% on average (Table S1). Both the group-level and common denominator TGII estimators had narrower CIs and fewer occurrences where the variance estimate was >1000% error (Table S1).

| Results for type-1 error
Stable type-1 error rates were labeled as such for simulation results with <6% false positives across all scenarios. All "individual" continuous outcomes and the time-to-event outcome had stable type-1 [...]

Table 2 footnotes: Not evaluated. a Good (+) type-1 error rates were defined as less than 6% false positives (i.e., less than 1% inflated). Good power was defined separately for each scenario as being within 10% of the highest observed power, which was set from the outcomes with stable type-1 error rates. Both good bias and good variance were defined as having less than 3% average relative error, having the 95% confidence interval covering 0% error, and having the confidence interval spanning less than 10%. To be represented as good (+) in this table, the outcome must be "good" in all 8 simulated scenarios. Minus signs (−) denote "poor" results, which were defined as the opposite of good. The 2 categorical outcomes (binary and the 4-category RECIST) had type-1 error rates between 0% and 3% (expecting 5%), which may be a result of extremely poor power for these outcomes. c Summarized for all 5 estimators (more specific results can be found in Supporting Information Tables S1-S3).

It should be noted that some estimators have equivalent type-1 error results. The 2 "relative" estimators that use the control group mean as the relative value (i.e., common denominator and common difference) were equivalent. The 2 k-nearest neighbors relative difference estimators were also equivalent.
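The power rule in the Table 2 footnotes can be sketched as follows: the benchmark is the highest power among outcomes with stable type-1 error, and an outcome has good power if it falls within 10 of that benchmark. We assume here that "within 10%" means 10 percentage points. The outcome names and all numbers are hypothetical and are not results from this study.

```python
def good_power(power_pct, type1_pct, margin=10.0, type1_limit=6.0):
    """Label each outcome True if its power is within `margin` points of the benchmark.

    The benchmark is the highest power among outcomes whose type-1 error
    rate is below `type1_limit`.
    """
    stable = [name for name, rate in type1_pct.items() if rate < type1_limit]
    benchmark = max(power_pct[name] for name in stable)
    return {name: power >= benchmark - margin for name, power in power_pct.items()}


# Hypothetical values for a single scenario, in percent.
power = {"final volume": 80.0, "final ratio": 74.0, "binary": 12.0, "matched pairs": 90.0}
type1 = {"final volume": 5.1, "final ratio": 4.9, "binary": 1.0, "matched pairs": 9.5}

labels = good_power(power, type1)
```

Note that an outcome with inflated type-1 error (the hypothetical "matched pairs" entry) can still exceed the benchmark, which is why power is best read alongside the type-1 error results.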

| Results for power
The results for power are shown in decreasing order for sets of outcomes: (1) those with controlled type-1 error rates, and (2) those with inflated type-1 error rates. First, for those with controlled type-1 errors, the group-level TGII estimator had the highest power (Table S3), followed by the final volume, final difference, and AUC (basic). The next-highest power was seen for the final ratio and both the random pairs and group-level relative difference estimators, followed by the AUC with all timepoints. The worst power for continuous outcomes was observed for the random pairs TGII estimator.
The time-to-event outcome had less power than all continuous outcomes, but was still higher than the extremely low power observed for the 2 categorical outcomes (Table S3). Second, for those with inflated type-1 errors, the 2 "relative" estimators that use the control group mean as the relative value (i.e., common denominator and common difference) had the highest power. Both matched pairs estimators for relative difference had the next-highest power, followed by the matched pairs estimators for TGII.
Similar to the type-1 error results, it should be noted that some estimators are equivalent for the power results. The 2 "relative" estimators that use the reference group mean as the relative value (i.e., common denominator and common difference) were equivalent. The 2 k-nearest neighbors relative difference estimators were equivalent.

| Application to real-world data
Only one PDX line was used (Figure 3) [...] (Table 3).

| DISCUSSION
The simulation results and data application provide evidence that certain outcomes perform better with respect to statistical considerations such as the bias of the estimator, its variance, and the ability to detect a difference (power) or lack thereof (type-1 error).
If one outcome is to be selected, we would recommend prioritizing individual-level measures of the final volume or difference.
Alternatively, the use of a time-to-event measure may be advantageous given its performance in the simulation studies and the ability to estimate the median or mean time until an event occurs within each group. However, since follow-up measurements are set at fixed intervals, the day that a tumor reaches a certain threshold may not actually be observed, which reflects interval censoring. Thus, the time-to-event outcome may be an overestimate of the true time until the event, for example, when the tumor has the event at day 5 but it is not observed until its follow-up measurement on day 7.
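The day-5 versus day-7 example can be written as a small Python sketch. The weekly measurement schedule and the helper name `observed_event_day` are assumptions made for illustration.

```python
def observed_event_day(true_day, measurement_days):
    """Return the first scheduled measurement on or after the true event day.

    Returns None when the event falls after the last measurement, i.e., it is
    never observed during follow-up.
    """
    for day in sorted(measurement_days):
        if day >= true_day:
            return day
    return None


schedule = [0, 7, 14, 21, 28]  # assumed weekly follow-up

observed = observed_event_day(5, schedule)        # 7
overestimate = observed - 5                       # 2 days
exact = observed_event_day(14, schedule)          # 14
never_observed = observed_event_day(30, schedule)  # None
```

The recorded time is never earlier than the true time under this scheme, so any error from the fixed schedule pushes the time-to-event outcome upward.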
Generally, TGII is not recommended for use as a statistical outcome given its poor performance across these measures. However, if there is a strong desire to use TGII as the outcome, then the common denominator estimator (or the group-level estimator since these 2 are equivalent for estimation of the mean) would be the best option.
Additionally, the 2 categorical outcomes are further scrutinized because of the limited interpretation that these provide and the loss of information when using such outcomes. Sharma, Maitland, and Ratain previously discussed that using the RECIST categories as the primary outcome is inadequate in many scenarios. [17]

When considering unplanned missing data, the most common source comes from euthanized mice, which poses a problem for the more straightforward univariate analyses. Unfortunately, the common approach used in cross-sectional studies of imputing with the last observation carried forward (LOCF) is problematic. [18] Fortunately, we observe the cause of the missing data (i.e., we have the previous volume that crossed the threshold to euthanize), and there are analysis methods that adequately handle this type of missingness in a multivariable regression framework with maximum likelihood [...] where log(x/y) becomes log(x) − log(y). Certain situations may also render some outcomes useless for comparison, such as the binary outcome [...] concern. This is similarly reflected in the results of Oberg et al., who explored the use of linear mixed-effects models with complex correlation structures in the context of ovarian cancer PDX studies. [20] Additional work has examined the need for unified frameworks of PDX studies and their evaluation to better ensure replicability and optimal use of study data. [5]

Table 3 footnote: a First, the 5 estimates for the "individual" continuous outcomes are mean differences between treatment groups, where a positive estimate represents larger tumors in the treated group compared with the control. Second, the 5 estimates for TGII directly represent the estimand of interest, $\Delta^{(t)} / \Delta^{(c)}$; values greater than 1 represent larger tumor growth in the treated group compared with the control. Third, the 5 estimates for relative difference are mean differences between treatment groups, which is also the estimand of interest ($\Delta^{(t)} - \Delta^{(c)}$); positive values represent larger tumor growth in the treated group compared with the control. Fourth, estimates are not provided for the categorical outcomes because all tumors had the same fate for both treatment groups. Lastly, the estimate for the time-to-event outcome is a hazard ratio for the treatment group compared with the control group; a value less than 1 represents a lower hazard of the tumor doubling compared with the control.

AUTHOR CONTRIBUTIONS
LP and AK made substantial contributions to the conception and design of the work. All co-authors contributed to the drafting of the work and approval for final submission.

FUNDING INFORMATION
Alexander Kaizer was supported by NHLBI K01-HL151754.

CONFLICT OF INTEREST
The co-authors have no conflicts to disclose.