On the properties of the toxicity index and its statistical efficiency

Cancer clinical trials typically generate detailed patient toxicity data. The most common measure used to summarize patient toxicity is the maximum grade among all toxicities and it does not fully represent the toxicity burden experienced by patients. In this article, we study the mathematical and statistical properties of the toxicity index (TI), in an effort to address this deficiency. We introduce a total ordering, (T‐rank), that allows us to fully rank the patients according to how frequently they exhibit toxicities, and show that TI is the only measure that preserves the T‐rank among its competitors. Moreover, we propose a Poisson‐Limit model for sparse toxicity data. Under this model, we develop a general two‐sample test, which can be applied to any summary measure for detecting differences among two population of toxicity data. We derive the asymptotic power function of this class as well as the asymptotic relative efficiency (ARE) of the members of the class. We evaluate the ARE formula empirically and show that if the data are drawn from a random Poisson‐Limit model, the TI is more efficient, with high probability, than the maximum and the average summary measures. Finally, we evaluate our method on clinical trial toxicity data and show that TI has a higher power in detecting the differences in toxicity profile among treatments. The results of this article can be applied beyond toxicity modeling, to any problem where one observes a sparse array of scores on subjects and a ranking based on extreme scores is desirable.

The CTC has evolved over the last decades to the current and wide-spread common terminology criteria for adverse events, 3 which is organized in 26 system organ classes, such that each AE term is defined and classified into grades of severity denoted as 0, 1, … , 5 with 0 corresponding to no symptoms, 1 corresponding to a mild symptom, up to 5 indicating death. While CTCAE allows investigators to assess and document toxicities in a systematized manner, it also creates a challenge in summarizing a large amount of data at the patient level. Each patient can experience more than one AE term from different organ systems and different grades of the same AE term during several cycles of treatment resulting in a vector of integers, the toxicity profile, of varying length containing all toxicities grades per patient. In recognition of deficiencies in toxicity reporting, NCI launched a Cancer Moonshot funding opportunity 6 to accelerate research on improved approaches to evaluating the tolerability of cancer treatments.
Currently, the most used approach in phase III trials to summarize multiple toxicities per patient is the maximum-grade (max-grade). [7][8][9] Other approaches such as those of References 10-12 have been proposed in the literature to address the lack of representation of the toxicity burden experienced by patients when the max-grade method is applied. In particular, a longitudinal approach (Tox-T) 12 incorporating low-grade toxicities and their duration using the average as a summary measure was introduced. Even though Tox-T better represents the toxicity burden, the average grade is hard to interpret clinically by investigators. In the setting of dose finding in early phase cancer trials, many authors proposed statistical models and dose escalation designs that take into account all grades and types of toxicities with the goal of improving the safety and efficiency of the trial. [13][14][15][16][17][18] Some of these methods use multivariate models for characterizing the relationship between different grades of toxicities and dose while others, such as the Q-TWIST, 19 summarize the information from all toxicity grades into a continuous score. In general, under some scenarios for the location of the true maximum tolerated dose (MTD), a modest gain in the precision of the estimate of the MTD is achieved when including information from all toxicities relative to models that use a binary indicator of toxicity (a.k.a. dose limiting toxicity), without compromising the safety of the trial.
The toxicity index (TI), proposed in Reference 20, is a summary measure that preserves the highest grade while incorporating lower grade toxicities. The TI can avoid the loss of information and improve clinical interpretability. It was recently shown that TI is more powerful than other common toxicity summaries in detecting differences among treatments 21 in the National Surgical Adjuvant Breast and Bowel Project clinical trial (NSABP R0-4). 22,23 To quantify the information loss, power curves were empirically estimated by resampling the available data, showing the relative performance of different toxicity summaries in comparing treatments.
While the TI was introduced more than a decade ago, its mathematical and statistical properties have never been studied. In this article, we report on the novel characteristics of TI as a summary measure and its effectiveness in comparing treatments. We also develop a framework for modeling and ranking toxicity data, which as discussed below, is applicable beyond drug toxicity problems. Our contributions in this article are as follows: 1. We propose a framework for modeling the data in experiments where scores on multiple events are recorded for a collection of subjects. In particular, the model is applicable to n × d arrays of scores on n subject and d adverse events, where each entry is a score in {0, 1, … , K}. We refer to the elements of the latter set as grades. We consider sparse arrays where most of the scores are zero, and argue in favor of a Poisson-Limit model that simplifies downstream analysis. The model can be equivalently expressed in terms of the reduced frequency vectors (RFV) of the subjects where, for each subject, one records the frequency of each observed grade except zero. 2. We introduce a total ordering, referred to as the T-order, on the space of RFVs, that allows for comparing them based on how frequently they exhibit extreme grades. Since the T-order is a total order (ie, every pair of vectors are comparable), it induces a full ranking among RFVs which we call the T-rank. This, for example, allows one to fully rank subjects (or treatments), strictly and without ties, except when the RFV vectors are exactly equal. The T-rank, in particular, is relevant to drug toxicity trials where one wants to emphasize differences in extreme toxicity. The ranking is also useful in any application where variations in the extreme scores among subjects or treatments are of concern. 3. We consider two-sample testing based on score data coming from two different Poisson-Limit models. We propose a general test that compares the two treatments by looking at the difference in mean between values of a summary measure g(⋅) applied to each subject in each treatment. We study three candidates for g(⋅): the TI, the average, and the maximum score. We derive exact expressions for the mean and variance of the test statistic under the Poisson-Limit model, allowing us to analytically set the critical region of the test with guarantees on the asymptotic significance level. 4. Our result also provides an approximate analytical formula for the power function for each test, as well as an exact expression for the slope of the test as defined by Vaart 24(chapter 14) . The slopes allow us to analytically calculate the asymptotic relative efficiency (ARE) of the three summary measures with respect to each other. We evaluate these ARE expressions empirically, showing that in the majority of the cases, TI is more efficient than the maximum and much more efficient than the average. 5. We demonstrate our theoretical results on the two-sample test by fitting the Poisson-Limit model to real data from a clinical trial and evaluating the power function for tests based on the three summary measures. The resulting plots complement those of Reference 21 which are obtained based on resampling the data, and confirm the superiority of TI in detecting the differences in toxicity profile among drugs.
The organization of the rest of the article is as follows: Section 2 provides background on the toxicity data and the summary measures. We then introduce the Poisson-Limit model for such data. In Section 3, we derive mathematical properties of TI, including monotonicity with respect to the T-rank. Deficiencies of competing summary measures in preserving the ranks are also discussed here. Section 4 develops our two-sample test and presents analytical results on the power function and test slopes. The methodology is illustrated with simulations in Section 5 and conclude with a discussion in Section 7.
Notation. We use Z for the set of integers, R + for the set of nonnegative real numbers, and Z + for the set of nonnegative integers. We write [K] = {1, … , K} and [K] * = {0, 1, … , K}. The indicator of a set A is denoted as 1{x ∈ A}, evaluating to 1 if x ∈ A and 0 otherwise.

DATA, MODELS, AND SUMMARY MEASURES
Assume that we have a collection of d AEs that we represent as {1, … , d}. For each subject in a given period of time (one cycle or multiple cycles combined), we observe a toxicity profile that can be viewed as a vector where Y i represents the toxicity grade of AE i. Generally, we assume that there is a maximum possible grade K, so that we can assume Y ∈ [K] d * , where we recall that [K] * = {0, 1 … , K}. We assume that the corresponding grades for different AEs are equivalent, that is, grade 1 in Y 1 is equivalent to grade 1 in Y 2 , and so on. The toxicity index is an example of a summary measure that maps Y to a scalar grade. More precisely, let Y (1) ≥ Y (2) ≥ · · · ≥ Y (d) be the order statistic of Y . The toxicity index was originally defined as a function ∶ Z d + → R + given by: 20 Besides the toxicity index, there are other summary measures, which are often defined in terms of the frequency vector of the grades, that is, X * = (X 0 , X 1 , X 2 , … , X K ) where X r = ∑ d j=1 1{Y j = r}. Since we are mainly interested in sparse toxicity profiles, level 0 has a special status. We will work with the reduced frequency vector X = (X 1 , … , X K ) that only retains the frequency of nonzero levels. Let us write X + = ∑ K r=1 X r and note that X 0 = d − X + , that is, there is no loss of information working with X instead of X * .
Common summary measures can be stated in terms of the reduced frequency vector X. The following two examples are of particular interest: 1. The mean index which can be represented as 2. The maximum index which is given by where we interpret the maximum of the empty set as 0.
Throughout, maximum index, mx, and max-grade will be used interchangeably.

The Poisson-Limit model
where X + = ∑ K r=1 X r and p + = ∑ K r=1 p r . We are interested in the cases where the toxicity profile is sparse, that is, many of {Y j } are often zero. In those cases, it is reasonable to assume that p r = r ∕d for r ∈ [K]. Letting = ( 1 , … , K ), our model is equivalent to As shown in the Appendix (Section A0.1), X converges in distribution to a product of Poisson distributions, that is, We will use this limiting distribution in our statistical analysis. It is a good approximation for large sparse toxicity profiles and allows us to derive explicit analytical expressions for various statistical quantities of interest. We refer to (5) as the Poisson-Limit model. Remark 1. Even when the distributions of the AEs are allowed to be different, that is, P(Y j = r) = q jr with q jr potentially varying with j, we still have a Poisson-Binomial distribution for each marginal (ie, the distribution of ∑ d j=1 1{Y j = r} for a given r), which can be approximated by a Poisson distribution in the sparse case.

TOXICITY INDEX PRESERVES T-RANK
In this section, We derive some analytical properties of the toxicity index. We first show how the toxicity index can be computed based only on the reduced frequency vector (RFV) of the observations. We define a total ordering, referred to as the T-rank, on the space of RFVs and show that the toxicity index preserves this ordering in a strict sense. A consequence of this result is that the toxicity index is a one-to-one mapping on the RFV space, hence there is no loss of information when summarizing the toxicity profile this way.

Closed-form representation
Recall that Y ∈ [K] d * represents the toxicity profile of a patient. Consider the truncated toxicity index: which only considers toxicities of grades k or lower. We have The following proposition provides a recursive formula for k (⋅) in terms of x: is the reduced frequency vector of Y and 0 (⋅) ∶= 0.
It follows from Proposition 1 that the toxicity index , defined in (1), has the following closed form in terms of the frequency vector: Letting g k (x) ∶= 1 − g k (x), we can write TI(⋅) alternatively as follows: , one can also rewrite (6) as which could be faster for computations. In the sequel, we also refer to TI(⋅) as the toxicity index, based on the equivalence just established.

Monotonicity and T-rank preservation
We now introduce a total order on the space of reduced frequency vectors (RFVs). In fact, the total order can be defined over all real-valued K-dimensional vectors: In words, for any two vectors x and y that are not equal, we let i(x, y) be the largest grade (last position) on which they differ and say that x ≻ y if the coordinate of x at i(x, y) is larger than the corresponding coordinate of y. We write x ≽ y if either x = y or x ≻ y, and similarly for x ≼ y.
The T-order, defined by ≽, is a total order on R K , that is, we can compare any pair of vectors in R K . To see this, it is enough to verify the transitivity property: Recall that the toxicity index can be written as a function TI(⋅) of the reduced frequency vector; see (6). The toxicity index TI(⋅) is (strictly) increasing on Z K + , with respect to the T-order: . Since the T-order is a total order, it provides a ranking of all elements of Z K + . We refer to this ranking as the T-rank. Theorem 1 shows that the toxicity index preserves the T-rank. Theorem 1 also shows that TI(⋅) is injective (ie, one-to-one) on Z K + . The significance of the T-order, and the associated T-rank, is that it is sensitive to the highest grade on which the two RFVs x and y differ. This is the desired way to compare RFVs derived from toxicity profiles, since the highest grade toxicities are the most important. The two other common summary measures, namely, mx(⋅) and avg(⋅) do not preserve the T-rank: The max-grade mx(x) is a nondecreasing function w.r.t. the T-rank but is not one-to-one, that is, we can have x ≻ y but g(x) = g(y). The average function avg(x) is neither monotone w.r.t. to the T-rank nor one-to-one. For example, consider K = 2, and take x = (6, 2) and y = (2, 1). Then, x ≻ y but avg(x) = 1.25 < avg(y) = 1.33. It is also easy to find cases where x ≻ y and avg(x) > avg(y) or avg(x) = avg(y), for example, take x = (2, 2) or x = (4, 2), respectively. Figure 1 illustrates the TI vs F I G U R E 1 TI preserves the T-rank while the mx and avg do not. The x-axis is an array, with 37 patients with each column representing a RVF for an individual, ordered in increasing T-rank. Each cell in the array is the number of AEs experienced at a particular grade for a given cycle of treatment. The TI exhibits strict monotonicity with respect to this ordering while the mx is nondecreasing and avg is neither monotone nor nondecreasing [Colour figure can be viewed at wileyonlinelibrary.com] mx and avg as functions of some sorted vectors based on T-rank showing that TI is both monotone and one-to-one unlike the mx and avg and thus preserves ranking.
Intuitively, the T-rank is a natural way to order patients according to their toxicity grades. We look at the largest grade first, and if the frequency of that grade is higher in one patient we rank their toxicity higher. Otherwise (ie, in case of equality), we look at the second largest grade and compare the frequencies there and so on. As an example, in the hypothetical dataset, discussed in Section 6.1, the RFVs of patients 7 and 8 are x = (0, 1, 3, 1, 0) and y = (0, 4, 1, 1, 0), respectively. These two match on grades 5 and 4 but not 3. Then, the largest grade that they differ on is grade 3, that is, i(x, y) = 3. Since x 3 > y 3 , we conclude that patient 7 has a higher T-rank than patient 8. This way we can rank all patients. The mx function when plotted against this order looks like a step function while TI always monotonically increases, as shown in Figure 1. The avg mixes the high and low grades which generally causes the overall grade to go down, hence misrepresenting the toxicity as determined by the T-rank; this is shown in Figure 1 as average occasionally going down even when the toxicity rank goes up.

STATISTICAL THEORY
In this section, we consider the statistical problem of testing whether two populations of toxicity profiles are different. This is the central problem of pharmacology where one is interested in determining whether two or more drugs have the same toxicity effects in patients, or whether a certain drug is more toxic than the other. We present the problem under the Poisson-Limit model of Section 2.1, and introduce a general two-sample test that can be applied based on any scalar summary measure. We present an asymptotic analysis of these tests (Theorem 2) and compute the relevant parameters for the three summary measures: the average, the maximum, and the TI.

Two-sample tests
Recall the Poisson-Limit model of Section 2.1 for the reduced frequency vectors (RFVs). Assume that we observes RFV samples from two such populations: where = ( r ) and = ( r ) are vectors in R K + . We recall that Poi( ) = ∏ r Poi( r ) is the distribution with independent Poi( r ) coordinates. Our goal is to test the null hypothesis H 0 ∶ = . We consider the general statistic for some mapping g from R d to R. Let us define where E and var denote the expectation and variance, assuming that X ∼ Poi( ). We assume that M g ( ) and 2 g ( ) are finite for all ∈ R K + . Consider the two-sample test that rejects the null hypothesis if wherêis the following estimatê= The next theorem describes the asymptotic behavior of this test: (8), with some , ∈ R K + . Assume that n, m → ∞ such that n∕m → and that 2 g (⋅) is a continuous function. Then, the two-sample test that rejects the null according to (10), has asymptotic level . Moreover: (a) If M g ( ) ≠ M g ( ), the test is asymptotically consistent (ie, the power converges to 1). (b) If M g ( ) = M g ( ), the test is inconsistent and its power converges to In particular, if in addition 2 g ( ) = 2 g ( ), the test is asymptotically powerless (ie, the power converges to ). (c) Assume that M g (⋅) is continuously differentiable at . For the shrinking alternative, = + ∕ √ n with ∈ R K , the power of the test converges to The quantity B 2 , with B given in (12), is the slope of the test in the direction . The slope plays a role in determining the Pitman asymptotic relative efficiency (ARE) of two test with respect to each other. In particular, the ratio of the slopes for two tests determines the asymptotic ratio of the samples sizes necessary for achieving the same power by two tests of a given size; see Reference 24, chapter 14 for details. The proof of Theorem 2 also gives the following asymptotic approximation to the power of the test: where = M g ( ) − M g ( ), = z ∕2 g ( ) √ 1 + and 2 = 2 g ( ) + 2 g ( ). Simulations in Section 5 show that this approximate formula is quite accurate even for small n. Remark 2. One can replace 2 g (̂) in (10) with any consistent estimate of the variance of the pooled sample (under null). Theorem 2 still holds for such a test. In particular, we can replace 2 g (̂) with the empirical variance of the pooled sample: . The resulting test will be the usual two-sample t-test and it enjoys the same asymptotic properties as those of (10). There could be some advantage in using 2 g (̂) vŝ2 for small sample sizes, but it is generally hard to quantify the difference since both estimates quickly approach the true variance under the null. The main utility of computing the functions 2 g (⋅) and M g (⋅) is that they enable us to obtain explicit expressions for the asymptotic power of the tests and the AREs.
To implement the test (10), one needs the variance function 2 g (⋅). To compute the slope, we additionally need the mean function M g (⋅). Below we derive exact expressions for these quantities for the three summary statistics of Section 2, that is, g ∈ {TI, mx, avg}. To simplify the notation, let + = ∑ K r=1 r , r = r ∕ + , and = ∕ + . Note that is a probability distribution on [K]. Let us denote the first and second moments of this probability distribution as We also define the function where Ei is the exponential integral and is the Euler-Mascheroni constant. It is known that 0 ≤ Er( ) − [1 − (3 ∕4)] ≤ 11 2 ∕36 for ≥ 0, hence e − Er( ) → 0 as → ∞. This implies that for sufficiently large + , the variance avg ( ) is nearly completely determined by 1 ( ). Thus equality of the first moments of and together with + = + ≫ 1 is enough for the mean index test to be almost powerless. Corollary 1, in fact, suggests that rather that the null hypothesis of H 0 ∶ = , the test based on avg is appropriate for testing the following null:

Lemma 2. Under a Poisson model, X
Remark 3. A counterintuitive consequence of Lemma 2 is that there are examples of rate vectors and , such that ≥ coordinatewise, but M avg ( ) < M avg ( ). See Section 6.1 for such an example.
Next, we consider the maximum index: For z = (z 1 , … , z K , z K + 1 ), with z K + 1 = 1, define See Section A0.3 for a derivation of the second equality.

Computing the slope
The slope of the test (10) is B 2 , where B is defined in (12). To compute the slope, we need the gradient of the mean function. Let us consider the case of mx. The mean function is of the form where ′ (z) = e −z is applied coordinatewise to vector W . The mean function for the TI statistic has a similar form: is an upper triangular matrix with u ij = j/(j + 1) for j ≥ i. The rest of the calculations follow similarly. For the avg, the ith element of the gradient of the mean function is We will use these expressions to empirically evaluate the slopes of the three summary measures, and compare their asymptotic relative efficiencies.

SIMULATIONS
We start by investigating the asymptotic relative efficiency (ARE) of TI w.r.t. the mx and avg summary measures. To do so, we consider the two-sample Poisson-Limit model (8) with K = 5 and generate the mean vectors, and , randomly as follows: is drawn from ∏ K i=1 Unif(0.01, 2) and is set equal to + , with ∼ ∏ K i=1 Unif(0.05, 0.3). We generate 10 4 such parameter pairs and for each model, evaluate the ratio of the test slopes for TI vs mx and avg as a measure of the ARE, that is, we calculate B 2 TI ∕B 2 mx and B 2 TI ∕B 2 avg . Figure 2 illustrates the resulting histogram for the two slopes. Values above one indicate a higher efficiency for TI relative to the competing method. The plots show that, in the majority of cases, TI has a higher asymptotic efficiency compared to mx and avg. The histogram for B 2 TI ∕B 2 avg has a heavier tail than B 2 TI ∕B 2 mx which indicates that there are cases that TI achieves a much higher efficiency relative to avg (compared with mx). Under our sampling scheme, the probability that TI has a larger slope than the mx and the avg is 0.97 and 0.99, respectively.
Next, we consider specific values for and , namely = [0.5, 0.75, 1, 0.75, 0.5] and = [0.60, 1.05, 1.50, 1.05, 0.60], and investigate the receiver operating characteristic (ROC) and the power curves for the three tests. The ROC is obtained by plotting the true positive rate (TPR) achievable for any given false positive rate (FPR). Note that TPR is an alternative F I G U R E 2 Comparison of the asymptotic relative efficiency, as the ratio of test slopes, for the TI vs mx (left), and TI vs avg (right). The plots are the histograms of the ARE under a random Poisson-Limit model, described in Section 5. The portions colored red correspond to models where the ARE of TI is lower than the competitors [Colour figure can be viewed at wileyonlinelibrary.com] F I G U R E 3 ROC curves (left) and the simulated and asymptotic power plots at significance level = .05 (right). The dashed lines, in the right panel, denote the asymptotic power functions [Colour figure can be viewed at wileyonlinelibrary.com] name for the power; similarly, FPR is another name for the test size. The power curves are obtained by plotting the power of the test of size = .05, against the sample size n. Figure 3 shows the ROC curves of the TI, mx and avg for n = 100 and also the asymptotic and simulated power plots for 1000 repetitions, illustrated with the dashed and solid lines, respectively. The values of and are chosen to be close to the setting of Corollary 1. In particular, we have + = 3.5, + = 4.8 and for the avg test, 1 ( ) = 1 ( ) = 3, 2 ( ) = 10.57, 2 ( ) = 10.44. Corollary 1 explains the poor performance of the avg in this setting, matching what Figure 3 shows: The avg is nearly powerless for this testing problem.
We note the close match between the asymptotic power curves calculated based on (13) and the simulated curves. If we increase the number of repetitions further, the simulated power perfectly match the asymptotic one. In this example, TI outperforms the mx and the avg, in terms of power, for all false positive rates, and achieves a given power, at size = .05, with a smaller sample size.

TRIAL APPLICATION
In this section, we first demonstrate how the two-sample tests discussed in Section 4 can be implemented using a simple trial with hypothetical data. Then, we show the results of application to a real clinical trial.

Hypothetical data
Consider a simple hypothetical trial where n = m = 5 patients were assigned to each of two treatment groups, with toxicity data given in Table 1. The data were randomly generated from Poisson-Limit models with = (0, 5, 1, 5, 0) and = (0, 1, 1, 1.5, 0) and the patients are ordered in decreasing T-rank. Note that the first drug is more toxic on average, in the sense that ≥ coordinatewise, that is, the mean of vector of one distribution dominates the other in every coordinate. The pooled rate estimatêis obtained by taking the average of each grade column in Table 1, givinĝ= (0, 3.3, 1.1, 2.9, 0). TA B L E 1 Frequency vector representation for the hypothetical example in Section 6.
using the formula in Lemma 4 to calculate 2 TI (̂) = 0.157. At level = .05, the test based on TI rejects the equality of treatments since S TI ∕se(S TI ) = 3.13 > 1.96 = z ∕2 . Similar calculations can be performed for the tests based on mx and avg, as shown in Table 2. Both tests fail to reject the equality of treatments since S mx /se(S mx ) = 1.89 and S avg /se(S avg ) = 1.07 are both less than z ∕2 = 1.96.
Note, from Table 2, that the standard error of the two statistics S TI and S mx are roughly the same in this case. However, the difference in group means is much larger for TI relative to mx (|S TI |≫ |S mx |) due to patients in group 1 having many more grade 4 toxicities relative to group 2. The TI takes the frequency of grade 4 toxicities into account while the mx ignores it.
We can also use formula (13)  Similarly, for the mx, note how close the true means are to the empirical ones (cf Table 2). The fact that for treatment 1, 4 is quite large-hence most patients show at least one grade 4 toxicity-is reflected in the small value of the variance 2 mx ( ). That is, for the majority of patients in group 1, mx will be 4, hence its mean will be concentrated near 4. In this case, we have 2 = 0.712 and = 0.357 giving an approximate power of 0.800.
For the avg, group 1 variance is higher than both TI and mx, though still quite small relative to group 2. The mean difference is very small in this case, = −0.048. The negative sign indicates that treatment 1 is less toxic on average than treatment 2 (in expectation), which is quite counter-intuitive since treatment 1 is more toxic in terms of the extreme grades. Even more surprisingly, we have M avg ( ) < M avg ( ) despite ≥ coordinatewise. That is, the frequency of all the grades are higher on average for treatment 1, but its expected avg is lower. This can be confirmed empirically, for large n, by observing that S avg turns out to be negative most of the time. It can be explained by noting that 2 being much larger than 2 biases the avg for group 1 toward lower grades. This shows a clear deficiency of the avg in ordering toxicities. Plugging-in = −0.048 and 2 = 0.635 into (13), we obtain an approximate power of 0.295.

Application to clinical trial data
We use the data from the NSABP R0-4 rectal cancer clinical trial as a case study for comparing the power of the three summary statistics (TI, mx, and avg) in detecting treatment differences. NSABP R0-4 was a phase 3 trial conducted between July 2004 and August 2013 (NCT00058474). Eligible patients were diagnosed with surgically resectable stage II or III rectal adenocarcinoma. Patients were assigned to four different treatments: (1) infusional 5-fluorouracil (5FU); (2) oral capecitabine (CAPE); (3) combination of 5FU and oxaliplatin (OX); (4) combination of CAPE and Ox. The trial included 1608 participants. From these, 50 patients were ineligible or had missing data. The analytic sample consisted of 1558 individuals. CONSORT diagram is available in Reference 21. Main study description is available in Reference 23.
We use the power analysis to compare the performance of the three methods in detecting treatment differences. We first fit a Poisson-Limit model (see (5)) to each treatment data which amounts to estimating its rate vector using the sample mean of the corresponding RFVs; see for example (11), but applied only to a single sample. Once the model for each treatment is specified, we can compare the methods based on their power in differentiating pairs of treatments. For each method, we plot the power vs the sample size. These plots can be obtained by simulating from the fitted models (a form of parametric bootstrap) or by using the asymptotic power formula (13) which, as was shown in the previous section, gives accurate results even for small sample sizes. Figure 4 shows the power plots for testing four pairs of treatments that are known to be different. All tests are performed at 5% significance level. The plots show that TI has a greater power in detecting differences between treatments in all the four cases. That is, using TI, the required number of patients to detect treatment differences is smaller. We note that Reference 21 performs a similar power analysis on the same dataset. They provide no model for the data and instead use bootstrap to perform the power analysis. The tests they consider are based on Wilcoxson signed-rank statistic. In contrast, we provide a model for the data and use parametric bootstrap (or exact asymptotic power curves) to perform the power analysis. Our test statistics are also different and based on the difference-in-mean statistics. Compared with the results of Reference 21, Figure 4 shows that the test statistics considered here achieve a given power with a smaller sample size. We observe a similar relative ranking of TI, average and max-grade as that observed by Gresham et al. 21

DISCUSSION
In this article, we studied the mathematical and statistical properties of the toxicity index. We showed that the TI can be stated solely as a function of the reduced frequency vector, with a tractable closed-form formula. We introduced T-rank preservation as a desirable property that allows investigators to achieve clinically-meaningful ranking of the toxicity profiles. We showed that compared with competing summary measures (ie, the max-grade and the average-grade), TI is the only measure that preserves the T-rank and is an injective (ie, one-to-one) mapping on the space of toxicity frequencies.
Neither max-grade, nor the average-grade are injective or T-rank preserving. The max-grade loses information in the toxicity profile by only looking at the highest grade. The average-grade loses the information about the extreme toxicities by equally weighting low and high grades.
To statistically compare various toxicity measures, we proposed a Poisson-Limit model for modeling sparse toxicity data via their reduced (ie, 0-removed) frequency vectors. Under this model, we developed a general two-sample test for detecting differences among two population of toxicity data. We derived formulas for the asymptotic power of the test, for the three toxicity measures (TI, max-grade, and average-grade) and calculated their asymptotic relative efficiencies (AREs). We empirically demonstrated that TI has a higher ARE that the other two summary measures, with high probability, under a random Poisson-Limit model. We also illustrated that TI achieves a higher power in detecting treatment differences in a cancer trial, validating the empirical results of Reference 21. The framework we developed in this article allows investigators to compare toxicity-summary methods in any clinical trial, and to analytically evaluate new proposals for toxicity summaries.
The TI can be generalized to accommodate noninteger toxicity grades if the clinician believes that a toxicity has a different impact than the observed grade. 11,25 Furthermore, the TI can be used in any risk assessment application with a grade system similar to CTCAE, provided that the grades for different inputs (eg, items) are equivalent. In Reference 26, the TI was used to score surgical complications based on the Clavien-Dindo system. 27 More broadly, the analysis in this article applies to any application where integer scores on multiple events are recorded for a collection of subjects, assuming that the resulting score array is sparse. We are currently investigating the benefits of using TI for patient reported outcomes (PROs) data, 28 extending the TI for multiple cycles and searching for other T-rank preserving summary measures. p i = i ∕d for i = 1, … , K, we obtain the MGF of X, We have where m(x; ) = e (e x −1) is the MGF of a Poisson random variable with mean . The proof is complete.

A.2 Proofs of Section 3
Proof of Proposition 1. Assume that Y i = k for r k ≤ i ≤ r k − 1 for k = 1, … , K with 1 = r K ≤ · · · ≤ r 1 ≤ r 0 = n. Note that Then, which is the desired result. ▪ Proof of Lemma 1. Assume x ≻ y and y ≻ z. By assumption, x ≠ y and y ≠ z. We consider three cases: (i) i(x, y) = i(y, z) = : . Then, it is not hard to argue that i(x, z) = and since x > y > z , we have x ≻ z.
(ii) 1 : = i(x, y) > i(y, z). Then, i(x, z) = 1 and The proof is complete. ▪ Proof of Theorem 1. Assume K ≥ 2. Consider the following two vectors If is enough to show that TI(x) ≥ TI(y). In fact, we show equality. To simplify, let us write u = (1 + K) −m . Then, Note that all the lower terms are zero in the expansion. On the other hand, we have g K (y) = (1 + K) y K = (1 + K) m−1 and g k (y) = ∞ for k < K. It follows that TI(y) = g K+1 (y)(K + 1)(1 − (K + 1) −m+1 ) + K g K (y) Then, The proof for the general case follows by considering the largest grade on which x and y differ, which without loss of generality (WLOG) we assume to be K. Assume WLOG that x K > y K . Then, setting x j = 0 for all j ≠ K only decreases TI(x). Similarly, increasing y K to x K − 1 and increasing all y j = ∞, only increases TI(y). We have shown that even in this worst case TI(x) ≥ TI(y). So all the other cases follow. ▪