A tutorial on the use of exploratory efficacy outcomes in uncontrolled phase I cell therapy trials

Abstract Phase I cell therapy clinical trials evaluate the safety of novel biologic treatments and are often uncontrolled. Many of these studies also include exploratory efficacy outcome measures, which are frequently continuous measures of disease state or severity, or participant‐reported measures of symptom burden or quality of life. When such outcomes are included in uncontrolled phase I trials, they are typically serially assessed on the participants over time, and any improvement from baseline is interpreted as preliminary evidence of efficacy justifying a future, controlled trial. However, it is challenging to distinguish true improvement from regression to the mean in this design. The problem is exacerbated when trial entry criteria are based on extreme values of the outcome measure used to assess efficacy. It is possible to estimate the expected effect of regression to the mean when the natural history of the outcome measure is known, yet this is rarely done in practice. This article provides a refresher on regression to the mean for investigators designing early phase clinical trials in cell therapy and evaluates the potential for regression to the mean to have influenced conclusions drawn from recently conducted phase I cell therapy trials.

The purpose of this tutorial is to communicate the statistical concern, namely, regression to the mean (RTM), 1 that complicates efficacy assessment using the single arm trial design. The tutorial is designed to be accessible to anyone who has a basic familiarity with statistics.
The tutorial begins with a definition of RTM accompanied by an illustration using an example data set, followed by a demonstration of the ubiquitous nature of RTM and an explanation of how to estimate the impact of RTM on study results. An example application of estimating RTM is given using a hypothetical single arm trial of a cell therapy for MS.

| DEFINITION OF RTM
In a colloquial sense, RTM refers to the phenomenon in which patients who have initially high values tend to have lower values when measured again later. 1 Likewise, patients who have initially low values tend to have higher values when measured subsequently. More generally, we can simply say that when serially assessing outcomes on participants, those who have extreme values at the first assessment will tend to have less extreme values at the second assessment. 2 We can formalize this concept using a statistical model as follows.
Assume that we are interested in evaluating changes in quality of life (QOL) over time among patients who have been treated with an experimental cell therapy. Suppose that we have a valid and reliable tool to measure QOL and this tool provides a continuous, normally distributed outcome measure for which higher scores indicate better QOL. Let Y i be the QOL score for a single subject at time i, where the measurement time points are indexed from i = 1…k. The total number of measurements, k, and the spacing between measurements would be set based on the objectives of the study. At any time i the observed QOL is actually the sum of the subject's true QOL, T, plus some random error introduced by the measurement tool, E i , such that QOL for a single participant at time i is defined as Y i = T + E i . The value Y i is what the researcher observes, and although certain assumptions are made about T and E i , these variables are usually not observable. 3

To finish specifying the model for QOL we must identify the sources of variation in the observed values, Y i . There are two sources of variation: (a) the natural variation between patients in QOL, regardless of the time point at which patients are measured, and (b) the variation among measures of QOL within the same patient over time.
Patients' natural variation in QOL is accounted for by the distribution of T. We already stated that we will assume this variation is describable by a normal distribution. To make the example concrete we will assume T has a mean of 50 and an SD of 5. Based on these assumptions we know that 95% of patients will have QOL scores in the range of 40 to 60. The second source of variation in the model, within-patient variation, is represented by the distribution of E i , which has, by definition, mean 0 and, for our example, an SD of 5. Therefore, for a given person who has true QOL of 60, we would expect the distribution of their observed values, Y i , to be centered at the true value of 60, with 95% of their observed values falling between 50 and 70. To formalize the concept that E i represents random noise in the measurement, we say that individual values of E i are not correlated with each other, and values of E i are also not correlated with the true value, T. 3

We can now understand RTM in light of this statistical model as shown in Table 1. 4 Assume that in the population of patients we would treat there are 1,000,000 people with QOL at the mean value of 50. A total of 10,000 patients have QOL = 40, and 100 have QOL = 30. The distribution is symmetrical, so the same numbers of patients have true QOL above the mean at values of 60 and 70. Next, assume that 98% of patients have observed QOL with measurement error equal to 0. Furthermore, assume that 1% of observed QOL have measurement error equal to −10 points and that 1% have measurement error equal to +10 points.
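This two-component model is easy to check by simulation. The sketch below (our own illustration, not part of the original example) draws true QOL values and adds measurement error; because T and E i are independent, their variances add, so the observed scores have SD equal to the square root of 25 + 25, about 7.07:

```python
import random

random.seed(3)

n = 100_000
# Between-patient variation: true QOL T ~ Normal(mean 50, SD 5)
t = [random.gauss(50, 5) for _ in range(n)]
# Within-patient variation: observed Y = T + E, with E ~ Normal(0, 5)
y = [ti + random.gauss(0, 5) for ti in t]

mean_y = sum(y) / n
sd_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
# Independent variances add: SD(Y) = sqrt(5**2 + 5**2), about 7.07
print(round(mean_y, 1), round(sd_y, 2))
```

Note that the observed scores are more spread out than the true scores: roughly 5% of observed values fall outside 50 ± 2(7.07), even though 95% of true values lie between 40 and 60.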
Reading across the rows in Table 1 we see that true values are at the center of the distribution of observed values, as expected.
For example, among patients with true QOL of 60 (fourth row of Table 1; T = 60) the distribution of observed values is centered at Y i = 60. In fact, if this were not the case, then the measurement tool would be seriously flawed! Reading down the columns of Table 1 shows something more interesting: the patients who share a given observed value do not all share the same true value. For example, of the patients observed at Y i = 40, 9800 have true QOL of 40, while 10,000 have true QOL of 50 (those whose measurement error happened to be −10), and 1 has true QOL of 30. Because more than half of the patients observed at 40 actually have true QOL at the population mean of 50, this group will tend to score closer to 50 when reassessed; that is, the extreme observed values regress toward the mean. Although this statistical model establishes first principles for understanding RTM, the actual effect of RTM in a longitudinal study is somewhat more complex. This is the subject of the next section.
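The column-wise reading of Table 1 can be reproduced with a small calculation from the counts and error probabilities assumed above (the variable names are ours):

```python
from collections import defaultdict

# Assumed population counts of true QOL (T), as in the text
true_counts = {30: 100, 40: 10_000, 50: 1_000_000, 60: 10_000, 70: 100}
# Measurement error: 98% exact, 1% read 10 points low, 1% read 10 points high
error_probs = {-10: 0.01, 0: 0.98, 10: 0.01}

# Map each observed value Y = T + E to the expected count at each true value
observed = defaultdict(dict)
for t, n in true_counts.items():
    for e, p in error_probs.items():
        observed[t + e][t] = observed[t + e].get(t, 0) + n * p

col_40 = observed[40]                # who ends up with an observed score of 40?
total = sum(col_40.values())         # 9800 + 10000 + 1 = 19801 patients
mean_true = sum(t * c for t, c in col_40.items()) / total
print(round(mean_true, 2))           # mean true QOL of this column: about 45, not 40
```

More of the patients observed at 40 truly sit at 50 than at 40, so the average true QOL in that column is roughly 45, which is what their reassessments will tend toward.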

| RTM IN AN EXAMPLE DATA SET
Suppose we conducted a single arm clinical trial in which we assessed a continuous, normally distributed QOL outcome measure on 50 patients. Again, we will assume that higher scores on this outcome measure represent better QOL. The first assessment was taken at entry into the trial, immediately prior to exposure to the experimental therapy. Assume for the sake of simplicity that there was only a single administration of the study product in this trial. The second assessment was taken later, at 3 months after the baseline assessment (the exact length of time is not important). Furthermore, suppose that QOL does not change over this period; that is, the distribution (the mean and SD) is identical at baseline and at month 3 after treatment with the experimental therapy. Thus, the only fluctuation in the outcome measure over time is due to measurement error.
In other words, our QOL assessment will produce a slightly different result each time we apply it to the same person, even though the person's QOL has not changed. Note that this scenario is equivalent to what would be a true null hypothesis in this design, that is, that the population of patients treated with the experimental therapy would not experience any change in QOL over a 3-month period after exposure to the therapy.
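This null scenario can be simulated directly. The sketch below (our own illustration; we use 5,000 simulated patients rather than the trial's 50 so the pattern is visible above simulation noise) holds true QOL fixed and generates two noisy assessments. Overall, mean change is near zero, yet the patients who looked worst at baseline appear to improve substantially:

```python
import random

random.seed(1)

n = 5_000
true_qol = [random.gauss(50, 5) for _ in range(n)]            # stable true QOL
baseline = [t + random.gauss(0, 5) for t in true_qol]         # measurement error only
month3 = [t + random.gauss(0, 5) for t in true_qol]           # no treatment effect

overall_change = sum(m - b for b, m in zip(baseline, month3)) / n

# Subgroup with extreme (low) baseline scores: apparent "improvement"
low = [(b, m) for b, m in zip(baseline, month3) if b < 45]
change_low = sum(m - b for b, m in low) / len(low)
print(round(overall_change, 2), round(change_low, 1))
```

The subgroup's mean change is several points in the favorable direction even though, by construction, no one's QOL changed at all.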
Imagine that we obtain the results shown in Figure 1.

| RTM is always reflected in imperfectly correlated serial measures
The example given above is not a special case. In fact, it is a scenario that is destined to occur whenever dealing with imperfectly correlated outcome measures. This can be understood easily by revisiting some basic concepts in linear regression analysis. We begin with some definitions as follows.
Let X = {x 1 , x 2 , …, x N } and Y = {y 1 , y 2 , …, y N } be the first and second assessments of a normally distributed outcome measure in a group of N participants followed over some arbitrary period of time. We can describe the relationship between the mean value of Y (the reassessment) and any given value of X (the initial assessment) using the familiar linear regression equation

Y = α + βX.

The regression equation describes a straight line through the data points (imagine the data arranged in a scatterplot such as that shown in Figure 1) where the y intercept, α, and the slope of the line, β, are determined such that a "best fitting line" is identified. A common method for estimating these parameters is the method of least squares, from which we know closed form solutions for the intercept and slope. One way of writing these equations is as follows 5 :

β = r xy (s y /s x ) and α = ȳ − βx̄,

where r xy is the sample correlation between X and Y, s x and s y are the sample SDs, and x̄ and ȳ are the sample means. Written this way, the predicted reassessment for a participant with initial value x is ȳ + r xy (s y /s x )(x − x̄). When the correlation is perfect, |r xy | = 1, the predicted reassessment is exactly as far from its mean (in SD units) as the initial value was from its mean, and there is no RTM. However, whenever there is imperfect correlation, that is, |r xy | < 1, then any given participant's reassessment (Y) is closer to its mean than the same participant's initial value (X) was to its mean. Therefore, when X and Y (initial and reassessment) are not perfectly correlated, there will always be RTM.
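These least squares quantities are easy to compute by hand. The sketch below (our own illustration) fits the regression of reassessment on initial assessment for simulated null data, in which the test-retest correlation is 0.5 by construction, and shows that the predicted reassessment for an extreme initial value is pulled toward the mean:

```python
import random

random.seed(0)

n = 2_000
t = [random.gauss(50, 5) for _ in range(n)]
x = [ti + random.gauss(0, 5) for ti in t]   # initial assessment
y = [ti + random.gauss(0, 5) for ti in t]   # reassessment

mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

beta = sxy / sxx                    # least squares slope
alpha = my - beta * mx              # least squares intercept
r_xy = sxy / (sxx * syy) ** 0.5     # sample correlation, about 0.5 here

pred_at_40 = alpha + beta * 40      # predicted reassessment for an extreme start
print(round(r_xy, 2), round(pred_at_40, 1))
```

A participant observed 10 points below the mean at baseline is predicted to be only about 5 points below it at reassessment, because the fitted slope is roughly 0.5 rather than 1.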
Unfortunately, the problem of RTM is not solved by expressing the outcome as a change from baseline. Using intuition based on the information presented thus far, one can imagine that participants who start the trial with extreme values will tend to have large amounts of change. The mathematics of the relationship between change and initial value are beyond the scope of the present article but are available elsewhere for interested readers. 6 Finally, the effect of RTM can be quite large when trial entry criteria are based on extreme values of the outcome measure. 2 This last point will become readily apparent in the next example, which demonstrates how to predict the amount of RTM that might be expected based on knowledge of the natural history of the outcome measure.
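Although the mathematics are left to the cited reference, the phenomenon itself is easy to exhibit: under the same null simulation, change from baseline is strongly negatively correlated with the baseline value (a sketch using our own helper function):

```python
import random

random.seed(2)

n = 2_000
t = [random.gauss(50, 5) for _ in range(n)]
x = [ti + random.gauss(0, 5) for ti in t]    # baseline
y = [ti + random.gauss(0, 5) for ti in t]    # follow-up, no true change
change = [yi - xi for xi, yi in zip(x, y)]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r_cb = corr(change, x)
print(round(r_cb, 2))   # negative: high starters "worsen", low starters "improve"
```

With no treatment effect at all, the correlation is close to −0.5 in this setup, which is why change-from-baseline analyses in single arm trials can be so misleading.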

| ESTIMATING EXPECTED RTM
There are some therapeutic areas in which the natural history of meaningful outcome measures is known. In these settings it is possible to estimate the amount of RTM that would be expected in the absence of any treatment effects. 7 First, let k be the value of the outcome measure for which we are interested in estimating RTM. For example, in Figure 1 we evaluated RTM in participants with baseline QOL of k = 16. Next, we must know the mean, μ, and SD, σ, of the outcome measure in the population. These values could be obtained from a natural history study. Next, we can use this information to create a ratio that compares the probability of the value k in the population vs the probability of values greater than k in the population as follows:

C(k) = φ([k − μ]/σ)/(1 − Φ([k − μ]/σ)),

where φ and Φ denote the standard normal density and cumulative distribution functions. Letting ρ xy denote the correlation between the baseline and follow-up assessments, the expected RTM effect (the expected movement back toward μ at reassessment) among participants selected for baseline values beyond k is then

Expected RTM = (1 − ρ xy )σC(k).

Unfortunately, the complete set of information required for the above calculations is seldom available in the literature. However, provided we can at least obtain information about the percentile rank of the cut point k in the population, we can use the data from our study to estimate μ, σ, and ρ xy , thus allowing us to use the above equation to estimate the expected amount of RTM. The details of this procedure are covered elsewhere and are not necessary to address for the purposes of this example. 7

Finally, it should be noted that this method for estimating expected RTM is applicable specifically to normally distributed continuous outcomes. Although RTM may also manifest in ordinal or binary outcomes, some modifications to the above method may be required to evaluate the magnitude of RTM in those settings.
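Once μ, σ, and ρ xy are in hand, the calculation needs only standard normal functions. Below is a minimal sketch, assuming the standard formula for expected RTM under selection above a cut point k; the helper name expected_rtm is ours:

```python
import math

def expected_rtm(k, mu, sigma, rho):
    """Expected RTM effect for a normal outcome when patients are selected
    because their baseline exceeds k, assuming the standard formula
    (1 - rho) * sigma * phi(z) / (1 - Phi(z)) with z = (k - mu) / sigma."""
    z = (k - mu) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # standard normal density
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))        # 1 - Phi(z)
    return (1 - rho) * sigma * phi / upper_tail

# Example: outcome with mean 50 and SD 5; enroll patients observed at 60 or
# above, with test-retest correlation 0.5. The group is expected to move
# several points back toward the mean with no treatment effect at all.
print(round(expected_rtm(60, 50, 5, 0.5), 1))
```

Note that when ρ xy = 1 (perfectly reliable measurement) the function returns 0, and the expected RTM grows as the entry cut point becomes more extreme or the test-retest correlation weakens.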

| AN EXAMPLE USING A HYPOTHETICAL TRIAL TESTING A NOVEL CELL THERAPY FOR MS
MS is a disease for which the natural history of a clinically meaningful outcome measure has been documented. The Kurtzke Disability Status Scale (DSS) is an ordinal scale represented by the integers from 1 (least disabled) to 10 (most disabled). Suppose we design a single arm trial that enrolls patients with extreme DSS scores and uses change in DSS as an exploratory efficacy outcome, relying on a published natural history study to characterize the untreated course of the disease (there is an implicit assumption here that the natural history study we refer to is an accurate reflection of the population we are sampling in our trial; investigators must be careful of this, especially in heterogeneous disorders such as MS). Unfortunately, the natural history study did not publish estimates of μ, σ, and ρ xy , so we will rely on the data collected in our trial to estimate these parameters and calculate expected RTM using published formulas (we are not illustrating this step here). Suppose we have conducted the trial and obtained the results shown in Table 2.
In this hypothetical example we obtained a result (mean difference of 0.5) that represents a clinically meaningful shift in this population.

It should also be noted that although RTM is an important motivating factor for understanding why efficacy is not estimable without a control, it is not the only motivating factor for using a control. For example, some single arm trial designs include simultaneous measurement of biomarkers and clinical outcomes that allow estimation of the correlation between biological changes and changes in the continuous efficacy outcome over time. Although this provides evidence that changes in biological activity are correlated with clinical outcome in treated patients, it does not provide any information as to how much more such biological activity or the clinical outcome is modulated when the treatment is received as compared with when it is not (the treatment effect). Thus, in addition to serving as a solution to dealing with RTM, the use of a control group enables testing of scientific hypotheses that are not otherwise possible to test in a single arm trial.
The use of a concurrent, randomized control group remains the most accepted way to disentangle RTM from actual treatment effects.
However, novel statistical designs, such as Bayesian dynamic borrowing, have recently emerged that leverage historical control data, potentially minimizing the number of contemporaneous controls required for a study. 12 These designs may be beneficial for early phase cell therapy studies that incorporate efficacy assessments, especially in rare diseases in which the patient population is small, or in populations that express reluctance to consent to the possibility of assignment to a control.
In summary, this tutorial shows that interpretation of observed change in a continuous outcome measure as preliminary evidence of efficacy in the setting of an uncontrolled trial is likely to lead to erroneous conclusions. Therefore, evidence from single arm trials that assess efficacy through serial evaluation of continuous outcomes should not be used to make decisions about pursuing further research or to select treatments for patients.

CONFLICT OF INTEREST
The author indicated no potential conflicts of interest.

DATA AVAILABILITY STATEMENT
All the data generated or analyzed during this study are included in this published article.