### Abstract

A well-designed pilot study can advance science by providing essential preliminary data to support or motivate further research, refine study logistics, and demonstrate proof of concept. Often, the outcomes of such studies can be quantified by a success/failure dichotomy; for example, a novel compound may show activation of a neural pathway, or it may not. When an intervention's efficacy is quantified using a dichotomous outcome, only a finite, and often small, number of sample configurations is possible, so probability mass functions can be enumerated exactly to determine the probability that the observed result from a pilot study supports further evaluation of the intervention. The purpose of this research was to determine the probability of an “efficacy signal” for pilot studies using one- and two-sample designs. The efficacy signal was defined as the probability of observing a more favorable response proportion relative to a historical control (one-sample setting) or to a concurrent control (two-sample setting). An enumeration study (exact simulation) was conducted to calculate the efficacy signal probability. One-sample designs yielded a higher probability of detecting an efficacy signal than two-sample designs; however, sampling variation must be accounted for, and a 68% score confidence interval is recommended for this purpose.

### Introduction

In the clinical trial paradigm, the overarching objective of evaluating the safety and efficacy of an intervention, most commonly a new drug regimen, is divided into multiple phases of research. Traditionally, phase I trials have focused on estimating safety and the maximum tolerated dose of the drug. Phase II clinical trials provide data to support an initial estimate of the drug's efficacy and further evaluation of toxicity. In a confirmatory phase III trial, efficacy is evaluated in comparison to a concurrent control. Recently, this established paradigm has been questioned in light of the time required to deliver a product to market and a lower-than-anticipated rate of successful phase III clinical trials.[1] Furthermore, a vigorous debate has emerged in the literature regarding the role of a randomized phase II trial in advancing the science.[2-6]

In a related context, the role of a pilot study has received attention in the literature[7-14] and pilot studies are widely reported.[15] Compared to phase II clinical trials, which routinely have 50–100 participants,[6] pilot studies tend to be much smaller. Julious[12] recommended studying 12 participants per group in a pilot study. This makes it feasible to screen more than one drug simultaneously within a relatively small study. When more than one group (or treatment, intervention, device, etc.) is to be considered in the context of a pilot study, most would agree that incorporating an element of randomization helps attenuate the impacts of confounding and selection bias,[2] but an unanswered question in the literature is whether or not one *should* include more than one intervention in a pilot study. This report will examine the probability structure of various pilot study designs and discuss how best to balance the need for data to support further investigation while minimizing the effects of sampling variation on the interpretation of the data. The research is illustrated in part by a recent randomized pilot study from the medical literature.[16]

#### Illustrative example

Weiner et al.[16] recently published a randomized pilot trial consisting of 9 randomized participants, 8 of whom were evaluable at the end of the trial. Participants, all of whom were clinically diagnosed with schizophrenia, were randomized to either varenicline or placebo to assess the safety and efficacy of varenicline for smoking cessation. The study's primary endpoint was cessation, defined as expired CO < 10 at each of the last 4 study visits. The observed cessation (success) proportions were 75% (3 of 4) and 0% (0 of 4) for varenicline and placebo, respectively (Fisher's exact *p* = 0.14). This analysis was supported by a mixed-model analysis of the observed CO levels (reported *p*-value for the treatment-by-time interaction of 0.02), and the final conclusion was that varenicline demonstrated a preliminary efficacy signal.
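The reported Fisher's exact *p*-value can be reproduced from the 2 × 2 table of cessation counts. The sketch below, using only the Python standard library, enumerates the hypergeometric distribution for the observed margins (3 total successes among 4 + 4 participants); the function name is ours, not from the original study:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def pr(x):  # probability of x successes in row 1, margins fixed
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = pr(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(pr(x) for x in range(lo, hi + 1) if pr(x) <= p_obs + 1e-12)

# Weiner et al.: 3/4 quit on varenicline, 0/4 on placebo
p = fisher_exact_two_sided(3, 1, 0, 4)
print(round(p, 2))  # 0.14, matching the reported value
```

The exact value is 8/56 ≈ 0.143, which rounds to the *p* = 0.14 reported in the trial.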

While this study concluded that varenicline demonstrated preliminary efficacy, one might wonder what would happen if the same pilot study design were repeated several times. With the constraint that only four people were treated on each arm, only five unique cessation rates can be observed: 0%, 25%, 50%, 75%, and 100%. The observed cessation rate may therefore bear little resemblance to the true proportion, which can lie anywhere on the continuum from 0% to 100%. For example, participants randomized to placebo are not expected to have a 0% cessation rate in a large study, because supportive therapy to quit is commonly provided.[17] Thus, to design an effective pilot study using a binary endpoint, one must account for the discrete nature of the data along with simple binomial probability calculations. The following methodology addresses this need.

### Results

A total of 38,416 unique study configurations (49 binomial probabilities, 2 groups, 16 sample sizes) were enumerated for this study. This resulted in population estimates of RR ranging from 1/49 to 49. Tables 1 and 2 illustrate how the data are summarized for the one-sample (superior point estimate) and two-sample (“choose the winner”) definitions to determine whether there is an efficacy signal. In both scenarios, the novel approach is assumed to have a binomial probability (success proportion) of 0.4, and 0.2 is used for the control (historical or concurrent). With this scenario, RR = 2.0, and one observes little difference in expected values (0.66 vs. 0.64). These expected values can be interpreted as follows:

Table 1. Calculations for a one-sample design with a sample size of 5, a binomial probability of 0.40 for the novel intervention, and 0.20 for the historical control. The efficacy indicator takes the value 1 when 2 or more successes are observed in a sample of size 5 (success proportions of 0.4, 0.6, 0.8, and 1.0). When the true binomial proportions are 0.4 and 0.2 with 5 participants, the expected value (or probability) of finding a “signal” based on a favorable point estimate is 0.66.

| Potential no. of successes | Sample success proportion | Binomial probability (Pr) | Efficacy indicator (I) | Pr × I |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0778 | 0 | 0.0000 |
| 1 | 0.2 | 0.2592 | 0 | 0.0000 |
| 2 | 0.4 | 0.3456 | 1 | 0.3456 |
| 3 | 0.6 | 0.2304 | 1 | 0.2304 |
| 4 | 0.8 | 0.0768 | 1 | 0.0768 |
| 5 | 1.0 | 0.0102 | 1 | 0.0102 |
| | | | **Expected value** | 0.66304 |
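The expected value in Table 1 follows directly from the binomial probability mass function. A minimal sketch, standard library only (the function name is ours):

```python
from math import comb

def one_sample_signal_prob(n, p_novel, threshold):
    """Probability of an efficacy signal in a one-sample pilot design.

    The signal fires when the number of successes reaches `threshold`,
    i.e., the sample proportion exceeds the historical control.
    """
    pmf = [comb(n, k) * p_novel**k * (1 - p_novel)**(n - k)
           for k in range(n + 1)]
    # Sum Pr * I over all possible success counts (Table 1's last column)
    return sum(pr for k, pr in enumerate(pmf) if k >= threshold)

# n = 5, novel success probability 0.40; 2+ successes beat the 0.20 control
print(round(one_sample_signal_prob(5, 0.40, 2), 5))  # 0.66304
```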

Table 2. Two-sample design with the novel and control interventions having binomial probabilities of 0.40 and 0.20, respectively. Each cell contains the product binomial probability for the corresponding pair of success counts. Italicized values indicate sample observations where the number of successes in the control condition equals or exceeds the number of successes in the novel approach. The sum of the bolded probabilities is the expected value for the “choose the winner” approach; for this example it is 0.64331.

| No. of successes: novel (rows) vs. concurrent control (columns) | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | *0.025* | *0.032* | *0.016* | *0.004* | *0.000* | *0.000* |
| 1 | **0.085** | *0.106* | *0.053* | *0.013* | *0.002* | *0.000* |
| 2 | **0.113** | **0.142** | *0.071* | *0.018* | *0.002* | *0.000* |
| 3 | **0.075** | **0.094** | **0.047** | *0.012* | *0.001* | *0.000* |
| 4 | **0.025** | **0.031** | **0.016** | **0.004** | *0.000* | *0.000* |
| 5 | **0.003** | **0.004** | **0.002** | **0.001** | **0.000** | *0.000* |

*One-sample setting*: When the historical control has a 20% success rate and the novel approach has a 40% success rate, studying 5 participants will yield an efficacy signal defined as 2 or more successes 66% of the time. This also implies that 34% of the time the true effect will not be observed in a pilot study with only 5 participants.

*Two-sample setting*: When the concurrent control has a 20% success rate and the novel approach has a 40% success rate, studying 5 participants in each condition (10 total participants) will identify the efficacy signal defined as a higher success proportion in the novel approach 64% of the time.
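The two-sample “choose the winner” expected value in Table 2 is a sum over the product binomial distribution. A minimal sketch reproducing the 0.64331 figure (function names are ours):

```python
from math import comb

def binom_pmf(n, p):
    """Binomial probability mass function as a list indexed by k."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def choose_winner_prob(n, p_novel, p_control):
    """Probability the novel arm shows strictly more successes than the
    control arm when each arm enrolls n participants (product binomial).
    Corresponds to the sum of the bolded cells in Table 2."""
    novel, control = binom_pmf(n, p_novel), binom_pmf(n, p_control)
    return sum(novel[i] * control[j]
               for i in range(n + 1)
               for j in range(n + 1)
               if i > j)

print(round(choose_winner_prob(5, 0.40, 0.20), 5))  # 0.64331
```

Note that doubling the total sample size (5 per arm rather than 5 total) still yields a slightly lower signal probability than the one-sample design's 0.66304, because two arms contribute two sources of sampling variation.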

Immediately one realizes that studying twice the number of participants in the two-sample setting did not improve the probability of observing the efficacy signal based on point estimates. In fact, the two-sample setting has a lower probability of observing that the novel approach was superior, despite the advantage of a larger sample size. This is a result of incorporating two sources of variation into the study design. The probability of observing a sample in which the concurrent control appears more favorable is small but not nonexistent; the probabilities for such data realizations are the italicized values in Table 2. The remainder of the results expands upon this example calculation to include the full range of comparison techniques and sample sizes.

Figure 1 illustrates the six candidate methods for evaluating the efficacy signal across 9 strata of RR. The probability of concluding an efficacy signal increases as the magnitude of association increases from stratum 1 (RR < 1/1.5) to stratum 9 (RR ≥ 3.0). For instances where the novel intervention is inferior to the control (historical or concurrent), the point estimate approach produces an unsatisfactorily high false positive rate. Incorporating statistical inference, even at the alpha = 0.32 level of significance, controls the false positive findings. Further, the lines representing one-sample and two-sample designs are nearly coincident when the control condition is superior to the novel approach, even though the sample size required for the two-sample design is twice that of the one-sample design.

RR Stratum 4 (RR = 1.0) presents data similar to the size of the test, but differs in that the efficacy signal carries a further condition: the point estimate must be greater than the control. With this added condition, testing even at a highly elevated alpha of 0.32 yields a probability of finding an efficacy signal within the range of acceptable false positive rates for pilot studies, namely <25%.[14] For this reason, and because of the parallel to the common method of summarizing translational studies (mean ± 1 SE), the alpha = 0.32 standard will be used for further discussion.
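An alpha of 0.32 corresponds to a 68% score (Wilson) confidence interval, roughly the point estimate ± 1 standard error. The sketch below implements the interval and the signal rule (point estimate above the control and the interval's lower bound above it as well); the n = 12, 6-success example is purely illustrative and not from the paper:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, conf=0.68):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # two-sided critical value
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def efficacy_signal(successes, n, control_prop, conf=0.68):
    """Signal: point estimate above control AND the score CI excludes it."""
    lower, _ = wilson_interval(successes, n, conf)
    return successes / n > control_prop and lower > control_prop

# Hypothetical example: 6 of 12 successes vs. a 0.20 historical control
lo, hi = wilson_interval(6, 12)
print(round(lo, 3), efficacy_signal(6, 12, 0.20))
```

With 6 of 12 successes, the 68% score interval's lower bound (about 0.36) clears a 0.20 historical control, so the signal criterion is met.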

The remainder of the panels in *Figure* 1 examine the probability of concluding an efficacy signal over the range of RRs in which the novel approach is superior to the control condition. When the RR is small, say less than 1.25, the one-sample and two-sample designs using alpha = 0.32 yield very similar findings. As before, the one-sample designs provide better efficiency through a lower total sample size. As the magnitude of the effect increases, the probability of concluding an efficacy signal is markedly higher using a one-sample design. Again, this increase in probability is in spite of only utilizing half the total number of participants.

Figure 1, while informative from a statistical performance perspective, has limited utility in practice because the true underlying binomial probabilities are unknown. Table 3 addresses this concern by aggregating the data over the risk strata for three hypothesized regions of expected novel treatment efficacy. The three categories represent a low probability of success (binomial success probabilities < 0.3), an average probability of success ([0.3, 0.7]), and a high probability of success (> 0.7). To allow for various control response rates, the 9 risk strata used in Figure 1 are retained. The use of the table is straightforward, as will be illustrated.

Table 3. Expected power based on *n* = 12 observations for detecting an “efficacy signal” for various magnitudes of effect. Values reported are the median expected probability of determining that a novel intervention has an “efficacy signal,” based on a one-sample 68% confidence interval approach. Columns group the calculations by the expected response proportion for participants treated with the novel treatment.

| Relative risk range | [0.02, 0.30) | [0.30, 0.70] | (0.70, 0.98] |
| --- | --- | --- | --- |
| <1/1.5 | 0.002 | 0.001 | – |
| [1/1.5, 1/1.25) | 0.058 | 0.017 | 0.005 |
| [1/1.25, 1.0) | 0.104 | 0.085 | 0.060 |
| 1.0 (null) | 0.166 | 0.171 | 0.162 |
| (1.0, 1.25) | 0.218 | 0.301 | 0.465 |
| [1.25, 1.5) | 0.331 | 0.509 | 0.808 |
| [1.5, 2.0) | 0.458 | 0.710 | 0.952 |
| [2.0, 3.0) | 0.580 | 0.871 | 0.994 |
| ≥3.0 | 0.788 | 0.979 | 1.000 |

Suppose that a novel treatment is expected to have an efficacy in the range of 0.3–0.7 and that, for the treatment to be considered a candidate for further study, it will need to show at least a 100% improvement over the control condition (say a placebo control, so RR ≥ 2.0). Using Table 3, with 12 participants treated with the novel treatment, the median probability of determining an efficacy signal is 87%. Thus, one has a high probability of detecting the efficacy signal with a one-sample approach, a 68% confidence interval, and only 12 observations.

Another way of utilizing Table 3 is to determine when more than 12 participants are needed in a pilot study. For example, if one is studying a novel treatment with a low success probability, the control condition's success probability must be less than one-third that of the novel treatment for the probability of finding the efficacy signal to reach even a marginal level (approximately 80%). These calculations support reservations regarding pilot studies that examine rare events or interventions with limited improvement over the control condition.
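To see where values like those in Table 3 come from, the signal probability for a single configuration can be enumerated exactly: sum the binomial probabilities of every success count whose 68% score interval clears the control proportion. The sketch below uses n = 12, a novel success probability of 0.40, and a 0.20 control (RR = 2.0); this is one configuration of our choosing, and since Table 3 reports medians over many configurations, a single point need not match the table entry:

```python
from math import comb, sqrt
from statistics import NormalDist

def wilson_lower(successes, n, conf=0.68):
    """Lower bound of the Wilson score interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half

def signal_probability(n, p_novel, p_control):
    """Exact probability that a one-sample pilot of size n yields an
    efficacy signal: point estimate above the historical control and
    68% score CI lower bound above it as well."""
    return sum(comb(n, k) * p_novel**k * (1 - p_novel)**(n - k)
               for k in range(n + 1)
               if k / n > p_control and wilson_lower(k, n) > p_control)

print(round(signal_probability(12, 0.40, 0.20), 3))  # 0.775 for this configuration
```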

### Discussion

This study found that effective treatment screening can be accomplished using a one-sample design and a relatively small number of participants. To provide adequate control over the false positive rate, a confidence interval should be used, and the approach that only requires the pilot study to produce a favorable point estimate should be avoided. The enumeration study supported the use of alpha = 0.32, which roughly translates into ±1 standard error about the mean. As few as 12 participants[12] yielded an adequate sample size, provided the expected magnitude of the effect is large (RR > 2.0) and the novel intervention is expected to have a success probability of at least 30%. Larger sample sizes would be required otherwise.

It has been argued that randomization provides additional benefits to the study design by minimizing bias,[2] but in the pilot study context, randomizing an active treatment against a control in a manner consistent with a confirmatory trial setting may be ill advised. The motivating example[16] was able to determine an efficacy signal through a randomized controlled pilot study, but a larger number of patients treated with varenicline would have yielded additional and more precise information about the safety and efficacy of varenicline in people with schizophrenia. Randomization used in the context of treatment screening, however, may be an efficient use of resources. In this context, several novel interventions could be tested by randomizing patients to one intervention (or to more than one, in the case of crossover designs). The important distinction is that the use of a historical control would still be recommended, and the study would consist of multiple one-sample tests relative to the historical control. Testing for superiority between novel interventions would be ill advised in a pilot study setting, where there may be virtually no power to detect differences between novel interventions. In a related context, it is worth noting that the enumeration study's design did not consider the traditional 5% level of significance. The rationale for this exclusion is based on recommendations from the pilot study literature.[14, 15] Furthermore, testing at an elevated significance level increases the power to detect the efficacy signal.

A further consideration in these designs is the primary outcome for the study. A traditional efficacy outcome could be considered in cases like our motivating example, but this need not be the case. The National Institute of Mental Health's Research Domain Criteria (RDoC)[20, 21] strives to expand the diagnosis and treatment of mental health disorders by incorporating a panel of biosignatures into the research paradigm. The activation of a neural circuit, or alteration of neural signatures in response to a treatment, may be one of the hallmarks for demonstrating the biological plausibility of an intervention. Early phase trials thus need to be designed to detect such activation without regard to more traditional efficacy outcomes. The paradigm described in this enumeration study could be one approach to addressing this design challenge.

A limitation of the one-sample approach is its reliance on the availability of data to define the historical control proportion. Often, one can estimate a plausible range of responses for a placebo intervention based on the natural course of the disease or extrapolations from other, similar conditions. For the pilot study design, one can use the general framework presented in Table 3 to further attenuate the limitation of lacking a precise value for the historical control. Should the uncertainty surrounding the historical control be such that a concurrent control is necessary, two-sample designs are certainly a viable option. This research suggests that when a historical control can be used, a savings of approximately half of the required participants can be achieved while providing comparable statistical performance. This will accelerate the pilot phase and enable the research to move more quickly to confirmatory testing.

There are instances, however, where the historical control rate is truly uncertain in the population under study, and one of the aims of the pilot study could be to estimate the effect of the control group as if it were a novel intervention. For example, in settings such as the smoking cessation illustrative study, one would often incorporate behavioral therapy components,[22] and by doing so, the control group response rate may itself need estimation. Here, a randomized design may help eliminate biases and provide preliminary data on both interventions' efficacy signals.

In summary, in a pilot study setting, where the sample size is generally much smaller than that of a formal phase II clinical trial, there is statistical support for focusing on one-sample designs, even in instances where the historical control response rate is not well known.