
Keywords:

  • pilot study;
  • randomization;
  • study design;
  • confidence interval

Abstract


A well-designed pilot study can advance science by providing essential preliminary data to support or motivate further research, refine study logistics, and demonstrate proof of concept. Often, the outcomes of such studies can be quantified by a success/failure dichotomy; for example, a novel compound may activate a neural pathway, or it may not. When an intervention's efficacy is quantified using a dichotomous outcome, only a finite, and often small, number of sample configurations is possible, so probability mass functions can be enumerated to determine the probability that the observed result of a pilot study supports further evaluation of the intervention. The purpose of this research was to determine the probability of an "efficacy signal" for one- and two-sample pilot study designs. The efficacy signal was defined as the probability of observing a more favorable response proportion relative to a historical control (one-sample setting) or a concurrent control (two-sample setting). An enumeration study (exact simulation) was conducted to calculate the efficacy signal probability. One-sample designs yielded a higher probability of detecting an efficacy signal than two-sample designs; however, sampling variation must be accounted for, and a 68% score confidence interval is recommended for this purpose.


Introduction


In the clinical trial paradigm, the overarching objective of evaluating the safety and efficacy of an intervention, most commonly a new regimen, is divided into multiple phases of research. Traditionally, phase I trials have focused on estimating safety and the maximum tolerated dose of the drug. Phase II clinical trials provide data to support an initial estimate of the drug's efficacy and further evaluation of toxicity. In a confirmatory phase III trial, efficacy is evaluated in comparison to a concurrent control. Recently, this established paradigm has been questioned in light of the time required to deliver a product to market and a lower than anticipated rate of successful phase III clinical trials.[1] Furthermore, the literature reflects a vigorous debate regarding the role of randomized phase II trials in advancing the science.[2-6]

In a related context, the role of the pilot study has received considerable attention,[7-14] and pilot studies are widely reported in the literature.[15] Compared to phase II clinical trials, which routinely enroll 50–100 participants,[6] pilot studies tend to be much smaller; Julious[12] recommended studying 12 participants per group. This makes it feasible to screen more than one drug simultaneously within a relatively small study. When more than one group (treatment, intervention, device, etc.) is to be considered in a pilot study, most would agree that incorporating an element of randomization helps attenuate confounding and selection bias,[2] but an unanswered question in the literature is whether one should include more than one intervention in a pilot study at all. This report examines the probability structure of various pilot study designs and discusses how best to balance the need for data supporting further investigation against the effects of sampling variation on the interpretation of those data. The research is illustrated in part by a recent randomized pilot study from the medical literature.[16]

Illustrative example

Weiner et al.[16] recently published a randomized pilot trial of 9 participants, 8 of whom were evaluable at the end of the trial. Participants, all of whom were clinically diagnosed with schizophrenia, were randomized to either varenicline or placebo to assess the safety and efficacy of varenicline for smoking cessation. The study's primary endpoint was cessation, defined as expired CO < 10 at each of the last 4 study visits. The observed cessation (success) proportions were 75% (3 of 4) for varenicline and 0% (0 of 4) for placebo (Fisher's exact p = 0.14). This analysis was supported by a mixed-model analysis of the observed CO levels (the reported p-value for the treatment-by-time interaction was 0.02), and the final conclusion was that varenicline demonstrated a preliminary efficacy signal.
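The reported Fisher's exact p-value can be reproduced directly from the 2×2 table. A minimal sketch (not the authors' analysis code) that enumerates the hypergeometric null distribution:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):  # P(k successes in row 1) under the hypergeometric null
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    return sum(prob(k)
               for k in range(max(0, row1 + col1 - n), min(row1, col1) + 1)
               if prob(k) <= p_obs + 1e-12)

# Weiner et al.: 3 of 4 quit on varenicline, 0 of 4 on placebo
print(round(fisher_exact_two_sided(3, 1, 0, 4), 3))  # → 0.143
```

The exact value is 10/70 ≈ 0.143, consistent with the reported p = 0.14.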

While this study concluded that varenicline demonstrated preliminary efficacy, one might wonder what would happen if the same pilot study design were repeated several times. With only four people treated on each arm, just five unique cessation rates can be observed: 0%, 25%, 50%, 75%, and 100%. The observed cessation rate may therefore bear little resemblance to the true proportion, which lies somewhere on the continuum from 0 to 100%. For example, participants randomized to placebo are not expected to have a 0% cessation rate in a large study, because supportive therapy to quit is commonly provided.[17] Thus, to design an effective pilot study using a binary endpoint, one must account for the discrete nature of the data along with simple binomial probability calculations. The following methodology addresses this need.

Materials and Methods


The purpose of many pilot studies is to provide an initial estimate of treatment effect before evaluating the approach more formally with a larger sample. The sample sizes of such studies are necessarily small. Pilot studies that use a dichotomous variable as the primary outcome, like the illustrative example, are routinely encountered and are the subject of this research. The primary statistic of interest is therefore the "success" proportion (binomial probability), and a successful pilot study is one that demonstrates a "signal" of efficacy. For the purpose of this paper, this initial signal is defined formally in the following section.

Efficacy signal criteria

One-sample approaches used in conjunction with a historical control

The design premise here is that an external benchmark can be used to evaluate the relative merits of the novel approach. In some cases, this could be the success rate of a well-documented control condition that will likely serve as the control in a follow-up study. In other situations, a variety of alternative approaches may be available; for the novel approach to be viewed as a viable alternative or to advance the field, a benchmark success rate can be defined by summarizing those alternatives. For example, one might require that the novel approach be at least as successful as the median (or another percentile) of this collection of alternatives. Common to both scenarios is the need for the historical control or target success rate to be defined prior to the start of the experiment. Given this a priori definition, one may use the sample data to evaluate the treatment's benefit relative to the historical control or target success rate based on point or interval estimation. Three specific forms of this analysis strategy are detailed below.

Superior point estimate: An efficacy "signal" is declared when the sample estimate of the success proportion exceeds the historical control. No consideration of sampling variation enters the decision to advance the approach to further experimentation. Note that while this approach appears ill-advised from a statistical point of view, the complementary decision is informative: when the observed success proportion is less than the historical control, investigators and review panels often have very limited enthusiasm for continued evaluation.

Superior by 90% confidence interval: Testing at alpha = 0.10 is common in phase II studies, and it has been recommended that an elevated type I error rate be used with pilot studies.[14] With this approach, the 90% score (Wilson) confidence interval for the success proportion is estimated. Should the observed confidence interval exclude the historical control value, with the point estimate in the favorable direction, an efficacy "signal" is declared.

Superior by 68% confidence interval: As in the prior approach, interval estimation is used to determine the efficacy "signal"; however, here a 68% confidence interval is suggested. This confidence level was selected based on the manner in which data are commonly summarized in the clinical and translational literature.[18] Conceptually, this interval represents approximately ±1 standard error (SE) about the mean, although with proportions and the asymmetric score confidence interval, ±1 SE is only a rough approximation.
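Both interval criteria rest on the Wilson score interval and the rule that the interval exclude the historical control while the point estimate is favorable. A minimal sketch of both (the example numbers below are illustrative, not from the paper):

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(successes, n, confidence=0.68):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def signals_efficacy(successes, n, historical_p, confidence=0.68):
    """Efficacy 'signal': favorable point estimate AND the score
    interval excludes the historical control value."""
    lo, hi = wilson_ci(successes, n, confidence)
    return successes / n > historical_p and lo > historical_p

# 6 of 12 successes vs. a historical control of 0.30:
print(signals_efficacy(6, 12, 0.30, confidence=0.68))  # → True
print(signals_efficacy(6, 12, 0.30, confidence=0.90))  # → False
```

With 6 of 12 successes, the 68% interval's lower bound (≈0.362) excludes a 0.30 historical control, while the wider 90% interval's lower bound (≈0.286) does not, illustrating how the 68% criterion trades type I error control for sensitivity.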

Two-sample approaches

For the two-sample approaches, randomization is used to allocate either the novel approach or the anticipated control condition of the subsequent confirmatory trial to participants in a 1:1 manner. Alternative allocation ratios could be considered for this design, but for brevity, only 1:1 allocation is detailed as it represents the most common design in the literature.

Choose the winner: This is the two-sample extension of the superior point estimate approach used for the one-sample design. In the two-sample setting, the treatment with the higher success proportion is chosen for continued evaluation. Ideally, the chosen treatment will be the novel approach, but, as outlined above, should the novel approach fail to achieve a superior success proportion, further experimentation is unlikely to occur.

Chi-square test at alpha = 0.10 or 0.32: A Pearson chi-square test of proportions at either the 0.10 or 0.32 level of significance (paralleling the one-sample definitions above) can be used to evaluate the efficacy signal. Since the chi-square test is inherently two-sided, a further condition is that the novel approach must have a higher success proportion than the control to be considered further.
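A sketch of this criterion, using the fact that for a 2×2 table the Pearson chi-square statistic (1 df) equals the squared pooled two-proportion z-statistic, so the critical value is a squared normal quantile:

```python
from statistics import NormalDist

def chisq_signal(x_novel, n_novel, x_ctrl, n_ctrl, alpha=0.32):
    """Two-sample efficacy signal: Pearson chi-square test of two
    proportions at level alpha, plus the directional condition that
    the novel arm's success proportion exceed the control's."""
    p_novel, p_ctrl = x_novel / n_novel, x_ctrl / n_ctrl
    p_pool = (x_novel + x_ctrl) / (n_novel + n_ctrl)
    if p_pool in (0.0, 1.0):
        return False  # degenerate table: no variation to test
    chi2 = (p_novel - p_ctrl) ** 2 / (
        p_pool * (1 - p_pool) * (1 / n_novel + 1 / n_ctrl))
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-square(1 df)
    return p_novel > p_ctrl and chi2 > critical

# Illustrative example data (3/4 vs. 0/4): chi-square = 4.8
print(chisq_signal(3, 4, 0, 4, alpha=0.10))  # → True
```

Note that with such small cell counts the chi-square approximation is rough; it is shown here only because it is one of the six criteria evaluated in the enumeration study.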

Enumeration study

An enumeration study (exact simulation) was conducted to examine the performance of the six criteria identified above for indicating whether an approach has an efficacy signal. The enumeration approach differs from a traditional simulation in that the entire sample space is enumerated, the probability of each sample space element is calculated, and the exact probability of observing an efficacy signal is obtained. It is similar to a traditional simulation study in that distribution parameters are varied to cover a wide range of sample sizes and binomial success rates.

Sample sizes ranging from 5 to 20 were considered for the novel approach. For the two-sample designs, the total sample size ranged from 10 to 40 with equal allocation to the two groups (novel vs. standard); sample sizes are expressed in figures as the number of participants per group. Binomial probabilities ranged from 0.02 to 0.98, inclusive, in increments of 0.02 (binomial probabilities of 0.0 and 1.0 were excluded for computational reasons, for example, division by zero when determining relative risk). Relative risk (RR) was used to quantify the effect size. For the one-sample design, RR was calculated using the historical binomial proportion as the reference probability.

For the one-sample setting, the probability mass function was enumerated at each binomial parameter value by sample size configuration. Indicator variables representing the "signal" definitions described above were used to determine the expected value (probability) of deeming a particular study as having an efficacy signal. In the two-sample setting, the Cartesian product of the two sets of binomial proportions was used to determine the product-binomial probabilities so that a similar expected value calculation could be conducted using the two-sample efficacy signal definitions. The resulting expectations can be interpreted as an innovative power function for pilot studies.
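The one-sample calculation reduces to weighting each possible outcome's signal indicator by its binomial probability. A sketch (not the authors' SAS code) that reproduces the expected value for the superior point estimate criterion with n = 5, a true success probability of 0.40, and a historical control of 0.20:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def signal_probability(n, p_true, p_hist, signal):
    """Enumerate the binomial sample space and return the expected
    value of the signal indicator, i.e. P(efficacy signal)."""
    return sum(binom_pmf(k, n, p_true) for k in range(n + 1)
               if signal(k, n, p_hist))

# Superior point estimate: sample proportion strictly exceeds the control
point_estimate_signal = lambda k, n, p0: k / n > p0

prob = signal_probability(5, 0.40, 0.20, point_estimate_signal)
print(round(prob, 5))  # → 0.66304
```

The result, 0.66304, matches the expected value shown later in Table 1; swapping in a different `signal` function (e.g., a confidence-interval rule) yields the other one-sample criteria.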

Enumerated datasets were generated using the SAS System (Version 9.22; Cary, NC, USA). RRs, confidence intervals, and statistical tests were computed using direct programming in DATA steps and SAS macros. To aid interpretation, the enumeration studies were stratified by RR and the relative efficacy of the novel approach. A specific set of results is presented for a sample size of 12, based on recommendations from the literature.[12, 19]

Results


A total of 38,416 unique study configurations (49 binomial probabilities for each of the 2 groups, 16 sample sizes) were enumerated for this study, yielding population RR values ranging from 1/49 to 49. Tables 1 and 2 illustrate how the data are summarized for the one-sample (superior point estimate) and two-sample (choose the winner) definitions to determine whether there is an efficacy signal. In both scenarios, the novel approach is assumed to have a binomial success probability of 0.4 and the control (historical or concurrent) a probability of 0.2. With this scenario, RR = 2.0, and one observes little difference in the expected values (0.66 vs. 0.64). These expected values can be interpreted as follows:

Table 1. Calculations for a one-sample design with a sample size of 5 and binomial probabilities of 0.40 for the novel intervention and 0.20 for the historical control. The efficacy indicator takes the value 1 when 2 or more successes are observed in a sample of size 5 (success proportions of 0.4, 0.6, 0.8, and 1.0). When the true binomial proportions are 0.4 and 0.2 with 5 participants, the expected value (probability) of finding a "signal" based on a favorable point estimate is 0.66

Potential no.   Sample success   Binomial           Efficacy       Pr * I
successes       proportion       probability (Pr)   indicator (I)
0               0.0              0.0778             0              0.0000
1               0.2              0.2592             0              0.0000
2               0.4              0.3456             1              0.3456
3               0.6              0.2304             1              0.2304
4               0.8              0.0768             1              0.0768
5               1.0              0.0102             1              0.0102
                                                    Expected value 0.66304
Table 2. Two-sample design with the novel and control interventions having binomial probabilities of 0.40 and 0.20, respectively. Cells on or above the diagonal, where the number of successes in the concurrent control equals or exceeds the number in the novel approach, represent sample observations without an efficacy signal. The sum of the remaining cells, below the diagonal, is the expected value for the "Choose the winner" approach: 0.64331

Product-binomial probabilities
No. successes: novel (rows) by concurrent control (columns)

          0       1       2       3       4       5
0         0.025   0.032   0.016   0.004   0.000   0.000
1         0.085   0.106   0.053   0.013   0.002   0.000
2         0.113   0.142   0.071   0.018   0.002   0.000
3         0.075   0.094   0.047   0.012   0.001   0.000
4         0.025   0.031   0.016   0.004   0.000   0.000
5         0.003   0.004   0.002   0.001   0.000   0.000

One-sample setting: When the historical control has a 20% success rate and the novel approach has a 40% success rate, studying 5 participants will yield an efficacy signal (defined as 2 or more successes) 66% of the time. This also implies that 34% of the time the true effect will not be detected in a pilot study with only 5 participants.

Two-sample setting: When the concurrent control has a 20% success rate and the novel approach has a 40% success rate, studying 5 participants in each condition (10 total participants) will identify the efficacy signal (defined as a higher success proportion in the novel approach) 64% of the time.

One immediately sees that studying twice the number of participants in the two-sample setting did not improve the probability of observing the efficacy signal based on point estimates. In fact, the two-sample setting has a lower probability of observing that the novel approach is superior, despite the advantage of a larger sample size. This results from incorporating two sources of variation into the study design: the probability of observing a sample in which the concurrent control appears more favorable is small but not nonexistent. The probabilities for such data realizations are the cells of Table 2 in which the control's number of successes equals or exceeds the novel approach's. The remainder of the results expands upon this example calculation to include the full range of comparison techniques and sample sizes.
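The two-sample expected value in Table 2 can likewise be reproduced by enumerating the product-binomial sample space. A sketch under the same parameters (n = 5 per arm, success probabilities 0.40 and 0.20):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def choose_winner_probability(n, p_novel, p_ctrl):
    """P(novel arm shows strictly more successes than control) under
    independent binomials -- the 'Choose the winner' expected value."""
    return sum(binom_pmf(i, n, p_novel) * binom_pmf(j, n, p_ctrl)
               for i in range(n + 1)
               for j in range(n + 1)
               if i > j)

print(round(choose_winner_probability(5, 0.40, 0.20), 5))  # → 0.64331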

Figure 1 illustrates the six candidate methods for evaluating the efficacy signal across 9 strata of RR. The probability of concluding an efficacy signal increases as the magnitude of association increases from stratum 1 (RR < 1/1.5) to stratum 9 (RR ≥ 3.0). For instances where the novel intervention is inferior to the control (historical or concurrent), the point estimate approach yields an unsatisfactorily high false positive rate. Incorporating statistical inference, even at the alpha = 0.32 level of significance, controls the false positive findings. Further, the lines representing the one-sample and two-sample designs are nearly coincident when the control condition is superior to the novel approach, yet the sample size required for the two-sample design is twice that of the one-sample design.


Figure 1. Probability of finding an efficacy signal for one- and two-sample study designs, stratified by magnitude of effect quantified by relative risk. Probabilities are determined by an enumeration study using binomial and product-binomial probabilities.


RR stratum 4 (RR = 1.0) presents data analogous to the size of the test, but differs in that the efficacy signal carries the further condition that the point estimate be greater than the control. With this added condition, testing even at a highly elevated alpha = 0.32 yields a probability of finding an efficacy signal within the range of acceptable false positives for pilot studies, namely <25%.[14] For this reason, and because of the parallel to the common method of summarizing translational studies (mean ± 1 SE), the alpha = 0.32 standard is used in the discussion that follows.

The remaining panels of Figure 1 examine the probability of concluding an efficacy signal over the range of RRs in which the novel approach is superior to the control condition. When the RR is small, say less than 1.25, the one-sample and two-sample designs using alpha = 0.32 yield very similar findings. As before, the one-sample designs provide better efficiency through a lower total sample size. As the magnitude of the effect increases, the probability of concluding an efficacy signal is markedly higher using a one-sample design. Again, this increase in probability occurs despite using only half the total number of participants.

Figure 1, while informative from a statistical performance perspective, has limited utility in practice because the true underlying binomial probabilities are unknown. Table 3 addresses this concern by aggregating the data over the risk strata for three hypothesized regions of expected novel treatment efficacy. The three categories represent a low probability of success (binomial success probabilities <0.3), an average probability of success ([0.3, 0.7]), and a high probability of success (>0.7). To allow for various control response rates, the same 9 RR strata used in Figure 1 are retained. Use of the table is straightforward, as the following examples illustrate.

Table 3. Expected power based on n = 12 observations for detecting an "efficacy signal" for various magnitudes of effect. Values reported are the medians of the expected values for the probability of determining that a novel intervention has an "efficacy signal," based on the one-sample 68% confidence interval approach. Calculations are grouped into three categories of efficacy for the novel treatment
                      Expected response proportion for participants
                      treated with the novel treatment
Relative risk range   [0.02, 0.30)   [0.30, 0.70]   (0.70, 0.98]
<1/1.5                0.002          0.001
[1/1.5, 1/1.25)       0.058          0.017          0.005
[1/1.25, 1.0)         0.104          0.085          0.060
1.0 (Null)            0.166          0.171          0.162
(1.0, 1.25)           0.218          0.301          0.465
[1.25, 1.5)           0.331          0.509          0.808
[1.5, 2.0)            0.458          0.710          0.952
[2.0, 3.0)            0.580          0.871          0.994
≥3.0                  0.788          0.979          1.000

Suppose that a novel treatment is expected to have an efficacy in the range of 0.3–0.7, and that for the treatment to be considered a candidate for further study, it must show at least a 100% improvement over the control condition (suppose a placebo control, RR ≥ 2.0). Using Table 3, with 12 participants treated with the novel treatment, the median probability of determining an efficacy signal is 87%. Thus, one has a high probability of detecting the efficacy signal with a one-sample approach, a 68% confidence interval, and only 12 observations.
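As a single-configuration check of this reading (not the table's median, which aggregates many parameter settings), one can enumerate the signal probability for an assumed true novel success rate of 0.50 against a historical control of 0.25 (RR = 2.0) with n = 12 and the 68% Wilson interval criterion:

```python
from math import comb, sqrt
from statistics import NormalDist

def wilson_lower(k, n, confidence=0.68):
    """Lower bound of the Wilson score interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half

def signal_prob(n, p_true, p_hist, confidence=0.68):
    """P(point estimate favorable AND interval excludes the control),
    enumerated over the binomial sample space."""
    return sum(comb(n, k) * p_true**k * (1 - p_true)**(n - k)
               for k in range(n + 1)
               if k / n > p_hist and wilson_lower(k, n, confidence) > p_hist)

print(round(signal_prob(12, 0.50, 0.25), 3))  # → 0.806 for this configuration
```

For this one parameter setting the signal probability is about 81%; the 0.871 in Table 3 is the median over all enumerated configurations in the [2.0, 3.0) stratum with novel efficacy in [0.30, 0.70], so the two numbers are consistent in spirit but not expected to coincide.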

Another way of using Table 3 is to determine when more than 12 participants are needed in a pilot study. For example, when studying a novel treatment with a low success probability, the control condition must have a success probability less than one-third that of the novel treatment for the probability of finding the efficacy signal to be even marginal (approximately 80%). These calculations support reservations about pilot studies that examine rare events or interventions offering limited improvement over the control condition.

Discussion


This study found that effective treatment screening can be accomplished using a one-sample design and a relatively small number of participants. To provide adequate control of the false positive rate, a confidence interval should be used; the approach requiring only a favorable point estimate from the pilot study should not be. The enumeration study supported the use of alpha = 0.32, which roughly translates into ±1 standard error about the mean. As few as 12 participants[12] yielded an adequate sample size provided the expected magnitude of the effect is large (RR > 2.0) and the novel intervention is expected to have a success probability of at least 30%; larger sample sizes are required otherwise.

It has been argued that randomization provides additional benefits to the study design by minimizing bias,[2] but in the pilot study context, randomizing between an active treatment and a control in a manner consistent with a confirmatory trial may be ill advised. The motivating example[16] was able to detect an efficacy signal through a randomized controlled pilot study, but a larger number of patients treated with varenicline would have yielded additional and more precise information about the safety and efficacy of varenicline in people with schizophrenia. Randomization used in the context of treatment screening, however, may be an efficient use of resources. In this context, several novel interventions could be tested by randomizing patients to one intervention (or more than one, in the case of crossover designs). The important distinction is that the use of a historical control would still be recommended, and the study would consist of multiple one-sample tests relative to the historical control. Testing for superiority between novel interventions would be ill advised in a pilot study setting, where there may be virtually no power to distinguish among them. In a related context, it is worth noting that the enumeration study's design did not consider the traditional 5% level of significance. The rationale for this exclusion is based on recommendations from the pilot study literature.[14, 15] Furthermore, testing at an elevated significance level increases the power to detect the efficacy signal.

A further consideration in these designs is the primary outcome for the study. A traditional efficacy outcome could be used, as in our motivating example, but this need not be the case. The National Institute of Mental Health's Research Domain Criteria (RDoC)[20, 21] strives to expand the diagnosis and treatment of mental health disorders by incorporating a panel of biosignatures into the research paradigm. The activation of a neural circuit, or alteration of neural signatures in response to a treatment, may be one hallmark for demonstrating the biological plausibility of an intervention. Early-phase trials thus need to be designed to detect such activation without regard to more traditional efficacy outcomes. The paradigm described in this enumeration study could be one approach to addressing this design challenge.

A limitation of the one-sample approach is its reliance on the availability of data to define the historical control proportion. Often, one can estimate a plausible range of responses for a placebo intervention based on the natural course of the disease or extrapolation from other, similar conditions. For the pilot study design, the general framework presented in Table 3 can further attenuate the limitation of lacking a precise value for the historical control. Should the uncertainty surrounding the historical control be great enough that a concurrent control is necessary, two-sample designs are certainly a viable option. This research suggests that when a historical control can be used, approximately half of the required participants can be saved while providing comparable statistical performance, accelerating the pilot phase and enabling the research to move more quickly to confirmatory testing.

There are instances, however, where the historical control rate is truly uncertain in the population under study, and one aim of the pilot study could be to estimate the effect of the control group as if it were a novel intervention. For example, in settings such as the smoking cessation illustrative study, one would often incorporate behavioral therapy components,[22] and by doing so, the control group response rate may itself need estimation. Here, a randomized design may help eliminate biases and provide preliminary data on both interventions' efficacy signals.

In summary, in the pilot study setting, where the sample size is generally much smaller than that of a formal phase II clinical trial, there is statistical support for focusing on one-sample designs, even in instances where the historical control response rate is not well known.

Acknowledgments


This publication was partially supported by UL1 TR000135 from the National Center for Advancing Translational Sciences (NCATS) and R01 DK 085516 from the National Institute of Diabetes and Digestive and Kidney Diseases. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

References
