Evidence-Based Medicine: The Design and Interpretation of Noninferiority Clinical Trials in Veterinary Medicine


Abstract

Noninferiority trials are clinical studies designed to demonstrate that an investigational drug is at least as effective as an established treatment within a predetermined margin. They are conducted, in part, because of ethical concerns about administering a placebo to veterinary patients when an established effective treatment exists. The use of noninferiority trial designs has become more common in veterinary medicine with the increasing number of established veterinary therapeutics and the desire to eliminate potential pain or distress in a placebo-controlled study. Selecting the appropriate active control and an a priori noninferiority margin between the investigational and active control drugs are unique and critical design factors for noninferiority studies. Without reliable historical knowledge of the disease response in the absence of treatment and of the response to the selected active control drug, proper design and interpretation of a noninferiority trial are not possible. Despite the appeal of conducting noninferiority trials to eliminate the ethical concerns of placebo-controlled studies, there are real limitations and possible ethical conundrums associated with noninferiority trials. The consequences of incorrect study conclusions caused by poor noninferiority trial design need careful attention. Alternative designs to typical noninferiority studies exist, but these too have limitations and must be carefully considered.

Abbreviations

CI, confidence interval
EMA, European Medicines Agency
FDA, Food and Drug Administration
H0, null hypothesis
HA, alternative hypothesis
ITT, intent-to-treat
NSAIDs, nonsteroidal anti-inflammatory drugs
PP, per-protocol

Clinical trials conducted in subjects with naturally occurring disease are not only considered the gold standard to demonstrate that drugs are safe and effective but also are required to achieve drug approval by both the Food and Drug Administration (FDA) and European Medicines Agency (EMA). In the broadest sense, there are 3 types of trial designs: superiority, equivalence, and noninferiority studies. Superiority studies seek to establish effectiveness by demonstrating an outcome superior to a control product, typically a placebo. Equivalence studies are designed to confirm that there is no clinically relevant difference between treatments. Noninferiority studies are designed to demonstrate that an investigational drug is at least as effective, within an acceptably small margin, as an established, well-studied treatment.

There are several medical, legal, statistical, and ethical factors that influence design selection. A placebo-controlled superiority trial often is pursued when there is no well-studied standard of care. Moreover, when a new drug is being developed for regulatory approval, a placebo control generally is considered ideal unless an existing standard of care, used as an active control, is approved for the species and indication for which the investigational drug is being examined. Active-controlled superiority trials, however, generally are not pursued, either because it is implausible that a new treatment will be superior to an established treatment or because the expected improvement is so small that an appropriately powered study would be prohibitively large. Equivalence trials are designed to demonstrate that the effectiveness of an investigational drug is similar to that of an active control, within a predetermined indifference zone,[1] and most commonly are used for pharmacokinetic bioequivalence determination. Noninferiority trials sometimes are conducted solely because of the ethical concerns of administering a placebo in a study when an established effective treatment exists.[2, 3] Another reason to conduct a noninferiority trial is to determine the comparative safety and effectiveness of 2 products.[2] Even though an effective treatment already exists, a new treatment that is no better than the current treatment may be desirable because of improved safety, decreased cost, improved compliance, or ease of administration.[4]

There have been a number of recently FDA- or EMA-approved drugs for which sponsors have conducted noninferiority trials to demonstrate safety and effectiveness (Table 1). The first noninferiority studies in veterinary medicine were not reported until 2003, and such studies have become more common in the peer-reviewed literature over the past few years. This trend toward noninferiority trials in veterinary medicine may be driven by ethical concerns with placebo-controlled studies. Informed consent cannot be obtained from the subject under study (eg, the dog), nor can the subject choose to withdraw of its own will, and therefore trial designs often err on the side of caution. In the sense that informed consent is obtained not from the patient but from a legal guardian or owner, clinical trials in veterinary medicine share a unique commonality with pediatric trials in human medicine. Conducting negatively controlled clinical trials of analgesics in dogs and cats is especially difficult because of ethical concerns about knowingly withholding pain medication, coupled with the greater availability of analgesics in these species,[5] and in fact analgesic studies in dogs and cats were the most commonly identified veterinary noninferiority trials (Table 1). Also, similar to the situation in human pediatric medicine, there are many major diseases in veterinary medicine for which a standard of care exists, but no effect size has been established compared with a placebo. As discussed below, without appropriate placebo-controlled studies, there is insufficient information to justify the use of the established treatment as an active control in a noninferiority trial.[3]

Table 1. Recent noninferiority studies in veterinary medicine.a

| Year of Publication | Species | Therapeutic Class | Primary Endpoint | A Priori Selected Margin (Justification) | Noninferiority Margin | Comments |
|---|---|---|---|---|---|---|
| 2003[32] | Pig | Anti-inflammatory | Clinical index score | Yes (no justification) | 1.4 score points | |
| 2004[33] | Cat | Analgesic | Pain intervention rate | No | NA | |
| 2004[34] | Dog | Antibiotic | Cure rate | Yes (no justification) | 15% | |
| 2005[35] | Dog | Parasiticide | Cure rate | Yes (clinical judgment) | 20% | |
| 2005[36] | Dog | Parasiticide | Cure rate | Yes (clinical judgment) | 20% | |
| 2005[37] | Dog | Vaccine | Geometric mean virus titers | No | NA | Used superiority analysis and concluded noninferiority based on no significant difference |
| 2005[38] | Horse | Analgesic | Clinical improvement | Yes (no justification) | 15% | |
| 2006b,[39, 40] | Dog | Analgesic | Incidence of clinical improvement | Yes (no justification) | 15% | |
| 2007[41] | Cat | Anthelmintic | Success rate | Yes (no justification) | 15% | |
| 2007[42] | Cat | Sedative and Analgesic | Success rate | Yes (no justification) | 13% | |
| 2007[43] | Cattle | Disinfectant | Infection rate | No, because of lack of historical study | NA | |
| 2007[44] | Dog | Antibiotic | Success rate | Yes (clinical judgment) | 15% | |
| 2007[45] | Dog | Antihypertensive | Success rate | Yes (no justification) | 20% | |
| 2007[46] | Dog | Inotrope and Vasodilator | Success rate | Yes (no justification) | Not reported | |
| 2008[47] | Cat | Antibiotic | Success rate | Yes (clinical judgment) | 15% | |
| 2008[48] | Cat | Antibiotic | Success rate | Yes (no justification) | 15% | |
| 2008[48] | Dog | Antibiotic | Success rate | Yes (no justification) | 15% | |
| 2009[49] | Cattle | Antibiotic | Success rate | Yes (no justification) | 15% | |
| 2009[50] | Dog | Anthelmintic | Log-transformed egg count | Yes (no justification) | .38 | |
| 2009[51] | Dog | Antibiotic and Antifungal | Improvement rate | Yes (no justification) | 10% | |
| 2010[52] | Cat | Analgesic | Assessment score ratio | Yes (no justification) | .2 | |
| 2010[53] | Cattle | Antibiotic | New infection rate and cure rate | Yes (clinical judgment) | 3 and 10% | |
| 2010[54] | Dog | Antifungal | Yeast count and score | Yes (clinical judgment) | 20% of control | |
| 2011[55] | Cat | Antianginal | Cardiac functions | Yes (clinical and statistical judgments based on random effects) | 50% of control | |
| 2011[25] | Dog | Analgesic | Glasgow Composite Pain Scale ratio | Yes (no justification) | .2 | |
| 2011[56] | Dog | Anthelmintic | Log-transformed egg count | Yes (no justification) | .2 | |
| 2012[57] | Cat | Analgesic | Total clinician score ratio | Yes (based on a previous human study) | .25 | |
| 2012[58] | Cat | Analgesic | Total clinician score ratio | Yes (based on a previous human study) | .25 | |
| 2012[59] | Dog | Analgesic | Failure rate | Yes (no justification) | 15% | |
| 2012[60] | Dog | Analgesic | Failure rate | Yes (based on a previous study of a different control) | 15% | |
| 2012[61] | Dog | Analgesic | Global functional disability score ratio | Yes (based on a previous human study) | .25 | Lower CI bound of ratio <.75 (point estimate 1.244), but noninferiority still concluded |
| 2012[62] | Dog | Analgesic | Rating scale ratio | Yes (no justification) | .2 | |
| 2012[63] | Dog | Antibiotic | Clinical cure rate | Yes (no justification) | 20% | |
| 2013[64] | Cattle | Antibiotic | Cure rate | Yes (no justification) | 10% | |
| 2013[65] | Dog | Analgesic | Glasgow Composite Pain Scale ratio | Yes (no justification) | .2 | |
| 2013[66] | Dog | Analgesic | Success rate | Yes (clinical judgment) | 20% | |

NA, not applicable; CI, confidence interval.
a Includes published, peer-reviewed studies and studies reported in FDA freedom of information summaries.
b Peer-reviewed publication year.

Although there are many study design differences between noninferiority and “classical” negatively controlled superiority trials, the most important in a noninferiority study are as follows: (1) switching the hypotheses; (2) selecting the appropriate active control; (3) determining the noninferiority margin between the active control and investigational drug; and (4) appropriately interpreting the study results. In human medicine, there are a number of national and international regulatory guidance documents on the appropriate conduct of noninferiority trials,[2, 6, 7] and the FDA recently released guidance for veterinary medicine.[8] Despite this growing body of studies and guidance, most documents fail to provide any specific guidelines on how to choose the active control and how to determine the noninferiority margin.[2, 6, 9] Because of the increased use of noninferiority trials in veterinary medicine and the unique design features of these trials, the objective of this review is to outline the key design aspects and proper interpretation of noninferiority trials.

Design of Noninferiority Trials

Switching the Hypotheses

In a superiority trial, the null (H0) and alternative (HA) hypotheses about the effect of a new test treatment (μT) relative to the active control treatment (μA) are the familiar:

H0: μT − μA = 0 versus HA: μT − μA ≠ 0    (1)

If H0 is rejected based on the study results, it is concluded that treatment has a significant effect. However, if the study fails to reject H0, the conclusion is not that the treatments are equal, but that there is insufficient evidence to conclude that H0 is not true. In other words, failing to reject H0 does not result in the conclusion that H0 is true.[10] If it were, one could conclude that 2 treatments are the same by failing to reject the null hypothesis, an erroneous conclusion reached in some small studies in veterinary medicine. To test if there is no difference in the treatments, H0 and HA must be reversed to:

H0: μT − μA ≠ 0 versus HA: μT − μA = 0    (2)

With the hypotheses reversed, rejection of the null hypothesis results in the conclusion that there is no difference between the treatments. However, demonstrating that the 2 treatments are identical (ie, that the difference in their effects is exactly zero) is neither possible nor usually of clinical interest. What usually is of interest is determining that the difference between treatments is not too far from zero. More relevant to a noninferiority trial, the clinical interest is determining that the test treatment is at least as effective as the active control treatment. That is, the investigational drug is considered noninferior to an active control within a predetermined acceptable difference. This allowed difference between the active control and test responses is referred to as the noninferiority margin or margin of indifference (δ). In the case where a larger effect indicates better effectiveness of a treatment (eg, cure rate), the noninferiority trial null and alternative hypotheses become:

H0: μT − μA ≤ −δ versus HA: μT − μA > −δ    (3)

And in the case where a smaller effect indicates better effectiveness (eg, mortality rate), the hypotheses become:

H0: μT − μA ≥ δ versus HA: μT − μA < δ    (4)

In either case, if H0 is rejected based on the study results, it is concluded the test treatment is noninferior to the active control treatment.

Another common set-up of hypotheses for noninferiority trials is the comparison of ratios (eg, risk, odds, or hazard). For simplicity of this review, only absolute treatment differences will be discussed. However, the design and interpretation of noninferiority trials with the hypotheses set up as ratios can be logically extended from the discussion below.

Type I and Type II Errors

Although the hypotheses are reversed, as described above, the definitions of Type I and Type II errors remain the same. Only the interpretations are different from those under the conventional hypotheses (Equation (1)). Under the noninferiority hypotheses, the Type I error is to wrongly conclude that the investigational drug is noninferior to the active control, whereas it is actually inferior to the active control. Conversely, the Type II error is to wrongly conclude that the investigational drug is inferior to the active control, whereas it actually is noninferior.

Sample Size Considerations

The sample size of a noninferiority trial typically is of the same order of magnitude as that of a superiority trial.[11] If the sample size calculation in a noninferiority trial is performed under the alternative hypothesis that the investigational drug is as effective as the active control (ie, μT − μA = 0), then the noninferiority margin is analogous, for powering purposes, to the assumed difference between treatments under the alternative hypothesis in a superiority trial. Thus, the smaller the margin, the larger the sample size that is needed. However, the required sample size may increase greatly if the investigational drug is actually slightly less effective than the active control.[10]
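The dependence of sample size on the margin and on the assumed true difference can be sketched with the standard normal-approximation formula for comparing 2 proportions. This is an illustrative simplification only (the function name and the example response rates are hypothetical, and a formal design would use exact or simulation-based methods):

```python
from math import ceil
from statistics import NormalDist

def noninferiority_n_per_arm(p, delta, true_diff=0.0, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a noninferiority comparison of two
    proportions (normal approximation, one-sided alpha).  `delta` is the
    noninferiority margin and `true_diff` is the assumed true difference
    (test minus active control) under the alternative hypothesis."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * p * (1 - p)                  # both arms respond near rate p
    effect = delta + true_diff                  # distance from truth to margin
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# If the test drug truly is as effective as the control (true_diff = 0), a
# 15-percentage-point margin at an 80% response rate needs 88 animals per arm;
# if the test drug actually is 5 percentage points worse, the requirement
# more than doubles.
n_equal = noninferiority_n_per_arm(0.80, 0.15)                   # 88
n_worse = noninferiority_n_per_arm(0.80, 0.15, true_diff=-0.05)  # 198
```

This illustrates the point in the text: shrinking the effective distance between the true difference and the margin inflates the required sample size quadratically.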

Assay Sensitivity

An essential property of a noninferiority trial is assay sensitivity.[2, 6-9] Assay sensitivity is the ability of a trial to distinguish an effective treatment from a less effective or ineffective treatment. A successful superiority trial, in which the new treatment is determined to be significantly better than the negative or positive control, has inherent assay sensitivity. In contrast, if the hypothesis test in a superiority trial fails to reject the null hypothesis of equal treatment effects, it could be because the new treatment is truly ineffective or because the trial lacked assay sensitivity. Even if a noninferiority trial is successful and the new treatment is determined to be noninferior to the active control, it could be that (1) both treatments are similarly effective, (2) both treatments are similarly ineffective, or (3) the trial lacks assay sensitivity. In other words, the assay sensitivity of the trial is unknown based on the results of the study alone. In a noninferiority study, the assay sensitivity of the trial is assessed from historical evidence of sensitivity of the trial design to the active control treatment's effects, based on 1 or more well-designed, previously conducted studies of the active control compared with a placebo. In the absence of a placebo-controlled superiority trial with assay sensitivity, at minimum there needs to be rigorous historical knowledge of the course of the disease in the absence of treatment and separate knowledge of the disease response when the active control is administered. Primarily because they eliminate the concern of assay sensitivity within the trial, placebo-controlled superiority studies generally are preferred over noninferiority studies when ethically appropriate, because they more conclusively and less subjectively demonstrate a treatment's effectiveness.[12]

There are several particularly important design features that must be followed to gain confidence that the planned noninferiority trial maintains assay sensitivity. First, the trial must use a treatment regimen similar to that of the prior trial(s) of the active control against the placebo. For example, data from a single-dose study of the active control would not support a planned noninferiority trial examining multiple doses over time. Second, the planned noninferiority trial must study a patient population similar to that of the placebo-controlled trial of the active control. If the active control was studied in a population of patients with mild disease, those data would not support a planned noninferiority study in a population of patients with severe disease (Fig 1). Third, the selected primary outcome measure and the time of evaluation in the noninferiority trial must be an endpoint measured and reported in a placebo-controlled trial of the selected active control, because this is the outcome that has been shown to demonstrate sensitivity to treatment effects. For example, if the active control effect size was measured with a visual analog scale, it would be problematic to assume that a noninferiority study using a Glasgow-modified pain scale (Fig 2) would be equally sensitive.

Figure 1.

The effect that changes in disease severity enrollment criteria (newly diagnosed versus established disease) may have on the response of the active control. If the distribution of the disease shifts to greater severity with different enrollment criteria, the effectiveness of the active control compared with the placebo may be reduced.

Figure 2.

The impact that changing the clinical endpoint from one scale to another may have on the effect of the active control. The differences between the active control and placebo cannot be transformed from the visual analog scale to the Glasgow-Modified Pain Scale.

The study quality of the trial(s) conducted with the selected active control compared with the placebo is also important. A high-quality, well-described trial demonstrating the effectiveness of the active control ensures a high degree of confidence that the effect is reproducible (ie, assay sensitivity) and that trial design can be recreated. If a single low-quality study of the active control is all that exists, the strength of evidence from the noninferiority trial will be undermined, no matter how well the noninferiority trial is conducted.

Constancy Assumption

Implicit in the choice of active control treatment is the assumption that the difference between the active control and the placebo is constant from study to study (Fig 3). This constancy assumption is needed so that an appropriate noninferiority margin can be determined from the historical placebo-controlled trials. The factors that are favorable to assay sensitivity, as discussed in the previous section, also support the constancy assumption. In addition, the conditions under which the active control's effect was evaluated should be relatively recent, to ensure that clinical practice has not changed substantially since the placebo-controlled trial(s) was(were) conducted. If clinical practice has not changed substantially, the effect of the active control compared with a placebo can be expected to be consistent with historical data. The constancy assumption may be problematic in the case of subjective and highly variable responses,[8] such as animal pain or behavioral endpoints that result in subjective clinical scores. If the effect of the active control is not relatively constant from study to study, it cannot be used for a noninferiority study, because a reliable effect of the active control cannot be established and thus the assay sensitivity of any trial conducted cannot be ascertained. When the effect size can be inconsistent because of a subjective and variable clinical endpoint, the disease state itself may not lend itself to a noninferiority study. There is a risk that a new drug could be demonstrated as noninferior to an active control that is ineffective in the clinical trial, leading to the incorrect conclusion that the new drug is effective. Wide clinical use of an ineffective treatment introduces a much broader ethical dilemma than conducting a single placebo-controlled study. For this reason, regulatory bodies have required placebo-controlled superiority trials to increase the degree of confidence in the decision about the effectiveness of treatment.

Figure 3.

The potential impact if the constancy assumption is violated. As the effect of the active control is less in the noninferiority trial than in the historical negative control trial used for trial design, the effect size of the active control is unknown and the test drug may not be effective at all, even if noninferiority is concluded.

Determining the Noninferiority Margin

A key consideration in the design of noninferiority trials is to select the noninferiority margin (δ) a priori.[3] The noninferiority margin is the maximum difference in treatment effectiveness that can be allowed. The choice of δ should be based both on statistical reasoning and clinical judgment, should reflect uncertainties in the evidence on which the choice is based, and should be suitably conservative.[2] Furthermore, the margin cannot be greater than the smallest effect size that the active control would be reliably expected to have compared with a placebo. The process of selecting a margin can be broken down logically into 2 steps. The first step uses statistical reasoning to determine the smallest reliable effect size of the active control compared with placebo, and the second step uses clinical judgment to determine the largest loss of effectiveness of the new treatment that would be considered clinically insignificant, constrained so that it does not exceed the smallest reliable effect size. The result is a single clinically and statistically relevant noninferiority margin.

Smallest Reliable Effect Size

Determining the smallest reliable effect size is the more objective step of determining the noninferiority margin, although some subjectivity still is incorporated into the process. The smallest reliable effect size, denoted here as δ0, also is referred to as margin 1.[7, 8] It is determined based on prior experience with the active control compared with placebo or no treatment. The smallest reliable effect size is considered to represent the whole effect of the active control treatment and it is critical because if the difference between the active control and the new treatment is greater than the effect size, the new treatment may have no effect at all.

Many different approaches have been proposed for establishing a margin based on statistical reasoning,[4, 7, 13, 14] but there is no single accepted method. All methods involve using the results of 1 or more placebo-controlled studies of the active treatment to determine the difference between the active control and placebo. The most straightforward and probably most commonly used procedure is to calculate the lower bound of the 2-sided 90% or 95% confidence interval (CI) of the absolute value of the mean difference between the active drug and placebo. The CI can be constructed from a single well-designed study, or if multiple placebo-controlled studies of the active treatment have been conducted, it can be constructed using a random effects model approach. The use of the lower bound of the 95% CI conservatively incorporates the uncertainty of effect of the active treatment, thus ensuring with a high degree of confidence that if the magnitude of the difference in treatments is less than this value, the new treatment has an effect relative to the placebo.
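As a simplified sketch of this step, δ0 can be taken as the lower bound of the 2-sided 95% CI for the difference in success rates from a single historical placebo-controlled study of the active control. The Wald interval and the example counts below are illustrative assumptions, not values from any cited trial:

```python
from math import sqrt
from statistics import NormalDist

def smallest_reliable_effect(x_active, n_active, x_placebo, n_placebo, level=0.95):
    """delta_0: lower bound of the 2-sided Wald CI (normal approximation)
    for the difference in success rates, active control minus placebo."""
    p_a = x_active / n_active
    p_p = x_placebo / n_placebo
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p_a * (1 - p_a) / n_active + p_p * (1 - p_p) / n_placebo)
    return (p_a - p_p) - z * se

# Hypothetical historical study: 80/100 successes on the active control
# versus 50/100 on placebo.  The point estimate of the effect is 0.30, but
# the smallest reliable effect size is only about 0.17.
delta_0 = smallest_reliable_effect(80, 100, 50, 100)
```

The gap between the point estimate (0.30) and δ0 (about 0.17) shows how the CI-based approach conservatively discounts the active control's apparent effect for the uncertainty in the historical data.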

Clinical Judgment

To determine the noninferiority margin (δ), clinical judgment as well as statistical reasoning must be incorporated into its selection. This margin incorporating clinical judgment sometimes is also referred to as margin 2.[7, 8] Incorporating clinical judgment into δ is subjective, and little has been published on a recommended procedure. The clinically relevant margin of indifference often is described as the largest loss of effectiveness of the new treatment that would be considered clinically insignificant. Obviously, the selected value may vary substantially among individuals, and likely varies within an individual on a case-by-case basis. For example, consider an active control treatment that has adverse effects that cannot be tolerated by a particular individual. This patient would consider a much larger loss of effectiveness clinically insignificant compared with a patient that tolerates the active control treatment well. Furthermore, the severity of the disease state being treated strongly influences the magnitude of δ. To ensure that the new treatment has some effect, the selected δ cannot be greater than the smallest reliable effect size (ie, δ ≤ δ0). Therefore, δ0 must be determined before selecting δ. Ensuring that δ ≤ δ0 also allows for straightforward incorporation of statistical reasoning into the selection of the noninferiority margin. Probably the most common procedure for selecting δ is to halve δ0, thereby maintaining at least half of the clinical effect of the active control.[7, 8] An alternative, less conservative, approach is to halve the point estimate of the mean difference between the active control and placebo, and then select the smaller of this value and δ0 to ensure that δ ≤ δ0. Another common “rule-of-thumb” for selecting δ is to select the smaller of δ0 and one-half the SD of the response, because differences of less than that magnitude generally are not considered clinically relevant.[9] However, in reality, there is no “one-size-fits-all” approach to selecting δ because of the inherent subjectivity of clinical judgment.
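Under the conventions just described, the candidate margins reduce to simple arithmetic. The helper names and the worked values below are hypothetical, continuing the example of a δ0 of 0.175 derived from a 0.30 point estimate:

```python
def margin_half_delta0(delta0):
    """Most common rule: halve delta_0 so that at least half of the active
    control's smallest reliable effect is preserved."""
    return delta0 / 2

def margin_half_point_estimate(point_estimate, delta0):
    """Less conservative alternative: halve the point estimate of the
    active-versus-placebo difference, capped at delta_0 so delta <= delta_0."""
    return min(point_estimate / 2, delta0)

# With delta_0 = 0.175 and a point estimate of 0.30, the first rule gives
# delta = 0.0875 and the second gives delta = 0.15.
delta_conservative = margin_half_delta0(0.175)
delta_liberal = margin_half_point_estimate(0.30, 0.175)
```

Either candidate would then be weighed against clinical judgment about disease severity and tolerability before a final δ is fixed in the protocol.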

Interpreting Trial Results

Statistical Analysis

The statistical analysis for a noninferiority trial typically proceeds by constructing a 2-sided 95% CI of the difference between the effects of the test treatment and the active control (as given by Equation (3) or (4)). If a smaller effect indicates better effectiveness, the upper limit of the CI is compared with the a priori selected noninferiority margin (Fig 4). When this upper limit is less than δ (Fig 4, Case A), the null hypothesis given by Equation (4) is rejected and it is concluded that the new treatment is noninferior to the active control. If instead the upper limit of the CI is ≥δ (Fig 4, Cases B and C), there is insufficient evidence to reject the null hypothesis, and noninferiority of the new treatment to the active control is not demonstrated. Figure 4 shows different scenarios when a smaller effect indicates better effectiveness and Equation (4) is considered. Similar interpretations can be made when a larger effect indicates better effectiveness and Equation (3) is considered, by using the lower limit of the CI and comparing it with −δ. For example, when a larger effect indicates better effectiveness and the lower limit of the CI is >−δ, the null hypothesis given by Equation (3) is rejected and noninferiority is concluded.
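The decision rule for the smaller-is-better case can be summarized as a comparison of the upper CI limit against δ and δ0 (a sketch of the logic behind Figure 4; the function name, case labels, and example numbers are illustrative):

```python
def interpret_noninferiority(ci_upper, delta, delta0):
    """Classify a noninferiority result when a smaller effect indicates
    better effectiveness: compare the upper bound of the 95% CI for the
    difference (test minus active control) with delta and delta_0."""
    if ci_upper < delta:
        return "Case A: noninferior to the active control"
    if ci_upper < delta0:
        return "Case B: noninferiority not shown, but indirect evidence of effect"
    return "Case C: no evidence that the new treatment has any effect"

# With delta = 0.10 and delta_0 = 0.175:
interpret_noninferiority(0.08, 0.10, 0.175)   # Case A
interpret_noninferiority(0.14, 0.10, 0.175)   # Case B
interpret_noninferiority(0.20, 0.10, 0.175)   # Case C
```

For the larger-is-better case of Equation (3), the same logic applies with the lower CI limit compared against −δ and −δ0.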

Figure 4.

Three different noninferiority trial results (Case A, B, and C) when a smaller effect indicates better effectiveness and the subsequent study interpretations[7]; δ, noninferiority margin; δ0, smallest reliable effect size.

Although noninferiority of the new treatment to the active control is not demonstrated in either Case B or Case C of Figure 4, the interpretation of the results is somewhat different. In Case B, the upper 95% CI of the difference is >δ but <δ0; so even though the new treatment has not been demonstrated to be noninferior to the active control, its loss of effectiveness is less than the smallest reliable effect of the active control (ie, δ0), indicating that the new treatment is effective. In other words, there is statistically significant (indirect) evidence that the new treatment does have some effect (ie, compared with a placebo or negative control), even if it has not been demonstrated to be as effective as the active control within an acceptable margin. Thus, if the upper limit of the CI is only slightly >δ but still <δ0, some regulatory bodies are willing to consider a new treatment effective and approve the product,[7] especially if it has other important attributes such as improved dosing convenience, compliance, or a better safety profile. The objective of the noninferiority trial also may be only to demonstrate that the new treatment has some effect through “indirect superiority” to a placebo.[6, 7] In that case, this objective should be prospectively stated in the protocol, and only δ0 would need to be specified. In contrast, in Case C, the upper 95% CI of the difference is greater than both δ and δ0; thus, not only is there insufficient evidence to demonstrate that the new treatment is noninferior to the active control, but there also is insufficient evidence that the new treatment has any effect at all.

In the case in which the null hypothesis given by Equation (3) or (4) is rejected and it is concluded that the new treatment is noninferior to the active control, a second hypothesis test of superiority of the new treatment to the active control can be conducted.[7, 15, 16] The alternative hypothesis for testing superiority is μT − μA > 0 if a larger effect indicates better effectiveness, or μT − μA < 0 if a smaller effect indicates better effectiveness. The lower or upper limit of the 95% CI then is compared with zero, depending on whether a larger or a smaller effect is of interest, respectively. The conclusion that the new treatment is superior to the active control is made in the same fashion as described previously for the noninferiority hypotheses. This testing can be conducted posthoc with no inflation of the Type I error rate because the null hypotheses of the 2 tests (eg, μT − μA ≥ δ with δ > 0, and μT − μA ≥ 0, when a smaller effect indicates better effectiveness) form a nested or closed set.[15, 17] Although the simultaneous testing of noninferiority and superiority in an active control trial is a theoretically sound approach from a statistical perspective, the implementation of this approach in practice is uncommon. Because of the subjective nature of a noninferiority trial and the noninferiority margin, if a study is designed as an active control superiority trial, one cannot test for noninferiority posthoc after failing to demonstrate superiority.[7, 12] In other words, the noninferiority test and noninferiority margin must be prespecified in the protocol.

Intent-To-Treat Principle

For all randomized controlled trials, the intent-to-treat (ITT) principle may be the most fundamental principle ensuring unbiased hypothesis testing.[18] In superiority trials, this principle can lead to a loss of power when there is substantial nonadherence to the assigned treatment, but it also prevents false-positive conclusions and preserves the Type I error rate. In noninferiority trials, however, this principle may provide an advantage to the test drug, because the effect of nonadherence often dilutes the difference between the test drug and the active control, making them look more similar; it thus can increase the Type I error rate in such cases. Although the assumption that the active control is effective alters the impact of nonadherence, it is especially important to design and conduct noninferiority trials in a manner that decreases the occurrence of nonadherence.

On the other hand, the per-protocol (PP) analysis includes only patients who completed the full course of assigned treatment and had no major protocol violations, and it may be preferable in noninferiority trials.[3] Therefore, in a noninferiority trial, the ITT and PP analyses are of equal importance, and a robust study interpretation requires that both lead to the same conclusion.[19]
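The dilution effect of nonadherence on an ITT comparison can be seen in a deterministic example. The crossover fractions and failure rates below are invented for illustration, using a simple two-way crossover model rather than a full adherence model:

```python
def itt_observed_rates(p_test, p_active, cross_test, cross_active):
    """Expected ITT failure rates when a fraction of each arm actually
    receives the other arm's treatment (simple two-way crossover model)."""
    obs_test = (1 - cross_test) * p_test + cross_test * p_active
    obs_active = (1 - cross_active) * p_active + cross_active * p_test
    return obs_test, obs_active

# A truly inferior test drug (30% failure rate versus 15%) with 20%
# crossover in each arm: the observed ITT difference shrinks from 15
# percentage points to 9, making the arms look more similar than they
# really are and favoring a (false) noninferiority conclusion.
t, a = itt_observed_rates(0.30, 0.15, 0.20, 0.20)
```

A PP analysis of the adherent patients would recover the full 15-point difference in expectation, which is why concordant ITT and PP results strengthen the interpretation of a noninferiority trial.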

Challenges to Noninferiority Trials in Veterinary Medicine

The validity and interpretability of a noninferiority trial depends largely on being able to make assumptions such as assay sensitivity, constancy of effect, and acceptability of the noninferiority margin. These assumptions cannot be verified in the trial itself, but can only be inferred based on historical evidence from previous studies and how closely the study design features adhered to previous studies. Designing appropriate veterinary noninferiority trials is especially challenging because of several factors, including (1) the lack of historical placebo-controlled studies; (2) changing study designs or conditions, such as patient population or standard of care; and (3) subjective and highly variable clinical endpoints.

A major constraint in designing noninferiority trials for veterinary medicine is that potential active controls may be considered effective but have not been directly compared with placebo in 1 or more well-designed, randomized clinical trials. The active control may be the standard of care, it may be a drug approved in humans, or it may have been demonstrated to have an effect in the target species in a small laboratory or clinical study of a closely related clinical condition. The problem with using such active controls is not just the question of whether or not they are effective; it is that, without a well-designed, placebo-controlled, randomized clinical trial, the magnitude of benefit is unknown and therefore a noninferiority margin cannot be determined a priori. As discussed above, the noninferiority margin cannot be greater than the smallest reliable effect size, δ0, to ensure that the new treatment has at least some effect, but δ0 cannot be determined without at least 1 well-designed, placebo-controlled study of the active control drug. An alternative source of information on δ0 may be good historical knowledge of the clinical response in the absence of treatment. This is rarely available, however, because current clinical practice would not have allowed a serious disease or condition to be left completely untreated. Without the incorporation of the smallest reliable effect size into the selection of the margin, it is quite possible to conclude that an ineffective new treatment is noninferior to an active control, regardless of the efficacy of the active control. Obviously, if the active control is not actually effective (which is quite possible in the absence of a well-designed, placebo-controlled study), establishing noninferiority of the new treatment is almost a certainty.

Even in cases in which an active control has demonstrated effectiveness against placebo in a randomized clinical trial, departures from the original design may hinder the determination of a robust effect size (Figs 1 and 2) and undermine the assumption of assay sensitivity. Trial design changes that can have unintended consequences include, but are not limited to, changes in the enrollment criteria (eg, disease severity, age), clinical endpoint (eg, continuous response versus binary treatment failure), and treatment duration. In these cases, it will be difficult to determine the appropriate margin based on historical information, resulting in an overreliance on clinical judgment in its selection. In addition, there is no assurance that the modified design will still maintain assay sensitivity.

Lastly, one of the constraints in veterinary clinical trials is the difficulty of establishing valid and repeatable measures of clinical outcomes. In most cases, the outcome of treatment is subjectively measured and may differ from 1 study to another. The constancy assumption may not hold in the case of subjective and changing endpoints because the magnitude of the effect may change in unexpected ways. Even when the null hypothesis is rejected (Equations (3) or (4)) and noninferiority is concluded based on statistical analysis, conclusively stating that the new treatment is effective can be problematic. For example, if the effect of the active control in the noninferiority trial turns out to be substantially less than its effect in the placebo-controlled study(ies) used to establish the smallest reliable effect size (δ0), doubts arise about both the appropriateness of the a priori selected noninferiority margin and the weight of evidence that the new treatment has any effect at all (Fig 3). The δ0 based on previous trials may have been an overestimate of the true reliable effect size, in which case the selected noninferiority margin δ is too wide and cannot be used to rule out inferiority or lack of any effect at all.

Unfortunately, none of the identified noninferiority trials conducted in veterinary medicine adequately justified the selected noninferiority margin (Table 1), particularly with regard to establishing the smallest reliable effect size (δ0). Some of these studies may have selected the noninferiority margin appropriately but simply failed to report the justification. In general, noninferiority studies used either a difference in success rates or a ratio of a score/scale as the primary endpoint. Noninferiority margins generally were declared a priori, with a margin of 15 percentage points (15%) for rate differences and 0.20 for ratios being most common. In addition, all veterinary noninferiority studies reported in freedom of information summaries by the FDA used rate differences with noninferiority margins of ≤15 percentage points. Analgesics and antibiotics were the most common therapeutic classes studied in veterinary noninferiority trials, and the majority of noninferiority trials were conducted in dogs or cats. Most likely, the ethical concerns of withholding an established treatment in these situations led to the decision to conduct a noninferiority study instead of a placebo-controlled superiority study.

Example

Choosing the Active Control

Consider the design of a clinical trial for a novel NSAID intended to demonstrate that it is effective for the control of postoperative pain associated with soft tissue surgery in dogs. A placebo-controlled trial would markedly decrease the likelihood of concluding that the novel NSAID is effective when it is not, but it is determined that a placebo-controlled superiority trial is not ethical given that there are NSAIDs and other classes of drugs that have been demonstrated to control postoperative pain. The selection of another NSAID as a positive control is justified because NSAIDs differ in their safety rather than qualitative aspects of analgesia. Moreover, selecting another class of drugs such as opioids may introduce confounding issues attributable to known behavioral effects of opioids as an extension of their pharmacology. As an active control, there are 5 NSAIDs (carprofen, firocoxib, deracoxib, meloxicam, and robenacoxib) that have been approved by the EMA, FDA, or both for control of postoperative pain in dogs. Large (>100 dogs), randomized, placebo-controlled clinical trials for control of postoperative pain associated with soft tissue surgery have been conducted for only 2 of these NSAIDs (carprofen and firocoxib).[20, 21] A single placebo-controlled study of meloxicam for postoperative pain after soft tissue surgery was small, with questionable effectiveness demonstrated.[22] The approval of meloxicam for postoperative pain in dogs was based on a small active control study using 2 different positive control drugs (ketoprofen and butorphanol), where it was demonstrated to be superior to butorphanol alone given before surgery and immediately postoperatively.[23, 24] Robenacoxib has been demonstrated to be noninferior to meloxicam,[25, 26] but no placebo-controlled trial has been published. 
And finally, deracoxib has not been approved for the control of pain associated with soft tissue surgery in the United States, and it is limited to the control of pain associated with orthopedic surgery as demonstrated in a placebo-controlled trial.[27] The large placebo-controlled clinical trials of carprofen and firocoxib for postoperative pain report different effectiveness endpoints, a repeatedly measured visual analog scale (VAS) and treatment failure rate based on reaching a threshold pain score of ≥8 using Glasgow Composite Pain Scale (GCPS), respectively. As it is generally desired to have a single, easily interpretable effectiveness endpoint, firocoxib was selected as the active control to allow noninferiority to be established using treatment failure rate. Despite a relatively large number of NSAIDs indicated for control of postoperative pain in dogs, it is evident that the options for conducting a well-designed, noninferiority study for a novel theoretical NSAID for the control of postoperative pain associated with soft tissue surgery are limited.

Determining the Margins

Based on the large placebo-controlled clinical trial of firocoxib in the United States for postoperative pain associated with soft tissue surgery, the treatment failure rates for placebo and firocoxib treatment were 24.2% (31 of 128 dogs) and 6.3% (8 of 126 dogs), respectively.[21] Thus, the point estimate of the absolute value of the difference in treatment failure rates was 17.9 percentage points. A 2-sided 95% CI was calculated to determine the smallest reliable effect size, δ0. The lower bound of the 95% CI for the difference in failure rates between placebo and firocoxib was calculated as 9.3 percentage points. To incorporate clinical judgment into the margin of the difference, it was determined that maintaining at least half of the point estimate of the effect of firocoxib compared with placebo would be sufficient for establishment of noninferiority. Half of 17.9 percentage points results in a clinically relevant margin of 8.9 percentage points. As 8.9% is less than δ0 (ie, 9.3%), 8.9% was set as δ, incorporating both statistical reasoning and clinical judgment into the a priori selected noninferiority margin. With a noninferiority margin of 8.9% and an assumed failure rate of 6.3% for both firocoxib and the novel NSAID, ~120 dogs per treatment group are needed to maintain at least 80% power for the study design.
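The arithmetic above can be reproduced with a short script. This is an illustrative sketch, not the analysis used in the cited trial: it assumes a normal-approximation (Wald) CI for the difference in 2 proportions and a standard normal-approximation sample size formula for noninferiority of 2 proportions, both of which reproduce the figures quoted above.

```python
import math

def wald_ci_diff(x1, n1, x2, n2, z=1.959964):
    """2-sided 95% Wald CI for the difference in proportions p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, d - z * se, d + z * se

# Placebo failures 31/128 vs firocoxib failures 8/126
d, lo, hi = wald_ci_diff(31, 128, 8, 126)
print(f"difference = {d:.1%}, 95% CI lower bound = {lo:.1%}")
# difference ≈ 17.9 points; lower bound ≈ 9.3 points = smallest reliable effect δ0

# Clinical margin: preserve at least half of the 17.9-point effect,
# capped by δ0 so the margin can never exceed the smallest reliable effect
delta = min(d / 2, lo)
print(f"noninferiority margin δ = {delta:.1%}")  # ≈ 8.9%

def n_per_group(p_t, p_a, delta, z_a=1.959964, z_b=0.841621):
    """Approximate per-group n for noninferiority of 2 proportions
    (normal approximation; one-sided α = 0.025, power = 80%)."""
    var = p_t * (1 - p_t) + p_a * (1 - p_a)
    return math.ceil((z_a + z_b) ** 2 * var / delta ** 2)

print(n_per_group(0.063, 0.063, 0.089))  # ≈ 117, consistent with ~120 dogs
```

Exact or continuity-corrected methods would give slightly different bounds and sample sizes; the approximation is shown only because it matches the published numbers.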

Interpreting Study Results

As an illustration, we consider 2 possible hypothetical outcomes of the noninferiority trial designed above, comparing the novel NSAID to the active control firocoxib with 120 dogs in each treatment group. First, consider the outcome where the treatment failure rates are 6.7% (8 of 120 dogs) and 5.0% (6 of 120 dogs) for the novel NSAID and firocoxib, respectively. The point estimate of the treatment difference (novel NSAID minus active control, as given by Equation (4) when a smaller effect indicates better effectiveness) is 1.7 percentage points and the 2-sided 95% CI is −4.3% to 7.6%. As the upper bound of the 95% CI of 7.6% is ≤δ of 8.9% (ie, Case A in Fig 4), we reject the null hypothesis given by Equation (4) and conclude that the novel NSAID is noninferior to the active control firocoxib.

Alternatively, consider the outcome where the treatment failure rates are 10.0% (12 of 120 dogs) and 5.8% (7 of 120 dogs) for the novel NSAID and firocoxib, respectively. The point estimate of the treatment difference is 4.2 percentage points and the 2-sided 95% CI is −2.6% to 11.0%. Because the upper bound of the 95% CI of 11.0% is >δ of 8.9% (ie, Cases B and C in Fig 4), we fail to reject the null hypothesis given by Equation (4) and conclude that there is insufficient evidence that the novel NSAID is noninferior to the active control firocoxib, even though the point estimate is within the noninferiority margin. In addition, because the upper bound of the 95% CI is greater than the smallest reliable effect (ie, δ0) of 9.3%, there is insufficient evidence from the trial that the novel NSAID has any effect on the treatment failure rate at all.
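Both hypothetical outcomes, and the comparison of the upper CI bound against δ and δ0, can be checked with a short Wald-CI sketch (again an illustrative normal approximation, not a prescribed analysis method):

```python
import math

DELTA = 0.089    # a priori noninferiority margin (8.9 percentage points)
DELTA0 = 0.093   # smallest reliable effect of firocoxib vs placebo (9.3 points)

def wald_ci_diff(x_t, n_t, x_a, n_a, z=1.959964):
    """2-sided 95% Wald CI for the failure-rate difference (test minus active)."""
    p_t, p_a = x_t / n_t, x_a / n_a
    d = p_t - p_a
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_a * (1 - p_a) / n_a)
    return d, d - z * se, d + z * se

# Outcome 1: 8/120 vs 6/120 failures; Outcome 2: 12/120 vs 7/120 failures
for label, x_t, x_a in [("Outcome 1", 8, 6), ("Outcome 2", 12, 7)]:
    d, lo, hi = wald_ci_diff(x_t, 120, x_a, 120)
    noninferior = hi <= DELTA    # reject H0 of Equation (4)?
    any_effect = hi <= DELTA0    # upper bound below δ0 rules out "no effect at all"
    print(f"{label}: diff = {d:.1%}, 95% CI = ({lo:.1%}, {hi:.1%}), "
          f"noninferior = {noninferior}, effect ruled in = {any_effect}")
```

For Outcome 1 the upper bound (≈7.6%) falls below both thresholds, whereas for Outcome 2 the upper bound (≈11.0%) exceeds both, matching the conclusions drawn in the text.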

Compare the study inferences described above for an appropriately designed and interpreted noninferiority trial to a study in which a treatment is compared to an active control as a superiority trial, with the hypotheses set up in the familiar fashion given by Equation (1). When the null hypothesis is not rejected, the appropriate conclusion in a superiority trial design is that there was insufficient evidence to demonstrate that the treatments were different. However, all too often in veterinary medicine it is concluded (incorrectly) that the new treatment is effective and equivalent to the active control despite insufficient evidence to support the claim. As illustrated in the second hypothetical NSAID outcome above, there was insufficient evidence that the novel NSAID had any effect. If the study were conducted exactly the same way but with the hypotheses set up as in Equation (1), the study would have been an active control superiority trial, and the null hypothesis would fail to be rejected at the α = 5% significance level because the 95% CI included zero. It would be incorrect in the case of the superiority trial to conclude that the novel NSAID is as effective as firocoxib. As is often quoted, “Absence of evidence (of a difference) is not evidence of absence.”[28] When the correct conclusion from the superiority trial is made, it does not contradict the appropriately interpreted results of the noninferiority trial. The effectiveness of a new treatment cannot be demonstrated in a superiority trial (active or placebo-controlled) unless the null hypothesis is rejected.

Alternatives to Noninferiority Studies

Given the limitations of noninferiority trials, particularly in the absence of well-designed, contemporary, placebo-controlled trials of an active treatment, alternative study designs may need to be considered. One option that is sometimes offered is a 3-treatment group noninferiority trial that includes a placebo control, an active control, and the new test treatment. Although this option addresses all of the limitations of the noninferiority trial by allowing for demonstration of assay sensitivity within the trial (a substantial improvement to the study design), it is only viable if the reason for conducting the trial is to determine the comparative safety and effectiveness of 2 products. Usually, a noninferiority trial is conducted because of ethical concerns of administering a placebo in a study when an established effective treatment exists. Thus, a 3-treatment noninferiority trial does not address the ethical concern of administering a placebo, the very reason the noninferiority design was likely pursued in the first place.

Another possible option involving untreated patients is to conduct a placebo-controlled superiority trial with early escape and administration of a rescue medication upon treatment failure. Because many treatment failures are likely to occur in the placebo group, missing data would be a major issue for the completed trial if any endpoint other than treatment failure is used to demonstrate effectiveness, severely limiting the clinical endpoints that can be used in the trial design. The other problem with this design is that it may not sufficiently address the ethical concerns of administering a placebo to patients, depending on the disease or condition being treated (eg, if death is a major risk of remaining untreated). A related but not mutually exclusive option is to identify a subpopulation of patients in which it is ethically appropriate to conduct a placebo-controlled trial. Possible subpopulations may include patients that have failed to respond to the active control or cannot tolerate its adverse effects. A limitation of this study design is that only a subpopulation, not the entire target population of the new treatment, is studied, which decreases the inferential value of the trial.

When the new treatment and the established treatment have distinct mechanisms of action, a viable alternative to a noninferiority trial is an add-on superiority trial. In this study design, all patients receive the established treatment and are randomized to receive, in addition, either placebo or the new treatment. Thus, there is no ethical concern of withholding the established standard of care. In addition, the statistical analysis of an add-on superiority trial is the same as that for a placebo-controlled trial with the hypotheses set up in the familiar fashion (ie, Equation (1)), greatly simplifying the statistical considerations of the trial design and providing inherent assay sensitivity in the event that the new add-on treatment demonstrates a benefit over the established treatment alone. However, the major limitation of an add-on trial design is that the effects of the new treatment are not studied independent of the established standard of care. Therefore, regardless of whether or not the new add-on treatment demonstrates a benefit within the trial, it remains unknown whether the new treatment would be effective in the absence of the established treatment. An additional limitation of the add-on study design arises if the established treatment has a large effect, because the dynamic range of the response to the add-on new treatment may be limited, potentially decreasing the power of the trial design compared with having an untreated placebo control group.

Another option is to demonstrate superiority of the new treatment compared with the active control. An active control superiority study is only feasible when the new treatment is expected to be substantially more effective than the active control; otherwise, the large study size necessary for adequate power often is not feasible. A final alternative to a noninferiority trial is to conduct a single-treatment-group, historical-control superiority trial. As with a noninferiority trial, good historical knowledge of the disease response in the absence of treatment is needed. However, knowledge of the disease response with the current standard of care is not needed, decreasing the prior study information needed to design the trial. Because there are no concurrent controls within the study, the assay sensitivity of such a study design is not known. Thus, a historical control trial still suffers the same major design limitation as a noninferiority trial and provides no direct information on the comparative effectiveness and safety of the 2 treatments.

Ethical Considerations Compared to Placebo Control

The ethics of using placebo controls have been debated elsewhere,[29] and such arguments have been used to favor an active control noninferiority trial, when available, so as to do no harm in the conduct of the trial. However, the effect of an investigational drug is often most fully understood in comparison with a negative control.[12] Some therapies that are known to be effective are no better than placebo in particular trials because of variable responses to drugs in particular populations, unpredictable and small effects, and high rates of spontaneous improvement in patients.[30, 31] In these instances, without a placebo group to ensure validity, the finding that there is no difference between the investigational and standard treatments can be misleading. All too often it is assumed in these debates that the quality of the evidence from a noninferiority trial is the same as that from a placebo-controlled trial, such that all that is being debated is the ethics of not treating individual patients within the trial itself. In reality, however, noninferiority trials have a large number of caveats and uncertainties associated with them that result in a lower quality of evidence of effectiveness compared with placebo-controlled trials. Therefore, the complete ethical consideration involves weighing the ethics of not treating individual patients within the placebo-controlled trial itself against the ethics of the increased risk that a noninferiority trial incorrectly concludes an investigational treatment is effective, with a much larger patient population subsequently treated with this ineffective treatment.

Summary

Noninferiority trials usually are pursued because of ethical concerns of administering a placebo to veterinary patients when an established effective treatment exists. Selecting the appropriate active control and an a priori noninferiority margin between the new treatment and the control are unique trial design aspects compared with “classical” placebo-controlled superiority studies. However, without good historical knowledge of the disease response with and without treatment, proper design and interpretation of a noninferiority trial are not possible, and a trial of this design should not be conducted. Despite the appeal of noninferiority trials because they eliminate the concerns of administering a placebo to animals in a superiority trial, there are limitations and questions that arise with their use. Alternative trial designs do exist when ethical concerns prevent administration of a placebo to animals and good historical knowledge of the disease response is lacking, but these designs have their own limitations and must be considered carefully.

Acknowledgments

Disclaimer: This article reflects the views of the authors and should not be construed to represent FDA's views or policies.

Conflicts of interest: Authors disclose no conflict of interest.