### ABSTRACT

- Top of page
- ABSTRACT
- Introduction
- What Are the Analysis Pitfalls with PRO Data in Clinical Trials and How Can They Be Avoided?
- Additional Ways to Handle Missing Data
- Imputation Methods
- How Should Null Results Be Interpreted?
- Conclusions
- Acknowledgments
- References

This article is part of a series of manuscripts dealing with the incorporation of patient-reported outcomes (PROs) into clinical trials. The issues dealt with in this manuscript concern the common pitfalls to avoid in statistical analysis and interpretation of PROs. Specifically, the questions addressed by this manuscript involve the analysis pitfalls with PRO data in clinical trials and how they can be avoided (e.g., missing data, multiplicity, and null results). The manuscript provides key literature for existing resources and proposes new guidelines.

### Introduction

This article deals with issues of statistical analysis and interpretation of patient-reported outcome (PRO) data. The primary focus and context relate to supporting a labeling or advertising claim of a PRO benefit for a new or approved pharmaceutical product. The issues we discuss are not unique to pharmaceutical and regulatory applications, so the information may be generalizable to other clinical trials of other medical interventions involving PRO end points. For this article, we are assuming that the PRO is an important effectiveness end point in the study and that the intent of the research program is to achieve a labeling or promotional claim. We are also assuming that the PROs were selected based on a strong rationale, are credible, are appropriate, and have evidence supporting systematic development and psychometric qualities in the particular study population [2]. Other articles in this series focus on best practices for PRO instrument development and psychometric evaluation [3–9].

In addition, we believe that achieving a PRO claim requires the a priori specification of a statistical analysis plan. We do not endorse basing a PRO claim on a post hoc analysis. The statistical analysis plan should be prescriptive and restrictive in terms of the analysis undertaken, and it should detail the methods for handling missing data, multiplicity, and other relevant issues associated with the PRO data analysis. Any post hoc analysis that produces supplementary results should be treated as an interesting finding to be confirmed in future studies.

This article focuses on four major areas related to statistical analysis of PRO data. In the first section of the article, we deal with commonly seen pitfalls including missing data, multiplicity of end points, blinding, choosing the correct end point for analysis, and the role of sensitivity analyses. In the second section, we discuss what to do when results turn out to be nonsignificant.

### Additional Ways to Handle Missing Data

In this section, we cover methods for handling missing data. Specifically, we cover the intention to treat (ITT) principle, the use of summary statistics (e.g., composite end points, indicator variables, and area under the curve [AUC] analysis), and imputation methods.

#### Intention to Treat Principle

Missing data may be handled analytically by the ITT principle or by the modified ITT principle in which at least one post-treatment measurement is required. Intention to treat reduces the amount of missing data or removes the issue entirely by labeling an individual as a success or failure once they have initiated treatment, regardless of whether they completed the trial.

In the following example, we translate the ultimate PRO outcome into a binary (dichotomous) outcome. In a trial of erythropoietin alpha (EPO) [29] for improving hemoglobin [30], patients were scheduled to receive EPO for 16 weeks. The key end point was whether or not a patient achieved a specified clinically important increase in overall quality of life (QOL) from baseline as measured by a simple linear analog scale. All patients were evaluable for the PRO end point if they initiated treatment. If they failed to provide data for a given week, they were declared a treatment failure for that week. Unless a clinically meaningful increase in QOL was observed after baseline, the patient was declared a treatment failure. Complete data were obtained for this end point even though some patients had intermittent missing data.
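The responder rule described above can be sketched in a few lines. This is a minimal illustration, not the analysis used in the cited trial: the patient records, the 10-point threshold, and the function name `is_responder` are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def is_responder(baseline: float,
                 weekly_scores: Sequence[Optional[float]],
                 min_increase: float) -> bool:
    """Return True if any observed post-baseline score exceeds baseline
    by at least `min_increase`. A missing week (None) can never count
    as a success, so it is treated as a failure for that week."""
    for score in weekly_scores:
        if score is None:
            continue  # missing week: failure for that week
        if score - baseline >= min_increase:
            return True
    return False


# Hypothetical data: patient id -> (baseline QOL, weekly QOL scores)
patients: Dict[str, Tuple[float, List[Optional[float]]]] = {
    "A": (50.0, [55.0, None, 62.0, None]),   # intermittent missing data
    "B": (60.0, [None, None, None, None]),   # never reported after baseline
    "C": (40.0, [42.0, 45.0, None, 44.0]),   # observed, but below threshold
}
THRESHOLD = 10.0  # assumed clinically important increase

outcomes = {pid: is_responder(base, scores, THRESHOLD)
            for pid, (base, scores) in patients.items()}
response_rate = sum(outcomes.values()) / len(outcomes)
print(outcomes, response_rate)
```

Every patient who initiated treatment receives a success or failure label, so the binary end point is complete even though the weekly data are not.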

#### Summary Statistics

Missing data can be handled by scientifically credible summary statistics. Summary statistics provide a method to combine available data into an overall end point. In anorexia trials, for instance, the maximum patient-reported improvement in appetite over a study period can be used as an end point to remove missing data issues. As another example, the American College of Rheumatology (ACR) response criteria constitute a composite end point developed as a primary end point for clinical trials comparing rheumatoid arthritis treatments [31,32]. The response criteria combine clinical and PRO measures, including number of swollen and tender joints, patient-rated disease activity, clinician-rated disease activity, patient-rated pain, Health Assessment Questionnaire (HAQ), and C-reactive protein (CRP) values.

The scientific validity of any summary statistic must be well established, and the statistic must be identified a priori. It should also have evidence supporting reliability and validity in the target indication. A composite end point has to be validated by some reasonable scientific approach and not just concocted de novo. Going through a rigorous procedure including appropriate professional input and scientific scrutiny is essential to the development of any summary statistic. Summary statistics in general may be difficult to interpret because they are essentially artificial indices and typically lack a natural notion of what constitutes a meaningful value.

An indicator variable is another type of summary statistic indicating whether a specific benchmark has been achieved (such as an a priori specified clinically meaningful benefit). Indicator variables produce complete data and binomial variables for which a standard statistical methodology such as Fisher's exact test and logistic regression readily apply. This is a preferred approach where applicable because it requires distinct a priori scientific justification for the handling of missing data and produces an appealing analytical approach. Unfortunately, the advantages of the binary approach come at the possible expense of loss of information.

A counterargument to this dichotomization into success and failure is that it loses the statistical power of continuous end points. The key issue is whether the information gleaned from the PRO is sufficiently precise to be used as a continuous variable or whether the perceived level of accuracy is merely artificial numerical precision. For example, the use of linear analog response scale data measured to the nearest hundredth of a millimeter is inappropriate because people are barely accurate to the nearest millimeter in responding to such items.

Area under the curve summary statistics are produced by combining longitudinal data into a simple, single numerical entity. At its simplest level the AUC represents the average value of the PRO over time for the entire treatment period. Some approaches use various distributional assumptions such as exponential decay between time points or the trapezoidal rule in estimating values between adjacent assessments. The justification for alternative assumptions needs to be based on scientific principles a priori rather than data-based post hoc determination.

Sloan [33] and Huntington and Dueck [14] provide examples of the technique in general. Sensitivity analysis is indicated if the distributional assumptions are questionable. Typically, the assumptions do not strongly affect the treatment comparison and so a small number of alternative assumptions can be tested before the consistency of the results becomes obvious. Missing data are handled in AUC construction in many ways, but typically by nearest-neighbor imputation or simply by constructing the AUC curve using the available data and prorating via the proportion of reporting periods.
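As a rough illustration of AUC construction from available data, the sketch below applies the trapezoidal rule to the observed assessments only and then divides by the observed time span to obtain a time-averaged score. The function name, the assessment schedule, and the scores are hypothetical, and dividing by the observed span is only one possible reading of "prorating"; other conventions exist.

```python
from typing import Optional, Sequence, Tuple


def trapezoid_auc(times: Sequence[float],
                  scores: Sequence[Optional[float]]
                  ) -> Optional[Tuple[float, float]]:
    """Trapezoidal AUC over observed assessments only.

    Missing scores (None) are skipped, so the trapezoid simply spans the
    gap between the neighboring observed assessments. Returns the pair
    (auc, time_averaged_score), or None if fewer than two assessments
    were observed. `times` is assumed to be strictly increasing."""
    observed = [(t, s) for t, s in zip(times, scores) if s is not None]
    if len(observed) < 2:
        return None
    auc = 0.0
    for (t0, s0), (t1, s1) in zip(observed, observed[1:]):
        auc += (t1 - t0) * (s0 + s1) / 2.0
    span = observed[-1][0] - observed[0][0]
    return auc, auc / span


# Hypothetical patient: assessments at weeks 0, 4, 8, 12; week 8 missing
weeks = [0, 4, 8, 12]
qol = [50.0, 60.0, None, 70.0]
auc, average = trapezoid_auc(weeks, qol)
print(auc, average)
```

Swapping the trapezoid for another interpolation rule changes only the loop body, which makes it straightforward to run the small number of alternative assumptions that a sensitivity analysis calls for.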

### Imputation Methods

In this section, we discuss single and multiple imputation methods. Single imputation is substituting a single value for each missing value in a data set. On the other hand, multiple imputations involve substituting several values for each missing value; in essence, creating multiple data sets. The statistic of interest is computed for each imputed data set and a final estimate is computed by combining estimates across the multiple data sets with readily available formulas in the multiple imputation literature.
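One widely used set of combining formulas is Rubin's rules. The source does not name a specific method, so treating Rubin's rules as the combining step is an assumption here, and the estimates and variances below are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_estimates(estimates: Sequence[float],
                   variances: Sequence[float]) -> Tuple[float, float]:
    """Combine per-data-set results using Rubin's rules.

    estimates: the statistic of interest from each imputed data set
    variances: the squared standard error of that statistic in each set
    Returns (pooled_estimate, total_variance)."""
    m = len(estimates)
    pooled = mean(estimates)        # average of the m estimates
    within = mean(variances)        # within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical treatment-effect estimates from m = 3 imputed data sets
pooled, total_variance = pool_estimates([2.0, 2.4, 2.2], [0.25, 0.25, 0.25])
pooled_se = sqrt(total_variance)
print(pooled, total_variance, pooled_se)
```

The between-imputation term is what carries the extra uncertainty due to imputation; a single imputation has no equivalent term.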

There are many methods for selecting the values to impute whether using single or multiple imputation. Despite the myriad imputation techniques available, there is no uniformly accepted approach for imputation. All imputation methods are heavily reliant upon the underlying assumptions of the analytical technique. Results of any technique hence can be as much a result of the underlying assumptions as the true empiric result. Regardless of the technique, all methods involve statistically “guessing” what a person's PRO result would have been had it been observed. As such, imputation is “fudging the numbers” at its core. It is important to keep this truism in mind as alternative methods for imputation are considered to minimize the artificiality of the final results.

Sloan [33] provides examples of how different methods of handling missing data can provide startlingly different PRO average profiles over time. The most important issue, however, is not the individual average profile but the relative difference between treatment regimens that is affected by such missing data. More often than not (though definitely not in all cases), the missing data will affect both treatment average profiles similarly, keeping the comparative analytical results the same.

We recommend a differentiation between patients who do not provide PRO data because of death and those who do not provide data because of other reasons (e.g., missing pages in booklet, dropped study because of toxicity, etc.). Investigators should specify ahead of time how they will handle data for patients who die during the study; this permits them to examine this group of patients separately from those who were potentially available to provide data at any given point but failed to do so. Imputation of zeroes (or the lowest possible score for a PRO) for those who have died, for example, is a technique in applying the intention to treat principle to patient PRO data but may result in significant bias in the presence of large amounts of missing form data [34]. Diehr and colleagues provide other approaches to incorporating mortality into PRO end points [1,35].

We do not endorse any particular method of imputation as being applicable to all situations; neither do we believe that any given method is prohibited in a specific circumstance. Each study should have an appropriate rationale that considers the expected pattern of missing data and imputation method chosen.

#### Single Imputation

The most common missing data approach in labeling claims is to use last-value carried forward in which the last observed value is substituted in for all subsequent missing data values. This method has been criticized for various reasons [24]. Alternative approaches involve imputing the average-value carried forward, the minimum or maximum value, or even imputing a zero value for patients who have died. This last approach has not seen much use but has the advantage of reflecting the average value for the complete sample that initiated the study. Some have argued that imputing a zero value for death may be inaccurate for a number of reasons [35]. The authors recommend that a single simple imputation method be specified for the primary analysis and justified within the context of the study.
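Last-value carried forward is simple enough to state in a few lines. The sketch below is illustrative only; the function name `locf` and the example scores are hypothetical, and missing values are represented as `None`.

```python
from typing import List, Optional, Sequence


def locf(scores: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Last-value carried forward: replace each missing value (None)
    with the most recent observed value. Values missing before the
    first observation stay missing, because there is nothing to carry."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for score in scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled


# Hypothetical patient with one intermittent gap and a final dropout
completed = locf([50.0, 55.0, None, None, 60.0, None])
print(completed)  # [50.0, 55.0, 55.0, 55.0, 60.0, 60.0]
```

Replacing the carried value with a group average, a minimum, or a maximum gives the alternative single imputations mentioned above, which is how a small set of sensitivity analyses can be built from the same skeleton.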

Sensitivity analyses should include up to three different imputation approaches to verify that the results of the primary analysis are credible. If the results are not comparable, then this indicates the need for more complex missing data modeling and may suggest uncertainty regarding the PRO results.

#### Multiple Imputation

The use of complex statistical models to impute several values for missing data has become the topic of many statistical articles in recent years [21,22]. Whether linear and nonlinear models provide any better scientific “guesses” at missing numbers than common sense and single imputation has yet to be shown. The gains in terms of statistical power likely do not balance the amount of work and the number of strong distributional assumptions that are required. Nevertheless, because of the error inherent in “guessing,” multiple imputation has the benefit over single imputation of a built-in measure of the added variability in estimates computed from imputed data. (In multiple imputation, the estimation process considers the variability of estimates across the imputed data sets.) This benefit again likely does not balance the amount of work and the strong distributional assumptions required, because the added variability due to imputation can be assumed to be small if only a small percentage of the data is imputed and the simpler single imputation method is chosen appropriately.

In essence, if the multiple imputation methods provide a different answer than the simple single imputation methods, investigators may not know whether this is a result of improved precision or difference in technique. Recent literature has demonstrated that the result differs little across approaches [36].

#### Multiplicity

Analyzing multiple end points can be handled post hoc statistically or a priori scientifically. Although dealing with issues in the initial protocol is preferable, it is reasonable to expect that some issues are best addressed by statistical correction. We recommend, however, that investigators still prespecify the use of statistical correction in their statistical analysis plans. Many PROs, particularly those that assess health-related QOL or multiple facets of a disease are, by nature, multidimensional. Other clinical end points (such as tumor response) are also inherently multidimensional but have been worked into acceptable, agreed-upon single end points (e.g., a 50% reduction is a response). For example, trialists do not report average tumor size reduction or describe the characteristics associated with tumor response.

The best way to deal with multiplicity is to define in advance in the protocol or statistical analysis plan the PRO domains that the treatment is expected to affect and the domains that are not expected to be affected. Any given treatment is unlikely to affect any disease condition uniformly across all PRO subdomains. Additionally, in health-related QOL some dimensions may not change with treatment, such as concern about having a disease. These situations are analogous to the expectation that each treatment will produce variable toxicity profiles or disease-free survival, but not overall survival, may be affected. In all such situations, the end point of interest must be decided in advance and based on scientific evidence. We and others [37] recommend selecting a small number of PRO end points as being of primary interest and viewing remaining domain scores as secondary and supportive end points.

Multiple end points can be handled statistically by comparison-wise corrections to the Type I error rates. All these methods require an expanded sample size to account for multiple testing over and above what would be required for a single hypothesis test. The Bonferroni approach is most common; in it, the comparison-wise level of significance is set by dividing the overall experiment-wise Type I error rate by the number of tests involved. For example, if four PRO domains are to be tested, then the overall Type I error rate is held to at most 5% if each test is carried out at the 5%/4 = 1.25% significance level. This approach is considered quite conservative. Modifications of the Bonferroni approach involve specific algorithms of ordered testing, referred to as step-down, step-up, or hierarchical analyses [38,39].

#### Multiple End Points for Label Claims

Repeated or multiple testing of outcomes (or even of the same outcome at different times) will inevitably result, incorrectly, in a statistically significant treatment difference (say, *P*-value < 0.05) when no true difference exists between the treatments. Two forms of multiplicity exist. One form is multiple testing of multiple domains at the same time; a case in point is comparing multiple domains of the EORTC between a pair of treatments at week 12. Another form is multiple testing of a given domain at multiple times; comparing the emotional function domain of the EORTC between a pair of treatments at weeks 3, 6, and 12 illustrates this point.
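To give a sense of scale, consider the simplifying case in which K tests are statistically independent and each is carried out at level α. Independence is an assumption made here for illustration only; PRO domains and repeated assessments are usually correlated, in which case the inflation is smaller but still present.

```latex
\[
\Pr(\text{at least one false positive}) = 1 - (1 - \alpha)^{K},
\qquad
1 - (1 - 0.05)^{5} \approx 0.226 .
\]
```

Under this assumption, testing five domains at the 0.05 level gives roughly a 23% chance of at least one spurious significant result when no true differences exist.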

Addressing the multiplicity of end points and time points is contingent upon study objectives and hypotheses, so no one definitive strategy for addressing multiplicity can be said to be appropriate for all studies. Nonetheless, general guidelines can be offered that are consistent with the specific hypotheses with regard to the domain(s) and the time point(s), both of which should be specified in advance. For a label claim, the number of prespecified domains of primary interest should be limited to no more than five and preferably to no more than three domains [23]. The number of key time points should also be limited and prespecified to testing treatment differences at no more than two time points for the primary analysis.

Suppose the objective is to show whether treatments differ in the mean change score from baseline to week 12 in five particular domains. Three types of alpha or *P*-value adjustments are recommended: 1) Bonferroni; 2) Bonferroni-Holm (Step-Down) Procedure; and 3) Hochberg's (Step-Up) Method. Of the three, Hochberg's Method is generally preferred. The following example illustrates testing for treatment differences in K = 5 domains where the five *P*-values are 0.20, 0.006, 0.011, 0.018, and 0.021.

Bonferroni:

If *P*(i) > α/K, then accept the null; if *P*(i) ≤ α/K, then reject the null. Here α = 0.05 and K = 5.

| | *P*(1) | *P*(2) | *P*(3) | *P*(4) | *P*(5) |
| --- | --- | --- | --- | --- | --- |
| Ordered *P*-value | 0.006 | 0.011 | 0.018 | 0.021 | 0.20 |
| α/K | α/5 = 0.01 | α/5 = 0.01 | α/5 = 0.01 | α/5 = 0.01 | α/5 = 0.01 |
| Decision | Reject | Accept | Accept | Accept | Accept |

“Reject” means reject the null hypothesis that no treatment difference exists and therefore conclude that a treatment difference exists. “Accept” means do not reject the null hypothesis that no treatment difference exists; that is, the data do not provide sufficient evidence to conclude that a treatment difference exists.
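The Bonferroni rule for the five *P*-values in this example can be checked with a short sketch. The function name is hypothetical, and α = 0.05 is taken from the α/5 = 0.01 threshold in the example.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return True (reject the null) for each P-value that is at most
    alpha / K, where K is the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# P-values in their original (unordered) domain order from the example
p_values = [0.20, 0.006, 0.011, 0.018, 0.021]
bonferroni_decisions = bonferroni(p_values)
print(bonferroni_decisions)  # [False, True, False, False, False]
```

Only the domain with *P* = 0.006 falls below the common threshold of 0.01.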

Bonferroni-Holm (Step-Down) Procedure [38]:

- Step-Down: Start with smallest *P*-value.
- If *P*(1) > α/K, then accept all null hypotheses (no treatment effects) and stop.
- If *P*(1) ≤ α/K, then the first null hypothesis [corresponding to *P*(1)] is rejected and then compare *P*(2) with α/(K − 1).
- If *P*(2) > α/(K − 1), then accept all remaining null hypotheses and stop.
- If *P*(2) ≤ α/(K − 1), then the second null hypothesis [corresponding to *P*(2)] is rejected and then compare *P*(3) with α/(K − 2).
- Compare *P*(3) with α/(K − 2) and proceed in like fashion.

| | *P*(1) | *P*(2) | *P*(3) | *P*(4) | *P*(5) |
| --- | --- | --- | --- | --- | --- |
| Ordered *P*-value | 0.006 | 0.011 | 0.018 | 0.021 | 0.20 |
| α/K, α/(K − 1), ..., α | α/5 = 0.01 | α/4 = 0.0125 | α/3 = 0.0167 | α/2 = 0.025 | α/1 = 0.05 |
| Decision | Reject | Reject | Accept | Accept | Accept |
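The step-down procedure can be sketched as follows, using the same five *P*-values and α = 0.05. The function name is hypothetical; the logic follows the steps listed above.

```python
from typing import List, Sequence


def holm(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Bonferroni-Holm step-down procedure.

    Work from the smallest P-value upward, comparing the P-value at
    rank r (0-based) with alpha / (K - r). Stop at the first P-value
    that exceeds its threshold and accept all remaining hypotheses.
    Decisions are returned in the original order of `p_values`."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # ascending
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break  # accept this and all larger P-values
    return reject


p_values = [0.20, 0.006, 0.011, 0.018, 0.021]
holm_decisions = holm(p_values)
print(holm_decisions)  # [False, True, True, False, False]
```

The procedure stops at *P*(3) = 0.018, which exceeds α/3 ≈ 0.0167, so only the two smallest *P*-values lead to rejection.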

Hochberg's (Step-Up) Method [39]

- Step-Up: Start with largest *P*-value.
- If *P*(K) ≤ α, then reject all null hypotheses and stop.
- If not, accept the first null hypothesis [corresponding to *P*(K)] and compare *P*(K − 1) with α/2.
- If *P*(K − 1) ≤ α/2, reject all remaining null hypotheses and stop.
- Otherwise, this second null hypothesis is accepted.
- Compare *P*(K − 2) with α/3 in like fashion.

| | *P*(1) | *P*(2) | *P*(3) | *P*(4) | *P*(5) |
| --- | --- | --- | --- | --- | --- |
| Ordered *P*-value | 0.006 | 0.011 | 0.018 | 0.021 | 0.20 |
| α/K, ..., α/2, α | α/5 = 0.01 | α/4 = 0.0125 | α/3 = 0.0167 | α/2 = 0.025 | α/1 = 0.05 |
| Decision | Reject | Reject | Reject | Reject | Accept |
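The step-up procedure can be sketched in the same way, again with the five example *P*-values and α = 0.05. The function name is hypothetical; the logic follows the steps listed above.

```python
from typing import List, Sequence


def hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Hochberg step-up procedure.

    Work from the largest P-value downward, comparing the P-value at
    step s (1-based) with alpha / s. At the first P-value that meets
    its threshold, reject that hypothesis and every hypothesis with a
    smaller P-value, then stop. Decisions are returned in the original
    order of `p_values`."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i], reverse=True)
    reject = [False] * k
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / step:
            for j in order[step - 1:]:
                reject[j] = True  # this one and all smaller P-values
            break
    return reject


p_values = [0.20, 0.006, 0.011, 0.018, 0.021]
hochberg_decisions = hochberg(p_values)
print(hochberg_decisions)  # [False, True, True, True, True]
```

Because *P*(4) = 0.021 is below α/2 = 0.025, four of the five null hypotheses are rejected, compared with two under the step-down procedure and one under the plain Bonferroni rule.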

If interest centers on the difference in the mean change between treatments across time instead of at a specific time, then a summary measure of the domain scores can be created and a statistical difference between a pair of treatments tested using a multiple comparisons procedure like Hochberg's (Step-Up) Method.

Another way of handling multiple end points statistically is collective multiple testing, such as O'Brien's global test, to produce a single test of hypothesis. Similarly, one might use multivariate hypothesis testing such as Hotelling's T² or multivariate analysis of variance (MANOVA) [40–42]. We caution against using MANOVA, however, because it requires complete data, which is rarely the case in pharmaceutical trials. Within the realm of MANOVA, a hypothesis testing approach known as profile analysis may nevertheless be applicable in testing PRO claims. A profile analysis is a hierarchical approach to multivariate comparisons between treatment groups. It typically proceeds in three steps: 1) a test for overall differences in average values; 2) a test for equality of levels; and 3) a test for differences over time. Collectively, the profile analysis represents a complete picture of the results. Again, a priori hypothesis specification is vital to the appropriate application of these statistical techniques.

#### Blinding

The absence of blinding (masking) of subjects to treatment group assignments is frequently raised as a potential source of bias for PROs, because subjects may believe that the newer treatment is somehow better and therefore may report improved health outcomes, even in situations in which they do not feel better. When possible, we recommend masking subjects to treatment and completely avoiding this possible source of bias in PROs. Sometimes, it is not possible to mask treatment assignments, but often in these situations two (or more) active treatments are evaluated in the clinical trial. In this case, it is important to make sure that the PRO assessments are performed before any clinical assessments or procedures are undertaken which might influence patient perceptions of their health state, and care must be taken to evaluate whether bias may be present. For the assessment of bias, it is important to examine whether the patient reports of symptom or health status improvement are tracking with objective clinical measures and clinician reports of change in clinical status. This bias in patient reporting may be most critical for short-term studies, as it may be difficult for subjects to continue to report improved symptoms and health-related QOL in the absence of any real effect over longer periods of time (i.e., the initial expectations for benefits may wear off with increased experience of no benefit).

#### Choosing the Correct PRO End Point

Much statistical literature has appeared about whether changes from baseline, average values at a particular time point, or percentage of successes should form the basis for analysis to compare treatment regimens. Although different statistical significance levels are possible for each of these three types of analyses, in most settings, analyses should result in consistent interpretations with respect to treatment efficacy. As an example, see the study by Rummans and colleagues of a psychosocial intervention which indicated statistical significance for all three end points [43]. Again, the a priori research hypothesis should be the primary source for the decision as to which end point is the most appropriate in a given situation.

As mentioned earlier, the intention to treat principle can be applied to identify a patient as a “success” or “failure” with respect to treatment outcome. This approach may be desirable if dichotomous outcomes are reasonable or where analysts expect to have many study dropouts (e.g., in metastatic cancer). Otherwise, continuous end points such as average values per treatment (whether at a given time point or change from baseline) may be preferred to keep the sensitivity of a continuous end point. Change from baseline end points typically take precedence over average values at a given time point if researchers believe that treatment efficacy is related to baseline values. For an analysis of covariance model with baseline PRO as the covariate and treatment group as the key explanatory variable, the treatment effect and its standard error will be identical whether the dependent variable is the follow-up PRO or change in PRO from baseline.
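The equivalence of the two analysis of covariance models can be seen with a line of algebra. The notation is introduced here for illustration and is not from the source: Y is the follow-up PRO, X the baseline PRO, and T the treatment indicator, in a linear model fitted by ordinary least squares.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 T_i + \beta_2 X_i + \varepsilon_i \\
Y_i - X_i &= \beta_0 + \beta_1 T_i + (\beta_2 - 1)\, X_i + \varepsilon_i
\end{align*}
```

Subtracting the baseline from both sides changes only the coefficient on X. The treatment coefficient, the residuals, and the design matrix are unchanged, so the estimate of the treatment effect and its standard error are the same in both models. This holds only when baseline is included as a covariate; a change-score analysis without baseline adjustment is a different model.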

#### Sensitivity Analyses

Sensitivity analyses are not always needed or required when conducting statistical analysis of PRO data. In some situations, properly conceived sensitivity analyses can help support and confirm the findings from the primary PRO data analysis. Most frequently, sensitivity analyses are recommended when the level of missing data is high (>20%), when a generally accepted method for imputing missing PRO data is lacking, or when the best method for imputing missing PRO data is uncertain [44].

The planned sensitivity analyses should directly address this uncertainty and the problems associated with missing PRO data. The sensitivity analyses can incorporate different approaches to imputing missing PRO data, such as substituting the worst possible (or observed) score for missing data, multiple imputation, and other methods [23]. Alternatively, different and somewhat more complicated statistical models, such as the family of pattern mixture models or selection models [11,24], can be used to compare the effects of treatment on PRO end points. It is best to complete a small, focused number of sensitivity analyses that are relevant, fit the particular situation, and help address any uncertainty related to the PRO analysis and findings. If these alternative imputation and statistical analysis strategies produce results that are similar to those of the primary PRO data analysis, the findings are further supported and confirmed. If the results of the sensitivity analyses are inconsistent with the primary PRO data analysis, then some question remains about the PRO results, and the investigators may need to provide further explanation for them.

Large clinical trials often include multiple clinical sites in potentially many different countries. Hence, using translated versions of the PRO of interest is common practice. Translations of the PRO measures should be conducted according to standardized, accepted methods with cognitive testing (linguistic validation) of the translated measure in the countries where the measure will be used.

Analysts should test for interaction of treatment with study site, country, or region to provide statistical assurance that the translations of the PRO did not differ across sites by treatment group. “Revalidation” of the PRO in each country or region to examine its psychometric properties is not necessary. In a randomized clinical trial, any differences in the PRO from country to country will incorporate more noise in the measurement and, hence, decrease the ability to detect differences between treatments. There should be no reason that any decreased sensitivity of a PRO that has not been ideally translated would differ between randomized groups any more than a clinical measure would, and the test of interaction will provide this assurance.

### How Should Null Results Be Interpreted?

Patient-reported outcome measures often include several different domains or summary scores. Positive treatment effects may be found in only a subset of these domain scores. The question then arises as to how to interpret so-called “null” results for scores that do not demonstrate statistically significant differences between treatment groups. The problem of how to interpret null results is not unique to PROs. It also arises with composite clinical end points where statistical power is often inadequate to show a statistically significant improvement in each individual component of a composite end point. Another example concerns different bacterial or viral subtypes in a clinical efficacy trial of a composite vaccine, where the end point will typically be infections caused by any of the types in the vaccine. The sample size is often insufficient to expect statistical significance for each subtype separately.

To approach the problem of interpreting null results, investigators should begin by appropriately prespecifying the hypotheses to be tested, the order in which they are to be tested, and any multiplicity adjustments. The choice of method and the ordering of the hypotheses to be tested are determined by the study objectives and by the power for different hypothesis tests. With the proper prespecified statistical hypothesis-testing plan, the interpretation of null results is greatly clarified. Hypothesis-testing plans are designed to preserve the overall Type I (alpha) error while providing adequate power for meaningful tests of a limited set of multiple hypotheses.

When data from a clinical trial are analyzed, one should claim only what one can demonstrate with convincing supportive data. Null results are reported in the clinical study report and should also be addressed in journal publications. Whether they need to be mentioned in drug labeling depends on their importance for interpreting the positive results that have been shown. Null results that call into question the validity of positive results (e.g., those demonstrating an actual negative effect of the treatment) would need to be interpreted differently from secondary end points that simply failed to achieve statistical significance in the particular order of hypothesis-testing used in the trial. For example, clinical trials of new treatments for rheumatoid arthritis often find that new treatments demonstrate significant improvements in the SF-36 physical component but not, as expected, in the mental component. The labels for recently approved rheumatoid arthritis treatments (e.g., Remicade, Enbrel, Humira) include statements on the treatment effects on both the physical summary and mental health summary measures.

In general, ensuring fair and complete reporting of PRO end points based on a clinical development program for a new medication is imperative. The focus should be primarily on prespecified PRO end points and those end points that reach statistical and clinical significance criteria. The sample sizes should allow for sufficient power to detect differences if these differences actually exist between the study treatments. In addition, investigators must provide sufficient evidence and rationale supporting the selection of PRO measures for the clinical trials, because measures with poor psychometric characteristics that do not adequately cover the relevant PRO domains are unlikely to detect treatment-related differences. Nevertheless, even with psychometrically sound measures, a priori specification of primary PRO end points, and well-designed clinical trials, unexpected patterns of findings may emerge. In these situations, investigators should report all the prespecified PRO end points, whether or not they support the treatment.

The problem of how to interpret null results is not unique to PRO measures. The approach to interpretation is greatly clarified by clear specification of the statistical hypothesis-testing and multiplicity-adjustment framework. Null results that call into question the validity of positive results need to be interpreted differently from those that simply represent hypotheses for which statistical significance was not achieved in the particular testing plan used in the trial.

A specific example of this, in a nonlabeling setting, involved a psychosocial intervention designed to impact overall patient health-related QOL [43]. Although other domains of QOL were included, the primary testing and analysis was carried out on overall QOL because it was the treatment target. The results indicated that overall QOL was indeed impacted by the intervention, although none of the other subdomains of QOL changed significantly.

In a similar fashion, if different aspects of fatigue were measured for a labeling claim, it would be the sponsor's responsibility to define, a priori, which aspects of fatigue would be likely to be impacted. Ultimately, whether one end point is selected as primary, or multiple coprimaries are selected, it must be supportable by scientific argument a priori. Post hoc multiple testing should not be allowed under any circumstances to dredge the data for potential labeling claims.

### Conclusions

Statistical analysis and interpretation related to results based on PROs can support a labeling claim with the same scientific integrity that is achievable for other end points, as long as the design elements necessary for credibility delineated in earlier manuscripts in this series are incorporated into the clinical trials. PRO data should be handled and viewed like any other effectiveness data in clinical trials.

The statistical analysis plan must be clear and consistent in justifying the various assumptions and processes used. It is critical to specify a priori what primary end point(s) will form the basis of the statistical analysis of the claim. Particular attention needs to be paid to the handling of missing data, the multiplicity of end points, and the longitudinal data structure. Methods for dealing with many of these analytical issues now exist and guidelines for their appropriate use are available. The FDA guidance document appropriately indicated that, methodological advances aside, there is need for further exploratory and confirmatory research in some areas. A body of evidence is accumulating, as articulated in this manuscript, that will continue to provide exemplary applications for statistical analysis and interpretation of PRO assessment in clinical trials.