### Statistical Methods for Cost-Effectiveness Evaluation

*Joint comparison of costs and effects and assessment of sampling uncertainty.* A joint comparison of costs and effects using the incremental cost-effectiveness ratio (ICER) or the incremental net monetary (health) benefit (INB) is a useful decision tool for determining whether a new therapy offers good value relative to its alternative. This tool is particularly important when there is a trade-off between costs and effects, that is, when one therapy is significantly more effective but also more costly than the other. If there is no trade-off between costs and effects, that is, when one therapy is significantly more effective and less costly than the other, this decision tool may not be necessary because that therapy is unambiguously dominant over its alternative. A third possibility occurs when the two treatments have the same effect. In this case, some authors have interpreted textbooks and guidelines on health economic evaluations to suggest that a cost-minimization approach is sufficient (i.e., the lowest-cost treatment is the treatment of choice) and that there is no need to perform a joint comparison of costs and effects [3–6]. Nevertheless, as our understanding of sampling uncertainty in the comparison of costs and effects has grown, the cases where this interpretation is appropriate have shrunk.

Because cost-effectiveness ratios and net monetary benefits estimated from trial data are derived from samples drawn from the population, one should report the uncertainty in these outcomes that arises from such sampling [7]. The development of methods for measuring this uncertainty, such as confidence intervals for cost-effectiveness ratios [8–11], acceptability curves [12], and confidence intervals for net monetary benefit [13], has been an important methodologic advance in the economic evaluation of medical therapies [14]. When one uses these methods, a finding of significantly lower cost and an indistinguishable clinical outcome does not guarantee that one can be confident the significantly less expensive therapy is good value. Because of this uncertainty, the cost-minimization approach has been shown to be rarely appropriate as a method of analysis, and the need for a joint comparison remains under most circumstances [15]. Conversely, because it is possible to have more confidence in the combined outcome of differences in costs and effects than in either outcome alone, observing no significant difference in costs or effects need not preclude confidence that one of the two therapies is good value. In these cases, one should compare costs and effects jointly and report their sampling uncertainty.
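The INB and acceptability-curve calculations described above can be sketched with a simple nonparametric bootstrap. The Python example below uses wholly simulated cost and effect data and an invented grid of willingness-to-pay values; it is an illustrative sketch of the general technique, not a prescription for any particular trial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-patient costs and effects (e.g., QALYs) for two arms.
cost_new = rng.gamma(shape=2.0, scale=5000, size=150)
cost_old = rng.gamma(shape=2.0, scale=4000, size=150)
eff_new = rng.normal(0.70, 0.20, size=150)
eff_old = rng.normal(0.62, 0.20, size=150)

def inb(c1, c0, e1, e0, wtp):
    """Incremental net monetary benefit: wtp * dE - dC."""
    return wtp * (e1.mean() - e0.mean()) - (c1.mean() - c0.mean())

# Bootstrap the joint sampling distribution of (dC, dE) by resampling
# patients within each arm and recomputing the INB each time.
B = 2000
wtp_grid = [10_000, 30_000, 50_000]   # made-up willingness-to-pay values
reps = np.empty((B, len(wtp_grid)))
for b in range(B):
    i1 = rng.integers(0, len(cost_new), len(cost_new))
    i0 = rng.integers(0, len(cost_old), len(cost_old))
    for j, wtp in enumerate(wtp_grid):
        reps[b, j] = inb(cost_new[i1], cost_old[i0], eff_new[i1], eff_old[i0], wtp)

# The acceptability curve is the share of replicates with positive INB.
ceac = (reps > 0).mean(axis=0)
for wtp, p in zip(wtp_grid, ceac):
    print(f"WTP {wtp}: P(new therapy cost-effective) = {p:.2f}")
```

Plotting `ceac` against a finer willingness-to-pay grid yields the familiar cost-effectiveness acceptability curve.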

**Estimation of incremental costs.** For economic analysis, costs and cost differences between treatment groups should be expressed as arithmetic means, not medians, because this summary measure permits a budgetary assessment of treatment (N × arithmetic mean = total cost) and is the statistic of interest for health-care policy decisions [1,2]. The most common statistical test for differences in arithmetic means between treatment groups is the parametric *t*-test. Because cost data are often highly skewed, the normality assumption underlying this test is frequently called into question, and standard nonparametric tests (e.g., the Mann–Whitney *U*-test or Wilcoxon rank sum test) or parametric tests on normalizing transformations (e.g., the log transformation) are often used as substitutes. Yet these popular alternatives are not appropriate for drawing statistical inferences about differences in arithmetic mean costs [16–18]. For example, when one uses a *t*-test to evaluate the log of costs, the resulting *P*-value applies directly to the difference in the log of costs and to the difference in the geometric mean of costs; it may or may not apply to the difference in arithmetic mean costs. Similarly, when one uses a Mann–Whitney *U*-test, one is testing differences in the median of costs. Thus, statistical inferences about these other statistics may not be representative of inferences about the difference in arithmetic means, which is the statistic of interest.
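The divergence between geometric and arithmetic means is easy to demonstrate. The short Python simulation below, using made-up lognormal cost distributions, constructs two groups whose geometric means are essentially identical while their arithmetic means differ severalfold; a *t*-test on log costs would detect no difference even though a budget-relevant one exists.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical cost distributions with the SAME geometric mean
# (log-mean mu = 8 in both) but different skewness (log-sd sigma).
a = rng.lognormal(mean=8.0, sigma=0.5, size=100_000)
b = rng.lognormal(mean=8.0, sigma=1.5, size=100_000)

# Geometric means: exp(mean of logs) -- what a t-test on logs compares.
geo_a, geo_b = np.exp(np.log(a).mean()), np.exp(np.log(b).mean())
print(f"geometric means:  {geo_a:.0f} vs {geo_b:.0f}")   # nearly equal

# Arithmetic means: exp(mu + sigma^2/2) -- the policy-relevant statistic.
print(f"arithmetic means: {a.mean():.0f} vs {b.mean():.0f}")  # markedly different
```

The gap arises because the arithmetic mean of a lognormal variable is exp(mu + sigma^2/2), so the more skewed group has the larger mean cost despite the identical log-scale center.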

If one does not want to adopt a parametric *t*-test to directly test for differences in arithmetic mean costs, one can compare the arithmetic means by using a nonparametric bootstrap. This procedure has the added advantage of avoiding a parametric assumption about the distribution of costs. As a result, the nonparametric bootstrap has increasingly been recommended either as a check on the robustness of standard parametric *t*-tests, or as the primary statistical test for making inferences about arithmetic means for moderately sized samples of highly skewed cost data [18–20].
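A percentile-interval version of this bootstrap can be sketched as follows. The cost data here are simulated (gamma-distributed) stand-ins, and the resampling scheme is the basic case-resampling bootstrap: resample patients with replacement within each arm and recompute the difference in arithmetic means.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed cost samples for two trial arms.
treat = rng.gamma(shape=1.5, scale=6000, size=120)
control = rng.gamma(shape=1.5, scale=4500, size=140)

observed_diff = treat.mean() - control.mean()

# Nonparametric bootstrap: resample each arm with replacement and
# recompute the difference in arithmetic means.
B = 5000
diffs = np.empty(B)
for b in range(B):
    diffs[b] = (rng.choice(treat, size=treat.size, replace=True).mean()
                - rng.choice(control, size=control.size, replace=True).mean())

# Percentile 95% confidence interval for the mean cost difference.
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"mean difference: {observed_diff:.0f}, 95% percentile CI: ({lo:.0f}, {hi:.0f})")
```

If the interval excludes zero, one can conclude that the arithmetic mean costs differ without invoking any parametric assumption about the cost distribution.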

Even when treatment is assigned in a randomized setting, some authors use multivariable techniques to analyze costs. Multivariable analysis of costs may be superior to univariate analysis because it improves the power of tests for differences between groups (by explaining variation due to other causes). It also facilitates subgroup analyses for cost-effectiveness (e.g., by disease severity or by country or center). Finally, it accounts for potentially large and influential variations in economic conditions and practice patterns by provider, center, or country that may not be balanced by randomization.

Adoption of multivariable analysis does not, however, avoid the issues that arise in the univariate analysis of costs. For example, regression on the logarithmic transformation of costs was previously considered an ideal remedy for the violation of the assumption of a normally distributed error term that underlies ordinary least squares (OLS) regression. Nevertheless, as the shortcomings of multiple regression models of log-transformed costs became more widely publicized [17], generalized linear models have become the accepted alternative [21–23].

**Handling of incomplete cost data.** Incomplete or censored cost data occur in most randomized trials that follow participants for clinically meaningful lengths of time. Whether cost data were incomplete, the amount of incomplete data, and the statistical method adopted to address the problems posed by censoring should routinely be reported in trial-based analyses [2]. Although a mix of approaches exists for imputing cost data, recent statistical interest in censored cost data has led to the proposal of several methods of estimation that explicitly account for data made incomplete by loss to follow-up [24–30].

It is well established that these methods are prone to less bias than naive estimation methods wherein censored observations are either excluded from the analysis (i.e., complete-case analysis) or included as though they were complete observations (i.e., full-sample analysis) [25,26,28,31–33]. In the first naive approach, only the uncensored cases are used in the estimation of mean cost; this method is biased toward the costs of patients with shorter survival times because patients with longer survival times are more likely to be censored [25,32]. Moreover, completely discarding patients with censored data can lead to a loss of information and statistical power, which can be problematic when the percentage of censored cases is high. The second naive approach, which uses all cases without differentiating between censored and uncensored observations, is always biased downward because the costs incurred after the censoring times are not accounted for [32].
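One family of estimators proposed in this literature weights complete cases by the inverse of the Kaplan–Meier-estimated probability of remaining uncensored (an inverse-probability-weighting approach). The Python sketch below illustrates the idea on wholly simulated data, with an invented cost-accrual rule and a fixed 3-year horizon, alongside the two naive estimators; it is a teaching sketch under these assumptions, not a full implementation of any published method.

```python
import numpy as np

rng = np.random.default_rng(4)
n, tau = 500, 3.0   # tau: time horizon (years) over which costs are totalled

# Hypothetical data: cost accrues at 10,000 per year of survival within the
# horizon; follow-up may be cut short by independent censoring.
surv = np.minimum(rng.exponential(scale=2.0, size=n), tau)
cens = rng.uniform(0.5, 4.0, size=n)
follow = np.minimum(surv, cens)
complete = surv <= cens               # cost fully observed over the horizon?
true_cost = 10_000 * surv             # what full follow-up would have recorded
obs_cost = 10_000 * follow            # what the trial actually recorded

# Kaplan-Meier estimate of the censoring survival function K(t) = P(C > t),
# treating censoring (not the clinical endpoint) as the "event".
order = np.argsort(follow)
t_sorted = follow[order]
censor_event = (~complete)[order].astype(float)
at_risk = n - np.arange(n)
km = np.cumprod(1.0 - censor_event / at_risk)

def K(times):
    idx = np.searchsorted(t_sorted, times, side="right") - 1
    return np.where(idx < 0, 1.0, km[np.clip(idx, 0, n - 1)])

# Weighted complete-case estimator: each complete observation is weighted by
# the inverse probability of having remained uncensored until its endpoint.
weights = np.where(complete, 1.0 / np.maximum(K(follow), 1e-12), 0.0)
ipw_mean = np.sum(weights * obs_cost) / n

print(f"complete-case mean: {obs_cost[complete].mean():.0f}")
print(f"full-sample mean:   {obs_cost.mean():.0f}")
print(f"IPW-adjusted mean:  {ipw_mean:.0f}")
print(f"true mean cost:     {true_cost.mean():.0f}")
```

On this simulated example the full-sample mean is biased downward, as the text describes, while the weighted estimator recovers a value close to the true mean cost over the horizon.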

### Study Selection

This review included published studies evaluating economic outcomes based on patient-specific cost or resource-use data collected in randomized controlled trials. A search was conducted in the MEDLINE database as of September 2004 for all studies that included terms related to costs (e.g., “cost(s),” “economic evaluation(s),” or “health economic(s)”) and clinical trials (e.g., “trial(s)” or “randomized controlled trials”) in the title, abstract, or MeSH headings. The search was limited to publications in English, involving human subjects, and published during 2003. This search identified approximately 650 eligible articles. A majority of these studies were excluded upon review of the abstract, where it was clear that they did not report clinical trial-based economic results. The full text was reviewed for 162 articles. Studies were excluded if they did not collect or analyze patient-level costs, if clinical trial data were applied in a decision-analytic model, or if the study was not in fact a randomized trial. This left 115 articles in the final review.

### Data Abstraction

Data were extracted by using a specially designed data abstraction form. The first part of the form collected general study information such as the country where the trial was conducted, the broad clinical area, and the type of intervention. The second part collected specific information on the economic outcome studied, the analysis of costs, and the approach to handling incomplete data. We first determined whether a joint comparison of costs and effects was performed in each study and, if not, whether this was justified. For studies that estimated an ICER or INB, we examined whether and how stochastic uncertainty was estimated. We then focused on the analysis of cost data: how costs were summarized, the statistical test used to compare costs across treatment groups, and any multivariable technique used to report an adjusted incremental cost estimate. Lastly, we collected information on whether the study reported incomplete cost data and the technique, if any, used to address the problem.

Assessments were carried out by one assessor (J.A.D.). The reliability of the data abstraction was monitored using an independent assessment by a second author (H.A.G.) of a 20% random sample of the 115 studies. Agreement was complete for the items reported in this article. Only in the case of one item (i.e., technique for handling incomplete cost data) was discussion needed to determine the classification, because the reporting of methods to account for censored cost data was unclear in several studies.

### Reporting of Results

We report the number and proportion of the 115 studies that do and do not conform with each of the principles for the statistical evaluation of cost-effectiveness set forth above. We also investigate whether the statistical methods used were associated with the number of participants in the randomized trial. To do so, we report selected results stratified by the sample size of the study (fewer than 200 subjects, between 200 and 999 subjects, and 1000 or more subjects).