Utility-based outcomes made easy: The number needed per quality-adjusted life year gained. An observational cohort study of tumor necrosis factor blockade in inflammatory arthritis from Southern Sweden
To introduce a novel, simple, utility-based outcome measure, the number needed per quality-adjusted life year (QALY) gained (NNQ), and to apply it in clinical practice in anti–tumor necrosis factor (anti-TNF)–treated patients with rheumatoid arthritis (RA), psoriatic arthritis (PsA), and spondylarthritis (SpA).
The NNQ is the number of patients one has to treat in order to gain 1 QALY. It is calculated as the inverted value of the utility gain (area under the curve) over 1 year in a cohort subjected to an intervention. EuroQol Index utility data from the South Swedish Arthritis Treatment register were used.
Patients with RA (n = 1,001), PsA (n = 241), and SpA (n = 255) were eligible for the study. First, second, and third treatment courses were studied. For RA, NNQ was 4.5, 6.4, and 5.2 for first, second, and third courses, respectively. For PsA and SpA, NNQ was 4.2–4.5, irrespective of treatment order. Treatment groups with <50 patients were not analyzed. During the study period 2002–2007, there were no secular trends of utility gains.
The NNQ is an easily derived and understandable utility-based outcome measure that may be useful for stakeholders and decision makers as well as for clinicians. It was readily applied in this study of TNF blockade across 3 arthritis diagnoses. NNQ varied little over diagnoses and treatment course order, with a possible exception in second treatment course in RA.
With the current trend toward new and costly modalities of arthritis treatment, interest in evaluating treatment from a health-related quality of life (HRQOL) and economic point of view is increasing. Health economic studies generally involve the gathering of real costs and complicated mathematical models, and they are seldom very transparent. There seems to be a need for a simple and intuitive measure of the extent to which an intervention is worthwhile. In an attempt to fill this need, we propose a new, utility-based outcome measure, the number needed (to treat) per quality-adjusted life year (QALY) gained (NNQ).
A number of composite activity indices and response criteria have been devised to evaluate the treatment of inflammatory arthritis, some of them disease specific and others generic. A few, such as the Stanford Health Assessment Questionnaire (HAQ) (1), were developed for a specific diagnosis (for the HAQ, rheumatoid arthritis [RA]) but have also been applied in other diseases (2, 3), and by some the HAQ has even been suggested to represent a generic measure of function (4). Many activity indices, such as the Disease Activity Score (5) and Simplified Disease Activity Index (6), consist of patient- and evaluator-derived measures as well as of a laboratory measure of inflammation, whereas others are comprised solely of patient-derived data, such as the Patient Activity Scale or Rheumatoid Arthritis Impact of Disease score (7, 8). All of these are dependent both on inflammation and tissue damage (joint damage/erosions). In the case of the HAQ disability index, the relative importance of inflammation versus damage has been quantified in an RA cohort using randomized controlled trial (RCT) data (9).
From a biologic and theoretical standpoint, the separation of inflammation and tissue damage is pivotal to the understanding of function in inflammatory arthritis. Clinically, however, as well as from the patient perspective, a broader and “softer” concept of perceived health, such as HRQOL (10), may reflect important aspects of the disease process not covered by the usual activity and function measures. HRQOL may be measured directly by asking individuals to estimate their life quality relative to an ideal, perfect state by use of hypothetical or real scenarios such as standard gamble and time trade-off, or a visual analog scale (VAS) may be employed. Standard gamble and time trade-off appear to work better in situations such as surgery or terminal illness than in chronic disease (11). In chronic diseases, indirect methods involving questionnaires with health-related items are therefore widely used, for example, the EuroQol (EQ-5D) Index (12–14), the Short Form 6-dimension utility score (15), and the Health Utility Index (16). All of these instruments are generic.
Whereas utility values derived from standard gamble or time trade-off relate to the individual's perception of his or her health, the indirect instruments refer to a reference population, e.g., the general public. This may result in utilities better suited for health economic modeling. In a population sample, the questionnaire, e.g., the EQ-5D, is administered together with one of the direct quality of life instruments and the various health states defined by the former are calibrated with the latter. This valuation yields a “social tariff” for the indirect instrument by way of an algorithm involving, among other features, weighting of the various items. The tariff thus describes each of the valuated health states as a utility value assigned by the direct HRQOL measurements in the reference population. In principle, indirect HRQOL measures should be validated in each disease studied and also in the relevant population to account for cultural, socioeconomic, and other differences (17). The EQ-5D has been validated in a Swedish population sample (18), but there is no Swedish tariff. The weights of the UK tariff (19) employed in the current study are displayed in Table 1. An example of utility calculation from a health state is given in Table 2. There are also disease-specific HRQOL instruments, but they are not used for the calculation of utility and are thus outside the scope of this article. Which of the many HRQOL instruments should be chosen is largely dependent on the kind of investigation being performed (20).
Table 1. Weights for the various items of EQ-5D according to the UK tariff (N3 model)*
1 if mobility is level 2; otherwise 0
1 if mobility is level 3; otherwise 0
1 if self-care is level 2; otherwise 0
1 if self-care is level 3; otherwise 0
1 if usual activities is level 2; otherwise 0
1 if usual activities is level 3; otherwise 0
1 if pain/discomfort is level 2; otherwise 0
1 if pain/discomfort is level 3; otherwise 0
1 if anxiety/depression is level 2; otherwise 0
1 if anxiety/depression is level 3; otherwise 0
1 if any dimension is level 3; otherwise 0
* Adapted from Nan et al (40). The minimum clinically important difference in EuroQol (EQ-5D) Index utilities is considered to be ∼0.005 (42). Any problem in any item except “some problem” with usual activities will result in disutility.
Table 2. Utility estimation of EuroQol (EQ-5D) Index health state 11223 (item 1, mobility, scored at level 1, etc.) using the UK valuation*
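The calculation that Table 2 illustrates can be sketched in a few lines of code. This is a minimal sketch, not the authors' implementation: the coefficients below are those commonly published for the UK time trade-off N3 model and are an assumption here, since they are not taken from Table 1 of this text. The function name and the dictionary layout are likewise hypothetical.

```python
# Coefficients of the UK time trade-off N3 model as commonly published.
# ASSUMPTION: these values are not copied from Table 1 of this text;
# verify them against the original tariff before any real use.
CONSTANT = 0.081   # subtracted once if any dimension is above level 1
N3 = 0.269         # subtracted once if any dimension is at level 3
DECREMENTS = {
    "mobility":           {2: 0.069, 3: 0.314},
    "self_care":          {2: 0.104, 3: 0.214},
    "usual_activities":   {2: 0.036, 3: 0.094},
    "pain_discomfort":    {2: 0.123, 3: 0.386},
    "anxiety_depression": {2: 0.071, 3: 0.236},
}
DIMENSIONS = list(DECREMENTS)  # order of the digits in a health state


def eq5d_uk_utility(state: str) -> float:
    """Return the utility of a 5-digit EQ-5D health state such as '11223'."""
    levels = [int(ch) for ch in state]
    if len(levels) != 5 or any(lv not in (1, 2, 3) for lv in levels):
        raise ValueError("state must be five digits, each 1, 2, or 3")
    if all(lv == 1 for lv in levels):
        return 1.0  # full health: no decrements apply
    total = CONSTANT
    for dim, lv in zip(DIMENSIONS, levels):
        total += DECREMENTS[dim].get(lv, 0.0)  # level 1 carries no decrement
    if 3 in levels:
        total += N3
    return round(1.0 - total, 3)


print(eq5d_uk_utility("11223"))
print(eq5d_uk_utility("33333"))
```

Under these assumed coefficients, health state 11223 evaluates to 0.255 and the worst state, 33333, to −0.594, which agrees with the lower bound of −0.59 quoted for the UK tariff.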
Utility may be regarded as a preference made by the patient (given a choice) scored between 0 (death) and 1 (perfect health). Health states worse than death may be assigned in the valuation process; the EQ-5D utility by the UK tariff may score down to −0.59 (17). By multiplying the time spent in a certain health state by its utility, one may calculate QALYs under certain assumptions (21). One year spent in a state with 0.5 utility, for example, yields 0.5 QALY. The utility gained by some intervention, e.g., tumor necrosis factor (TNF) blockade in inflammatory arthritis, is obtained as the difference between 2 time points, analogous to response for the activity indices. This difference, i.e., the ΔEQ-5D, multiplied by the time elapsed, yields the number of QALYs gained. A QALY may in turn be assigned a price that a funding source is considered willing to pay.
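The QALY arithmetic described above can be written out as a short sketch. The function names are hypothetical, and the second example uses made-up utilities purely for illustration.

```python
def qalys(utility: float, years: float) -> float:
    """QALYs accrued by spending `years` in a state with the given utility."""
    return utility * years


def qaly_gain(u_before: float, u_after: float, years: float) -> float:
    """QALYs gained when utility changes from u_before to u_after
    and the new level is held for `years` (delta utility x time)."""
    return (u_after - u_before) * years


# One year at utility 0.5, as in the text.
print(qalys(0.5, 1.0))

# Hypothetical example: utility rises from 0.40 to 0.62 and stays there for 1 year.
print(round(qaly_gain(0.40, 0.62, 1.0), 2))
```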
The aim of the present study was to introduce a new, utility-based outcome measure, the NNQ, as outlined below, and to apply it to a cohort of patients with RA, psoriatic arthritis (PsA), and spondylarthritis (SpA) who were treated with anti-TNF drugs in clinical practice. We also wanted to study possible secular trends in NNQ.
PATIENTS AND METHODS
We propose a new, simple measure, the NNQ, which is based on the utility and QALY concepts, as a group-level estimate of the degree to which an intervention is worthwhile from an HRQOL perspective. The NNQ is the number of patients that must be subjected to an intervention in order to gain 1 QALY, and it is calculated as the inverted value of the product of the utility gain (Δ value) and the time (in years) during which this gain takes place.
Utility (u) may be expressed as a function of time (t); u0, u1, u1.5, u3, etc., are utilities at baseline and at 1 month, 1.5 months, 3 months, etc., respectively, and Δu = u − u0. The QALY gain during 1 year is the area under the curve (AUC) of Δu = f(t). If QALY gain occurs immediately and remains relatively stable for 1 year, as we have demonstrated to be the case for RA, PsA, and SpA (22), the NNQ may be calculated as NNQ = 1/(u12 − u0). If, however, the utility gain is not immediate but more gradual or fluctuating, the denominator is replaced by the AUC:

\[ \mathrm{NNQ} = \frac{1}{\frac{1}{12}\int_{0}^{12} \Delta u(t)\,dt} \]

where t = time in months. Thus, NNQ is the inverted value of the number of QALYs gained during 1 year. In practice, due to the immediate onset of treatment effect in TNF blockade in RA, PsA, and SpA, the NNQ in these diseases may be calculated from the mean QALY gain for each patient, assuming no spontaneous remissions and no effect on mortality over the time period under study. In other diagnoses or therapies, it may be more appropriate to use a formula that takes the distribution of utility measurements over time into account. The AUC thus is affected by the shape of the Δu = f(t) curve.
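The two ways of obtaining the NNQ described above (from the 12-month utility difference, or from the AUC of Δu) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the AUC is approximated with the trapezoid rule, and the utility series is invented rather than taken from the register.

```python
def nnq_simple(u0: float, u12: float) -> float:
    """NNQ when the gain is immediate and stable: 1 / (u12 - u0)."""
    gain = u12 - u0
    if gain <= 0:
        raise ValueError("no positive utility gain; NNQ is undefined")
    return 1.0 / gain


def qaly_gain_auc(months, utilities) -> float:
    """Trapezoid-rule AUC of delta-u over time, converted to QALYs.

    `months` are observation times in months (first entry is baseline),
    `utilities` the utilities observed at those times.
    """
    if len(months) != len(utilities) or len(months) < 2:
        raise ValueError("need matching series with at least 2 observations")
    u0 = utilities[0]
    area = 0.0  # accumulated in utility x months
    for i in range(1, len(months)):
        width = months[i] - months[i - 1]
        mean_delta = ((utilities[i] - u0) + (utilities[i - 1] - u0)) / 2.0
        area += width * mean_delta
    return area / 12.0  # months -> years


def nnq_auc(months, utilities) -> float:
    """NNQ as the inverse of the AUC-based QALY gain."""
    gain = qaly_gain_auc(months, utilities)
    if gain <= 0:
        raise ValueError("no positive QALY gain; NNQ is undefined")
    return 1.0 / gain


# Hypothetical patient group: utility jumps from 0.40 to 0.60 within 2 weeks.
months = [0, 0.5, 1.5, 3, 6, 12]
utilities = [0.40, 0.60, 0.60, 0.60, 0.60, 0.60]
print(round(nnq_simple(utilities[0], utilities[-1]), 2))
print(round(nnq_auc(months, utilities), 2))
```

With a near-immediate and stable gain, as in this invented series, the two estimates differ only slightly; the AUC form matters more when the gain is gradual or fluctuating.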
Patients with inflammatory arthritides in Southern Sweden who were starting a course of biologic treatment were entered into the South Swedish Arthritis Treatment Group register as detailed in previous publications (23, 24). The Ethics Committee of the Faculty of Medicine, Lund University, concluded that no approval was needed because of the safety and quality surveillance character of the registry.
Demographic data, treatment data, and starting date were collected at baseline in addition to core set outcome variables. The EQ-5D excluding the 20-cm global VAS scale was administered. Data were collected at 0, 0.5, 1.5, 3, 6, and 12 months. At the start of each treatment course, a checklist with the American College of Rheumatology (formerly the American Rheumatism Association) 1987 diagnostic criteria for RA (25), the modified New York criteria for ankylosing spondylitis (26), and the European Spondylarthropathy Study Group SpA criteria (27) was to be filled out.
To be eligible for the study, patients had to have a diagnosis of RA, PsA, or SpA as provided by the attending rheumatologist and start a treatment course with adalimumab, etanercept, or infliximab during the period January 2002 to December 2007. Data extraction was closed December 1, 2008. Complete data at baseline were mandatory. Treatment courses were assigned as being first, second, or third. In general, second and third treatment courses refer to patients also included as first course. During the study period, the availability of the 3 TNF blockers was variable, and they have thus been studied together. Because of limited numbers, secular trends were only studied in patients with RA starting their first anti-TNF drug.
In the present study, utility gain was calculated as the mean utility during the first year minus baseline utility for each treatment course. The AUC was also calculated in a subgroup with a sufficient number of EQ-5D observations according to the trapezium rule. The UK tariff was used (19, 28). Utility gain and NNQ were calculated 1) for all eligible treatment courses together, 2) for courses with a duration ≥1 year, and 3) for courses of <1 year of duration, separately for each diagnosis and treatment course order. Furthermore, all utility gain and NNQ values were calculated with and without correction for the actual duration of treatment. For uncorrected values, AUC was calculated as utility gain multiplied by 1 year, irrespective of real treatment duration (last observation carried forward [LOCF]); for the corrected values, utility gain was multiplied by the fraction of the year that treatment was actually given.
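The difference between uncorrected and time-corrected values can be illustrated with a small sketch. It assumes, as one reading of the description above, that the per-course gain is the mean of the follow-up utilities minus baseline, that correction multiplies this gain by the fraction of the year on treatment, and that the group NNQ is the inverse of the mean gain. The function names and the two example courses are hypothetical.

```python
from statistics import mean


def course_qaly_gain(baseline, followups, months_on_drug, time_corrected):
    """QALY gain over 1 year for a single treatment course."""
    gain = mean(followups) - baseline  # mean first-year utility minus baseline
    if time_corrected:
        # Scale by the fraction of the year actually spent on the drug.
        gain *= min(months_on_drug, 12.0) / 12.0
    return gain


def group_nnq(courses, time_corrected):
    """Group-level NNQ: inverse of the mean per-course QALY gain."""
    gains = [
        course_qaly_gain(c["baseline"], c["followups"], c["months"], time_corrected)
        for c in courses
    ]
    return 1.0 / mean(gains)


# Two invented courses: one completed year, one stopped after 3 months.
courses = [
    {"baseline": 0.40, "followups": [0.65, 0.70, 0.70, 0.65, 0.70], "months": 12.0},
    {"baseline": 0.40, "followups": [0.50, 0.50], "months": 3.0},
]
print(round(group_nnq(courses, time_corrected=False), 2))
print(round(group_nnq(courses, time_corrected=True), 2))
```

In this toy example the corrected NNQ is higher than the uncorrected one, because the short course contributes only a quarter of its utility gain once treatment duration is taken into account.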
The NNQ concept and its simplified health economic estimates were also tested in an alternative RA cohort for which the results of a full incremental cost-effectiveness ratio (ICER) analysis have been published (29). Although this was not an ideal health economic study design because it did not compare 2 different treatments/interventions, it describes the large problems with early ICER studies after the introduction of new treatments.
Values are the means and 95% confidence intervals (95% CIs) unless otherwise indicated. The 2 different methods of AUC calculation were compared using Spearman's correlation coefficient. Only groups with >50 observations were used.
The baseline characteristics of the patients at first anti-TNF treatment are shown in Table 3. For RA, there were 1,001, 468, and 139 eligible first, second, and third treatment courses, respectively. The corresponding figures for PsA were 241, 86, and 17, respectively, and for SpA they were 255, 63, and 26, respectively. There were several differences regarding age, sex, disease duration, medication, and disease activity indices between the diagnostic entities. Also, HAQ scores differed, whereas patients' global VAS and EQ-5D utilities were similar.
Table 3. Baseline characteristics at initiation of first anti-TNF treatment*

                                            RA (n = 1,001)       SpA (n = 255)        PsA (n = 241)
Age, years                                  55.8 (55.0, 56.7)    43.7 (42.2, 45.1)    46.7 (45.1, 48.3)
Disease duration, years                     10.9 (10.2, 11.6)    14.0 (12.5, 15.4)    10.6 (9.47, 11.8)
Previous DMARDs, no.                        2.63 (2.54, 2.72)    1.53 (1.41, 1.64)    1.72 (1.59, 1.85)
Ongoing DMARDs, no.                         0.92 (0.89, 0.96)    0.71 (0.63, 0.78)    0.81 (0.73, 0.88)
DAS28 score (0 to 10)                       5.37 (5.30, 5.45)    3.76 (3.61, 3.91)    4.51 (4.33, 4.69)
CDAI score (0 to 100)                       29.4 (28.6, 30.1)    15.8 (14.7, 16.8)    22.5 (21.0, 24.0)
VAS global (0 to 100)                       61.6 (60.2, 62.9)    61.6 (59.0, 64.2)    61.5 (58.7, 64.2)
Evaluator's global (Likert scale, 0 to 4)   2.28 (2.24, 2.32)    1.98 (1.89, 2.06)    2.04 (1.96, 2.12)
HAQ score (0 to 3)                          1.20 (1.16, 1.23)    0.78 (0.70, 0.85)    0.89 (0.82, 0.96)
EQ-5D utility (−0.59 to 1)                  0.40 (0.38, 0.42)    0.44 (0.40, 0.48)    0.40 (0.36, 0.44)
Clinical signs of disease, %
  Spondylitis plus peripheral disease

* Values are the mean (95% confidence interval) unless otherwise indicated. Anti-TNF = anti–tumor necrosis factor; RA = rheumatoid arthritis; SpA = spondylarthritis; PsA = psoriatic arthritis; DMARDs = disease-modifying antirheumatic drugs; DAS28 = Disease Activity Score in 28 joints; CDAI = Clinical Disease Activity Index; VAS = visual analog scale; HAQ = Health Assessment Questionnaire; EQ-5D = EuroQol Index.
The mean NNQ values for the various diagnoses and treatment course numbers are given in Table 4. In the analysis of all courses with EQ-5D data (LOCF approach), the NNQ for RA was 4.5, 6.4, and 5.3 for the first, second, and third treatment course, respectively, without time correction. The time-corrected NNQ values were slightly higher: 4.7, 6.7, and 5.7, respectively. For RA courses ≥1 year, NNQ values were 3.8, 5.4, and 4.4 (time correction not needed). Uncorrected NNQ for RA courses <1 year were 10.4, 12.6, and 9.6, respectively, whereas the corresponding corrected values were 16.5, 20.6, and 17.0, respectively. NNQ data for PsA and SpA are displayed in Table 4.
Table 4. Utility gain (ΔEQ-5D) and NNQ for first, second, and third anti-TNF course for RA, PsA, and SpA*
Data not given for <50 observations. Values are the mean (95% confidence interval). NNQ = number needed per quality-adjusted life year gained. See Table 3 for additional definitions.
To assess the correctness of utility AUC based on the mean of utility observations minus the baseline utility value, AUC was estimated by an alternative approach (the trapezium rule) in a subgroup (n = 696) of first-course RA treatment with a sufficient number of observations. The methods yielded similar results, and the correlation coefficient was found to be 0.99 (Figure 1).
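For readers who wish to repeat this kind of comparison, Spearman's rank correlation between two sets of per-patient AUC estimates can be computed as below. This is a generic, dependency-free sketch (Pearson correlation applied to average ranks), not the statistical software used in the study; the function names and the example values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-patient QALY gains from the two AUC methods.
mean_method = [0.10, 0.22, 0.31, 0.05, 0.18]
trapezium_method = [0.11, 0.21, 0.33, 0.04, 0.17]
print(spearman(mean_method, trapezium_method))
```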
To study possible secular trends regarding utility gain (ΔEQ-5D) during the study period, we also calculated NNQ per year of anti-TNF initiation (Figure 2). No obvious trends could be seen.
Our previous ICER analysis of patients with RA treated with anti-TNF drugs in 1999–2002 (29) resulted in a calculated incremental cost during the first treatment year of €12,184. However, these patients had a very high QALY gain of 0.37, and the corresponding cost per QALY was between €36,900 and €64,100, depending on when the QALY gain was assumed to take place. If improvement had taken place immediately, the cost per QALY would have been €33,000. Translating this into the simplified NNQ concept, one must treat 2.7 patients to gain 1 QALY. With an annual drug cost of approximately €14,000 per patient, the total cost would be €37,800, i.e., close to the result of the full ICER analysis.
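The figures in this comparison can be retraced with a few lines of arithmetic. The sketch below uses only the numbers quoted in the paragraph above; the variable names are hypothetical, and the NNQ is rounded to one decimal before multiplying, as the text does.

```python
qaly_gain = 0.37                  # QALY gain in the earlier RA cohort (29)
incremental_cost_eur = 12_184     # incremental cost, first treatment year
annual_drug_cost_eur = 14_000     # approximate annual drug cost per patient

nnq = round(1.0 / qaly_gain, 1)                       # patients treated per QALY gained
drug_cost_per_qaly = nnq * annual_drug_cost_eur       # simplified NNQ-based estimate
icer_immediate = incremental_cost_eur / qaly_gain     # cost per QALY, immediate gain

print(nnq)
print(round(drug_cost_per_qaly))
print(round(icer_immediate, -3))
```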
In the current study, we propose a new measure, the NNQ, providing an easily understood estimate of one aspect (i.e., HRQOL) of the degree to which a treatment is worthwhile in a population. We found that the NNQ for TNF blockade of inflammatory arthritis in our setting was 4–6, irrespective of diagnosis or treatment course order.
The time-corrected NNQ values must be considered more reflective of actual utility gain in the data set under consideration. The optimal basis for utility gain calculation (all data, courses with complete data only, or courses <1 year) is not self-evident. In the observational setting, giving all 3 will provide the most complete information. Patients with >1 year of treatment constitute a selection of responders, and courses <1 year the opposite. The higher NNQ values of the latter reflect the much smaller utility gain compared with those remaining on therapy. Time-corrected NNQ based on all data may be considered to represent a tempered estimate of the treatment effect in the cohort as a whole.
When NNQ was corrected for the time of courses <1 year, values became slightly higher, as expected. NNQ based upon courses >1 year was, not surprisingly, lower than for all courses; this group represents patients continuing treatment, whereas those terminating treatment before 1 year conceivably, in many cases, represent treatment failure or toxicity in patients gaining fewer QALYs. This is supported by the considerably higher NNQ values found for those treated for <1 year, with a further increase after time correction, reflecting the small utility AUC gain in this group. However, NNQ based on all data rose only slightly after time correction. Thus, the treatment courses <1 year, in spite of constituting a sizable proportion (25%) of all treatments, affected the crude NNQ only to a minor degree. However, including all utility data without compensating for actual time receiving the drug tends to inflate utility gain. Of course, results were even more inflated when only courses >1 year were taken into account, similarly to the results of open-label extensions of RCTs.
In general, the time-corrected NNQ yielded results very similar to those of the uncorrected NNQ (between 4 and 6) across diagnoses and treatment courses, with overlapping 95% CIs. The only exception was a significantly higher NNQ of 6.7 for the second anti-TNF treatment course in patients with RA, suggesting a selection of cases less prone to improve in health utility after switching to a second anti-TNF drug. However, the pattern was reproduced neither in the other diagnostic entities nor in the third-course patients, thus casting some doubt on the validity of this finding. The results were more variable in the smaller groups, with wide 95% CIs, and we therefore chose to only study groups with numbers exceeding 50.
It is difficult to account for the cost of adverse events in health economic calculations. Life-table analyses reveal length of time spent on treatment, but withdrawal gives rise to costs represented by the smaller utility and QALY gain, reflected in higher NNQ values. Thus, NNQ includes information on the (utility) cost of adverse events, albeit not the whole truth (30).
The lack of secular trends for NNQ over time (Figure 2) was somewhat unexpected in view of clear trends regarding baseline characteristics in our setting during 1999–2007 (31). However, we did not find such trends for baseline EQ-5D utilities for the current time period 2002–2008 (22). Also, it must be remembered that NNQ represents change after intervention, which is not necessarily a function of baseline characteristics.
Utility development was studied for 1 year, but it is possible that the NNQ observed would remain valid for a longer period of time. Utility in those remaining on therapy tends to remain constant after the initial rise (22), and data from our (32, 33) and other (34) cohorts have shown that the number of patients terminating treatment tends to level out over time.
The NNQ concept, like the concept of utility, is not limited to rheumatology but should easily lend itself to a wide range of interventions and diagnostic entities. We have thus calculated a few examples of NNQ values based on published data. In a large British study of gastroesophageal reflux, surgery was found to produce a gain of 0.088 QALYs, equivalent to an NNQ of 11.4 (35), i.e., one has to operate on 11–12 patients to gain 1 QALY. Surgery for a herniated lumbar disc was, in a Swedish study, associated with a QALY gain of 0.41, corresponding to an NNQ of 2.4 (36). Both of these studies utilized EQ-5D utilities. In a health economic evaluation of valsartan for heart failure in patients who had had a myocardial infarction and were not suitable for angiotensin-converting enzyme inhibitors, the number of incremental QALYs gained, based on literature-derived utilities weighted for various cardiovascular events, was found to be 0.5021, i.e., an NNQ of 1.99 (37).
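Each of these three NNQ values is simply the reciprocal of the published QALY gain, rounded as in the text:

```latex
\[
\frac{1}{0.088} \approx 11.4, \qquad
\frac{1}{0.41} \approx 2.4, \qquad
\frac{1}{0.5021} \approx 1.99
\]
```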
In a retrospective, observational study of anti-TNF treatment in 297 patients with RA, PsA, and ankylosing spondylitis (AS), the number needed to treat (NNT) in order to gain at least the minimum clinically important difference in HAQ score was calculated by inverting the percentage of patients reaching this threshold (34). This is not the usual definition of NNT (38); the values found were 1.94, 1.88, and 2.30 for RA, PsA, and AS, respectively, based on ΔHAQ over 1 year. However, HAQ is not a utility measure, and the validity of this “NNT” metric appears somewhat unclear.
The concept “NNT to gain 1 QALY” was used in a health economic evaluation of orlistat in the treatment of overweight patients (39). No details regarding the calculation of this estimate are given, however, and the study is based on pooled data from 5 RCTs comparing calorie-reduced diet plus orlistat or placebo. The incremental cost per QALY gained is given, and it may thus be inferred that “NNT to gain 1 QALY” refers to the usual definition of NNT involving a placebo group. This is not the case with the NNQ described in the current study, in which the notion that patients remain in the pretreatment health state for the full year of followup if left untreated is an integral part of the QALY gain of the time-corrected NNQ.
In order to test the face validity of the NNQ, we applied the concept to a previously described RA cohort (29). Using the NNQ resulted in a roughly similar estimate of annual costs as the full ICER analysis.
There are limitations to our study. The NNQ may be regarded as an oversimplification. Like other measures, it is no better than the data from which it is derived, but in addition, it entails some approximations and assumptions that must be taken into account.
First, mean utility gain assumes a constant health state for 1 year. This could be amended by taking more measurements during the observation period and basing AUC calculations on all of these. The 2 AUC calculation methods employed in the present example, however, yielded very similar results. Baseline utility was also a single value, rather than being based on 2 observations some time apart. The second and third treatment courses in our setting, however, have baseline utilities roughly the same as that of the first course (22). This observation supports the reliability of the reported baseline utility values because patients tend to return to their original utility level upon anti-TNF cessation.
Second, there are inherent drawbacks to the utility measures as compared with the HAQ and other scales; many of them lack robustness in at least some respect. The EQ-5D represents a compromise exhibiting feasibility, acceptable responsiveness, and construct validity, but rather poor reliability (20). On the other hand, generic measures like EQ-5D seem to give more uniform results across diagnostic entities than the HAQ or disease activity measures, which are more dependent on inflammation. Furthermore, the transformation of the questionnaire raw scores into utility values has its pitfalls. We have used the UK valuation of the EQ-5D (28), which was made in the beginning of the 1990s, with a British, general population sample. It is possible that the preferences of a Swedish population 10 years later were different. However, there is no Swedish EQ-5D valuation. The item weights and algorithms of tariffs vary, and so do the utilities resulting from their respective application. By applying different tariffs in the same study, widely differing QALY gains and cost-effectiveness estimations may result (40, 41). Comparing utility levels, gains, and QALYs (and, consequently, NNQ) from various studies must be done carefully, taking the tariffs and algorithms used into account.
Finally, the term NNQ may sound too much like the NNT statistic commonly used in RCTs (38). The NNT is the inverted value of the absolute risk reduction, i.e., the difference between the absolute risk for a defined outcome in populations exposed to the intervention studied and a standard (or no) intervention, respectively. By contrast, the NNQ does not include a parallel contrast population. We assume that utility improvement is unlikely in patients with RA, PsA, and SpA who are not given TNF inhibitors. This may not be true in other diseases or therapies. Furthermore, in a clinical setting such as ours, both the intervention (3 anti-TNF drugs) and the study population (nonrandomized patients in day-to-day care) lack homogeneity, making it difficult to relate the NNQ values found to a well-defined treatment situation. The purpose of the study, however, was not just to investigate the effect of TNF blockers, but rather to present and test the feasibility and face validity of the NNQ concept.
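The distinction between the two statistics can be made concrete with a brief sketch. The function names and all input numbers are hypothetical and chosen only to show that the NNT requires outcome risks from two groups, whereas the NNQ as defined here is computed from the mean QALY gain of a single treated cohort.

```python
def nnt(risk_control: float, risk_treated: float) -> float:
    """Number needed to treat: inverse of the absolute risk reduction,
    which requires a comparison (control) population."""
    arr = risk_control - risk_treated
    if arr <= 0:
        raise ValueError("no absolute risk reduction; NNT is undefined")
    return 1.0 / arr


def nnq(mean_qaly_gain: float) -> float:
    """Number needed per QALY gained: inverse of the mean 1-year QALY gain
    in the treated cohort, with no parallel comparison population."""
    if mean_qaly_gain <= 0:
        raise ValueError("no positive QALY gain; NNQ is undefined")
    return 1.0 / mean_qaly_gain


# Invented inputs: 40% vs. 20% risk of a bad outcome; mean QALY gain of 0.25.
print(round(nnt(0.40, 0.20), 1))
print(round(nnq(0.25), 1))
```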
The NNQ is a group-level measure giving an idea of the extent to which the intervention studied is worthwhile in a given population. It may thus be regarded as a much simplified health economic measure, partly integrating loss due to adverse events. It does not include the gathering of real costs, a burdensome and often precarious procedure including many assumptions. The sum spent on an anti-TNF drug approximates the incremental cost, and the gain is represented by the QALYs.
NNQ is easy to determine and follow. It does not necessitate the gathering of actual costs, and it does not preclude health economic modeling. The NNQ may help noneconomists, politicians, patients, and stakeholders understand important aspects of how interventions should be valued economically. Health economic studies are generally not very transparent, and they rarely show that the drug studied (from the sponsoring company) is not worth the expense. It should be possible to apply the NNQ in many settings. Validation in other cohorts, both in trials and in clinical practice, is called for, however, to determine its role in the armamentarium of outcome measures in rheumatology and other fields.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Gülfe had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Gülfe, Kristensen, Saxne, Petersson, Geborek.
Acquisition of data. Petersson, Geborek.
Analysis and interpretation of data. Gülfe, Kristensen, Saxne, Jacobsson, Petersson, Geborek.
The authors are indebted to all colleagues and staff in the South Swedish Arthritis Treatment Group for cooperation and data supply, to Jan-Åke Nilsson for help with statistical calculations, and to Martin Neovius for valuable suggestions.