A pooled analysis of quality of life measures and adverse events data in North Central Cancer Treatment Group lung cancer clinical trials
Article first published online: 8 JAN 2007
Copyright © 2007 American Cancer Society
Volume 109, Issue 4, pages 787–795, 15 February 2007
How to Cite
Huschka, M. M., Mandrekar, S. J., Schaefer, P. L., Jett, J. R. and Sloan, J. A. (2007), A pooled analysis of quality of life measures and adverse events data in North Central Cancer Treatment Group lung cancer clinical trials. Cancer, 109: 787–795. doi: 10.1002/cncr.22444
- Issue published online: 2 FEB 2007
- Manuscript Revised: 9 NOV 2006
- Manuscript Accepted: 9 NOV 2006
- Manuscript Received: 7 JUL 2006
- pooled analysis;
- quality of life;
- adverse events;
- lung cancer
In this pooled analysis, the authors examined correlations between single-item and multiple-item quality of life (QOL) measures and assessed the agreement between clinically significant changes in QOL and patient-reported adverse events (AE).
Data from 6 lung cancer clinical trials that involved 358 patients were pooled. All trials incorporated the Uniscale and 1 of 3 multiple-item assessments: the Functional Assessment of Cancer Therapy-Lung, the Lung Cancer Symptom Scale, or the Symptom Distress Scale. Spearman rank correlations and a Bland-Altman approach were used to assess agreement. Time-to-event analysis was performed using the Kaplan-Meier method.
Correlations between the Uniscale and multiple-item assessments were substantial (correlation coefficient = 0.49–0.66). At least one 10-point decline was reported in the Uniscale and multiple-item assessments by 58% of patients and 39% of patients, respectively. At least 1 severe AE (grade ≥3) was reported in 35% of patients postbaseline. The percent agreement between experiencing a severe AE and a decline in QOL was 48% and 59% for the Uniscale and multiple-item assessments, respectively. The median time to the first 10-point decline in QOL for the Uniscale and multiple-item assessments was 67 days and 142 days, respectively, and the median time to the first occurrence of a severe AE was 304 days.
Information gleaned from the single-item Uniscale assessment was comparable to that gleaned from multiple-item global measures. There was moderate agreement between QOL and AE. A 10-point decline in QOL occurred earlier than Common Toxicity Criteria AE reporting. This suggests the need for inclusion of a QOL instrument in lung cancer clinical trials. Cancer 2007. © 2007 American Cancer Society.
Quality of life (QOL) can be assessed by using a multitude of instruments.1, 2 Brief instruments seem desirable, because patient burden contributes to missing data; however, there is a trade-off.3, 4 The merits of both the single-item and the multiple-item indices have been discussed previously.3 Much work has been done recently questioning the value of QOL data above and beyond the collection of adverse events (AE) data.5–7
A single-item index is 1 question that pertains to a singular, often global concept. An example of a single-item index is the Spitzer Uniscale.8 The Uniscale asks patients to rate their own QOL during the past week by placing an X within the bar and provides an easy guideline for the lowest and the highest quality. This bar in effect represents a scale from 0 to 100, with 0 denoting the worst possible QOL and 100 denoting the best possible QOL. The Spitzer Uniscale is analogous to other linear analogue self-assessment items that have been deemed valid and reliable in a wide variety of applications.9–15
Multiple-item indices are designed in different ways. Some multiple-item indices contain several stand-alone, single-item indices that also can be summated to achieve an overall assessment score. The Lung Cancer Symptom Scale (LCSS), for example, includes 9 distinct questions that pertain to lung cancer.16, 17 A total LCSS score can be obtained by summing the 9 individual scores. The Symptom Distress Scale (SDS) also is a multiple-item index that is a compilation of 12 separate Uniscale questions.18, 19 Other multiple-item indices have 1 or more constructs, each of which is composed of several questions: for example, the Functional Assessment of Cancer Therapy for Lung Cancer (FACT-L).20 The FACT-L includes 4 separate constructs of well-being, namely, FACT-L Physical Well-Being, Social/Family Well-Being (SWB), Emotional Well-Being (EWB), and Functional Well-Being, as well as a fifth construct, Additional Concerns, that deals solely with tumor-related symptoms. The questions in each construct can be summated to obtain a single score.
AEs are recorded routinely in most oncology clinical trials by using a standardized taxonomy, such as the Common Toxicity Criteria (CTC) for AE developed by the National Cancer Institute.21 Despite standardization, it is known that AEs are collected only if the clinician actively asks the patient about a particular AE or if the patient spontaneously volunteers information that an AE has occurred. Furthermore, the classification system records AEs only on an ordinal scale of none, mild, moderate, and severe or indicates the presence or absence of an AE. The degree of distress that an AE imparts to a patient is not collected routinely. The attribution of an AE to the treatment agent is another complicating factor that adds further measurement error to the estimation of the degree and amount of toxicity truly experienced by patients in oncology clinical trials.22 The clinical significance of these data, hence, is in question, especially when considering the perspective of patients who are undergoing toxic anticancer treatments.
The objective of this pooled analysis across a series of 6 North Central Cancer Treatment Group (NCCTG) lung cancer clinical trials was to explore the relations between single-item and multiple-item QOL measures as well as their relation to AEs. Specifically, we attempted to answer the following questions: 1) How strongly related are the results of a single-item assessment and a multiple-item, summated scale? 2) Are clinically significant changes detected more readily by either approach? 3) How do these 2 measures relate to AE data? In summary, we explored the relative differences in information gleaned from a single-item QOL measure and a multiple-item, summated scale in addition to their correlation with AEs.
MATERIALS AND METHODS
Six NCCTG lung cancer clinical trials were selected for this pooled analysis based on the criterion that the study included the Spitzer Uniscale and 1 multiple-item assessment (LCSS, FACT-L, or SDS). Table 1 shows that these studies spanned a broad spectrum of lung cancers (nonsmall cell lung cancer [NSCLC], small-cell lung cancer [SCLC], and pleural mesothelioma) and treatment phases (pilot, Phase II, and Phase III). The population for the pooled analysis included all eligible patients who completed both the Uniscale and a multiple-item assessment at baseline and completed ≥1 assessment postbaseline. Any reference made to the Uniscale below refers to the Spitzer Uniscale. Note that, in recent studies, the Uniscale has been modified from placing an X in a box to circling a number from 0 to 10 without loss of validity or reliability.23
|Study no.||Description||Sample size (No. Evaluable)||Assessments||Assessment schedule|
|95-20-53||Pilot study of high-dose thoracic RT with concomitant cisplatin/etoposide in limited-stage SCLC||76 (52)||Uniscale, LCSS||Baseline, prior to irradiation, prior to last cycle and at 3-mo, 1-y, and 2-y follow-up visits|
|95-24-52||Phase II trial of edatrexate combined with vinblastine, doxorubicin, cisplatin, and filgrastim in patients with advanced NSCLC||34 (26)||Uniscale, FACT-L v3||Baseline and prior to each treatment cycle|
|97-24-51||Phase III randomized, double-blind study of CAI and placebo in advanced NSCLC||177 (122)||Uniscale, FACT-L v4||Baseline and monthly during course of treatment|
|98-24-52||Randomized Phase II study of docetaxel and gemcitabine for stage IIIB/IV NSCLC||99 (81)||Uniscale, LCSS||Baseline and prior to each treatment cycle|
|N0021||Phase II study of gemcitabine and epirubicin for the treatment of mesothelioma||68 (46)||Uniscale, SDS||Baseline, at each evaluation, and at 3-mo and 1-y follow-up visits|
|N0022||Oral vinorelbine for the treatment of metastatic NSCLC in patients aged ≥65 y: Phase II trial of efficacy, toxicity, and patients' perceived preference for oral therapy||58 (31)||Uniscale, LCSS||Baseline and immediately after completion of second cycle of chemotherapy|
The multiple-item indices were summated in the usual manner and then converted to a percentage of theoretical range from 0 to 100 to facilitate direct comparisons in which 100 was interpreted as the best possible scenario. A clinically significant decline (CSD) in QOL was defined as a 10-point decline from baseline on the scale from 0 to 100.23–26 The Uniscale and multiple-item indices were summarized by using descriptive statistics in the overall cohort as well as within 3 subgroups divided in terms of assessment completion: 1) LCSS and Uniscale, 2) FACT-L and Uniscale, and 3) SDS and Uniscale.
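The rescaling and the 10-point decline rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' SAS code: the function names, the `higher_is_better` flag, and the 0 to 40 score range in the usage note are hypothetical and were introduced here only for the example.

```python
def percent_of_theoretical_range(raw, min_possible, max_possible,
                                 higher_is_better=True):
    """Rescale a summated score to 0-100, where 100 is the best possible QOL."""
    if max_possible <= min_possible:
        raise ValueError("max_possible must exceed min_possible")
    pct = 100.0 * (raw - min_possible) / (max_possible - min_possible)
    # Assumption: instruments on which a higher raw score means worse QOL
    # are reversed so that 100 is always the best possible scenario.
    return pct if higher_is_better else 100.0 - pct


def has_clinically_significant_decline(baseline, postbaseline_scores,
                                       threshold=10.0):
    """True if any postbaseline score is at least `threshold` points
    below the baseline score on the 0-100 scale."""
    return any(baseline - score >= threshold for score in postbaseline_scores)
```

For a hypothetical instrument with a theoretical range of 0 to 40, a raw score of 30 maps to 75 on the 0 to 100 scale, and a patient with a baseline of 80 and later scores of 75 and 70 would be counted as having a CSD.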
Spearman correlations and the Bland-Altman approach were used to assess the strength of linear associations and agreement between the Uniscale and multiple-item assessments.27–29 The criteria published by Cohen were used for interpreting the size of a correlation; specifically, correlations from 0.10 to 0.29 were considered low, correlations from 0.30 to 0.49 were considered moderate, and correlations ≥0.50 were considered high.30
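As an illustration of the correlation step, the sketch below computes a Spearman rank correlation as the Pearson correlation of average ranks and then labels its size with the cutoffs quoted above. It uses only the Python standard library. The function names, the "negligible" label for values below 0.10, and the choice to treat a value of exactly 0.50 as high are assumptions made here, not details taken from the article.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


def cohen_label(rho):
    """Label the size of a correlation using the cutoffs quoted in the text."""
    size = abs(rho)
    if size >= 0.50:
        return "high"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "low"
    return "negligible"  # assumption: the text gives no label below 0.10
```

Applied to paired Uniscale and multiple-item scores, `cohen_label(spearman_rho(uniscale, multi_item))` would return one of the labels above.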
Graphs using intrapatient scores were created for visual representation. A Bland-Altman plot was used to assess the differences in 2 scales across the entire range of possible scores. When 2 indices are correlated highly with little variability, a 1-to-1 linear relation appears. When more variability is present between 2 indices, a more scattered arrangement of scores appears. A bug-plot was used to show side-by-side comparisons over time for 2 instruments, with the advantage of presenting all patient data in 1 simple display.
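The quantities behind a Bland-Altman plot can be sketched as follows: each patient contributes one point whose x-value is the average of the 2 scores and whose y-value is their difference. The limits of agreement (bias ± 1.96 standard deviations of the differences) are the conventional summary for this kind of plot and are an addition here; the article does not state that they were computed. The function name and the returned dictionary keys are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return plot points and summary statistics for a Bland-Altman comparison."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired score lists of length at least 2")
    # x-axis: average of the two instruments for each patient
    averages = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    # y-axis: difference between the two instruments (a minus b)
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)  # conventional 95% limits of agreement
    return {
        "averages": averages,
        "differences": differences,
        "bias": bias,
        "lower_limit": bias - spread,
        "upper_limit": bias + spread,
    }
```

With made-up scores such as `bland_altman([90, 50, 20, 80], [80, 55, 35, 75])`, positive differences at high averages and negative differences at low averages would correspond to a pattern in which the first scale is more extreme at both ends of the range.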
AE monitoring and reporting were based on the CTC (version 2.0). Because attribution was not captured routinely in all trials, for the current analysis, we used all AEs regardless of attribution. A severe AE was defined as an AE with a grade of 3, 4, or 5. The AEs that were selected for the current analysis were those reported postbaseline by at least 2% of patients according to the CTC that also could be assessed through a QOL instrument. They included alopecia, anorexia, constipation, diarrhea, dyspnea, fatigue, nausea, neurosensory, and vomiting. Although anemia, neutropenia, leukopenia, and thrombocytopenia met the defined 2% selection criterion, these AEs were excluded, because they could not be identified explicitly through a patient self-report. A CSD in AE was defined as a grade 0, 1, or 2 AE reported at baseline that subsequently changed to a grade 3, 4, or 5 AE at least once postbaseline. (The converse was examined and yielded only 1 clinically significant improvement from baseline; thus, these results are not reported.)
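The 2 AE definitions above translate directly into simple rules on CTC grades. The sketch below is a minimal illustration; the function and constant names are hypothetical, and grades are assumed to be integers from 0 to 5 for a single AE category in a single patient.

```python
SEVERE_GRADE = 3  # grades 3, 4, and 5 are treated as severe in the text


def had_severe_ae(postbaseline_grades):
    """True if any postbaseline grade is 3 or higher."""
    return any(grade >= SEVERE_GRADE for grade in postbaseline_grades)


def has_csd_in_ae(baseline_grade, postbaseline_grades):
    """True if a baseline grade of 0-2 reaches grade 3-5 at least once
    postbaseline."""
    return baseline_grade < SEVERE_GRADE and had_severe_ae(postbaseline_grades)
```

A patient with baseline fatigue of grade 1 and postbaseline grades of 2 and 3 would count as having a CSD in AE, whereas a patient who already had grade 3 fatigue at baseline would not.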
For comparisons between the QOL indices and AEs, the percent agreement was calculated. A time to first event analysis was performed by using the Kaplan-Meier method for both QOL indices and AE data. The time to first event analysis of individual QOL questions on fatigue, anorexia, dyspnea, and nausea was limited to the LCSS and SDS. All analyses were carried out using SAS (version 8), a statistical programming tool, on the data available for the respective comparisons.
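The article does not spell out the formula for percent agreement. One common reading, assumed here, is the share of patients for whom 2 yes/no indicators match (both present or both absent), for example "had a severe AE" and "had a CSD in QOL". The function name below is hypothetical.

```python
def percent_agreement(flags_a, flags_b):
    """Percentage of patients whose two binary indicators agree
    (both True or both False)."""
    if len(flags_a) != len(flags_b) or not flags_a:
        raise ValueError("need two non-empty indicator lists of equal length")
    matches = sum(1 for a, b in zip(flags_a, flags_b) if bool(a) == bool(b))
    return 100.0 * matches / len(flags_a)
```

Under this definition, 4 patients with indicators `[True, True, False, False]` and `[True, False, False, True]` agree in 2 of 4 cases, which gives 50%. Note that this simple measure does not correct for agreement expected by chance.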
RESULTS
Of all 512 eligible patients, 468 patients had completed both the Uniscale and a multiple-item assessment at baseline. In total, 386 patients completed ≥1 Uniscale and/or multiple-item assessment postbaseline. Therefore, the current pooled analysis included 358 eligible patients who completed both the Uniscale and a multiple-item assessment at baseline and at least once postbaseline.
Table 2 shows the overall and individual patient characteristics. The median patient age was 66 years, 91% of patients were Caucasian, and 61% of patients were men (with some variability across individual trials). Trial eligibility criteria required an Eastern Cooperative Oncology Group performance status ≤2. The majority of patients (90%) had died at the time of this analysis.
|Characteristic||No. of patients (%)|
|95-20-53, n = 52||95-24-52, n = 26||97-24-51, n = 122||98-24-52, n = 81||N0021, n = 46||N0022, n = 31||Total, N = 358|
|Men||29 (56)||11 (42)||67 (55)||52 (64)||40 (87)||20 (65)||219 (61)|
|White||50 (96)||24 (92)||111 (91)||76 (94)||37 (80)||28 (90)||326 (91)|
Baseline QOL scores are summarized in Table 3. The overall median baseline scores across the 6 trials were 82 for the Uniscale and 77 for the multiple-item assessment. Overall, the Uniscale showed a greater range of scores than the multiple-item assessment (0–100 vs 28–100). The FACT-L and the LCSS, with their corresponding Uniscale assessments, demonstrated comparable mean scores (76 vs 75 and 73 vs 76, respectively). The mean SDS percentage of theoretical range score and the mean Uniscale score differed by approximately 10 points (77 vs 66, respectively), suggesting that the 2 assessments may differ at baseline. Correlations between the Uniscale and the multiple-item assessments (FACT-L, LCSS, and SDS) were moderate (correlation coefficient [ρ] = 0.48, ρ = 0.42, and ρ = 0.40, respectively). The FACT-L SWB and EWB subscales had a low correlation with the Uniscale at baseline (ρ = 0.16 and ρ = 0.19, respectively).
|Assessment||FACT-L and Uniscale, n = 148||LCSS and Uniscale, n = 164||SDS and Uniscale, n = 46||Total, N = 358|
|Uniscale, mean (SD)||75 (20)||76 (24)||66 (23)||74 (23)|
|Multiple-item assessment, mean (SD)||76 (12)||73 (15)||77 (14)||75 (14)|
The median number of assessments completed for both the Uniscale and the multiple-item assessments was 2 (range, 1–24 assessments). The overall median QOL scores were 75 for the Uniscale and 76 for the multiple-item assessments. There were differences in the range of QOL scores among patients who completed both the Uniscale and the FACT-L (0–97 vs 31–99, respectively) and patients who completed both the Uniscale and the SDS (4–97 vs 39–96, respectively). Correlations between the Uniscale and the FACT-L, the LCSS, and the SDS were 0.66, 0.57, and 0.49, respectively.
Change in QOL From Baseline
The Uniscale was more likely to detect a CSD in QOL from baseline than the multiple-item assessments (58% vs 39%). Agreement between the Uniscale and the FACT-L, LCSS, and SDS in detecting a CSD was 56%, 59%, and 71%, respectively (Table 4). The Uniscale and multiple-item assessments were similar in detecting clinically significant increases (27% vs 26%, respectively).
|Assessment||No. of patients (%)|
|FACT-L and Uniscale, n = 120||LCSS and Uniscale, n = 152||SDS and Uniscale, n = 45||Total, N = 317|
|Uniscale||73 (61)||91 (60)||20 (44)||184 (58)|
|Multiple item||46 (38)||66 (43)||13 (29)||125 (39)|
The Bland-Altman plot displays the difference of the 2 indices, Uniscale and LCSS, plotted against the average of the scores from the 2 indices (Fig. 1). This plot reiterates the greater variability in the Uniscale scores. In particular, when QOL is high, the Uniscale score is higher than the LCSS score; and, when QOL is low, the Uniscale score is lower than the LCSS score. Similar trends in the Bland-Altman plots were noted for the Uniscale and FACT-L assessments and for the Uniscale and SDS assessments (data not shown).
Figure 2 shows individual patient scores on the Uniscale and the LCSS over time, with the 2 scales displayed in a mirrored fashion. The nonsymmetrical appearance of the figure demonstrates the greater variability in the Uniscale scores relative to the LCSS scores. Similar patterns were identified for comparisons of the Uniscale with the FACT-L and the SDS (data not shown).
AE Categories and QOL
Ninety-three percent of the patients experienced ≥1 of the 9 selected AE categories postbaseline. However, only 35% of this population reportedly had ≥1 severe AE. The percent agreement between experiencing a severe AE and CSD in Uniscale scores was 48% (range, 46–51%). For severe AEs and CSD in multiple-item assessment scores, the overall agreement was 59% (range, 53–64%; for more details, see Table 5).
|Assessment||No. of patients (%)|
|FACT-L and Uniscale||LCSS and Uniscale||SDS and Uniscale||Total|
|Severe AE||26 (21)||74 (48)||17 (37)||117 (36)|
|CSD in Uniscale||74 (61)||92 (59)||20 (44)||186 (58)|
|Severe AE||30 (21)||76 (49)||17 (38)||123 (36)|
|CSD in multiple items||52 (37)||67 (43)||13 (29)||132 (39)|
The agreement between the CSD in AEs and the CSD in QOL assessment scores is shown in Table 6. There seemed to be greater agreement between the multiple-item assessments and the individual AEs relative to the single-item assessments. The percent agreements ranged from 37% to 52% for the Uniscale and from 44% to 74% for the multiple-item assessments. The AE that showed the least agreement with both the Uniscale and the multiple-item assessments was anorexia (37% and 44%, respectively). The AE that had the greatest agreement with the QOL indices was constipation (52% and 74%, respectively).
|Assessment||No. of patients (%)|
|CSD in AE||2 (1)||8 (9)||6 (9)||13 (18)||43 (28)||42 (19)||34 (16)||9 (5)||23 (16)|
|CSD Uniscale||75 (54)||61 (65)||37 (54)||44 (60)||90 (58)||139 (62)||118 (57)||116 (61)||72 (51)|
|CSD in AE||2 (1)||9 (9)||6 (8)||15 (20)||43 (27)||45 (19)||33 (15)||11 (6)||23 (15)|
|CSD in multiple item||59 (41)||50 (51)||17 (24)||31 (40)||67 (42)||96 (40)||84 (39)||73 (36)||60 (40)|
Time to Event
The median times to the first CSD in QOL for the Uniscale and multiple-item assessments were 67 days and 142 days, respectively, and the median time to the first occurrence of a severe AE was 304 days (Fig. 3). A similar analysis of the time to first CSD on the LCSS for anorexia, dyspnea, and fatigue yielded median times of 140 days, 142 days, and 81 days, respectively. The median times to first CSD on the SDS for anorexia, dyspnea, fatigue, and nausea were 84 days, 199 days, 52 days, and 87 days, respectively. Kaplan-Meier curves for the time to the first occurrence of severe fatigue, as measured by the CTC and LCSS and by the CTC and SDS, are shown in Figure 4. The Kaplan-Meier curves for anorexia, dyspnea, and nausea were similar (data not shown). These results suggest that the QOL indices detected a CSD earlier than the CTC AE reporting.
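For readers unfamiliar with how a median time to first event is read from a Kaplan-Meier curve, the following sketch shows the product-limit calculation on made-up data. It is not the authors' SAS analysis: the function names and the example follow-up times are hypothetical, and patients who never had the event are treated as censored at their last assessment, which is an assumption about how such data would typically be handled.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times:  follow-up in days for each patient
    events: True if the event (e.g., first CSD) occurred at that time,
            False if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_total = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_total += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_total  # events and censored patients leave the risk set
    return curve


def median_time(curve):
    """First time at which estimated survival is 0.5 or lower, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

For example, with follow-up times of 30, 60, 60, 90, 120, and 150 days and event indicators `True, True, False, True, False, True`, the estimated event-free proportion first falls to 0.5 or below at day 90, so the median time to first event is 90 days.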
DISCUSSION
The current pooled analysis of NCCTG lung cancer trials, which involved a variety of lung cancer types, examined the relation between single-item and multiple-item QOL measures and assessed the agreement between CSDs in QOL and patient-reported AEs. If the single-item and multiple-item assessments are strongly related, then it may not be necessary to have a patient complete a longer questionnaire when a single item would suffice. However, there is a trade-off between the additional details and the patient burden that needs careful consideration.
Baseline QOL data suggest that, although the mean and median scores are similar across the 4 assessments, the Uniscale displays greater variability in the range of scores, particularly on the lower end of the scale. Longitudinal comparisons provide further evidence to support a recurring theme observed in such head-to-head comparisons that the Uniscale demonstrates greater variability in the overall range of scores than the multiple-item indices.2, 14, 23 More noteworthy in terms of clinical relevance, the pooled analysis suggests that the Uniscale detects a CSD better over time. This probably occurs in part because the Uniscale demonstrates greater variability in score range. Whether this greater variability is caused by increased sensitivity or by measurement error (ie, noise) remains a subject of debate. It is noteworthy, but probably not surprising, that there is greater agreement between multiple-item and CTC AE assessments compared with single-item assessments. This potentially is explained by the finding that multiple-item assessments generally are symptom-specific and are aligned more closely to the nature of the CTC, whereas the Uniscale is a global measure of QOL.
Results from many NCCTG investigations bring us to the conclusion that CTC AE data clearly are measuring something different than what is measured by QOL assessments. In a recent NCCTG study, it was demonstrated that relying on AE data to detect peripheral neuropathy resulted in missing the incidence in >50% of patients.31 Thus, the clinical significance of AEs is brought into question. Despite the routine nature of the CTC AE reporting system, under-reporting is an ongoing problem, as evidenced by the median time to first occurrence of a severe AE in this analysis. There also is some indication that QOL assessments detect occurrences of severe AE earlier than those reported by CTC. Even among those who had an AE report, single-item QOL assessments detected a patient-perceived problem in peripheral neuropathy >6 weeks earlier than CTC criteria.31
QOL data have the advantage of including the patient's perspective on AE assessments. They also ensure that every patient is assessed for every potential treatment- or disease-related complication. It is important that clinicians respond to a patient-reported, clinically significant change in QOL and not rely entirely on the CTC AE categories. However, to be fair, the CTC AE reporting system was not designed to report patient-perceived problems. Rather, it was intended to identify incidences of severe AE observed by clinicians that would indicate a need for immediate medical intervention.21 Hence, it is not surprising that QOL assessments that are intended to detect problems from the patient's perspective would be more sensitive. The CTC criteria serve a vital purpose in detecting extreme AE incidences that may cause clinical trials to be shut down. The fact that a patient does not experience a physician-based, AE-reportable incident, however, is not the same thing as detecting a problem that the patient perceives as needing intervention. QOL assessments can serve a supportive and supplementary role to ensure that any problems experienced by patients on clinical trials are detected and treated as soon as possible.
A few limitations of this pooled analysis are worth noting. This analysis included only 1 disease site (ie, lung), and the majority of patients (91%) reported their race as Caucasian; therefore, the current results may not be representative of other disease sites and patient populations. Also, the 6 studies were selected from a large database of NCCTG lung cancer clinical trials based on QOL instrumentation and, thus, have the potential for biased results. Finally, because of the retrospective nature of this analysis, the timing of assessments and the collection of AE data were not standardized across trials. Prospective assessment and validation in large, Phase III trials are needed to overcome these limitations and to establish routine clinical use.
More research is needed into the real-time feedback of clinically meaningful changes in patient-reported QOL and a delineation of the clinical pathways that need to be followed when such indications manifest. The use of modern technologic advances, such as personal data-acquisition devices and interactive voice-response systems, can make the real-time collection of QOL assessment data feasible and practical.32 The evidence to date, however, indicates that paper and pen assessment can be just as accurate as computer-based assessments.33 Further work needs to be done so that we can understand and optimize the application of such technology.
The current pooled analysis confirmed the hypothesis that the single-item assessment and the multiple-item indices indeed are related, but not to the point of pseudoredundancy. These results were consistent with the literature, indicating that a single-item index will capture a clinically significant change in QOL more readily than a multiple-item index. Our analysis also showed that measuring QOL differs quantifiably from gathering AE data. However, the primary motivating factor behind using a single-item index, a multiple-item index, or both always should be the research question under investigation.
REFERENCES
- 22. Should attribution be considered when interpreting adverse event data: a North Central Cancer Treatment Group (NCCTG) evaluation of a Phase III placebo controlled trial [abstract]. J Clin Oncol. 2006; 24(18S): 6006.
- 24. Detecting worms, ducks and elephants: a simple approach for defining clinically relevant effects in quality-of-life measures. J Cancer Integrative Med. 2003; 1: 41–47.
- 29. What is the value added of patient reported outcomes relative to physician rated symptom assessments? [abstract]. J Clin Oncol. 2006; 24(18S): 8580.
- 30. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
- 31. A comparison of simple single-item measures and the common toxicity criteria in detecting the onset of oxaliplatin-induced peripheral neuropathy in patients with colorectal cancer [abstract]. J Clin Oncol. 2005; 23(16S): 8087.
- 34. A Phase II study of high dose twice-daily thoracic radiation therapy with concomitant cisplatin/etoposide in limited stage small cell lung cancer [abstract]. Proc Am Soc Clin Oncol. 2001; 20. Abstract 1260.
- 36. A Phase III randomized placebo controlled NCCTG trial of carboxyaminoimidazole (CAI) in patients with advanced non-small cell lung cancer [abstract]. J Clin Oncol. 2005; 23(16S): 7054.
- 37. Gemcitabine and epirubicin in patients with malignant pleural mesothelioma [abstract]. J Clin Oncol. 2005; 23(16S): 7264.