In this pooled analysis, the authors examined correlations between single-item and multiple-item quality of life (QOL) measures and assessed the agreement between clinically significant changes in QOL and patient-reported adverse events (AE).
Data from 6 lung cancer clinical trials that involved 358 patients were pooled. All trials incorporated the Uniscale and 1 of 3 multiple-item assessments: the Functional Assessment for Cancer Therapy-Lung, the Lung Cancer Symptom Scale, or the Symptom Distress Scale. Spearman rank correlations and a Bland-Altman approach were used to assess agreement. Time-to-event analysis was performed using the Kaplan-Meier method.
Correlations between the Uniscale and multiple-item assessments were substantial (correlation coefficient = 0.49–0.66). At least one 10-point decline was reported in the Uniscale and multiple-item assessments by 58% of patients and 39% of patients, respectively. At least 1 severe AE (grade ≥3) was reported in 35% of patients postbaseline. The percent agreement between experiencing a severe AE and a decline in QOL was 48% and 59% for the Uniscale and multiple-item assessments, respectively. The median times to the first 10-point decline in QOL for the Uniscale and multiple-item assessments were 67 days and 142 days, respectively, and the median time to the first occurrence of a severe AE was 304 days.
Quality of life (QOL) can be assessed by using a multitude of instruments.1, 2 Brief instruments seem desirable, because patient burden contributes to missing data; however, there is a trade-off.3, 4 The merits of both the single-item and the multiple-item indices have been discussed previously.3 Much work has been done recently questioning the value of QOL data above and beyond the collection of adverse events (AE) data.5–7
A single-item index is 1 question that pertains to a singular, often global concept. An example of a single-item index is the Spitzer Uniscale.8 The Uniscale asks patients to rate their own QOL during the past week by placing an X within the bar and provides an easy guideline for the lowest and the highest quality. This bar in effect represents a scale from 0 to 100, with 0 denoting the worst possible QOL and 100 denoting the best possible QOL. The Spitzer Uniscale is analogous to other linear analogue self-assessment items that have been deemed valid and reliable in a wide variety of applications.9–15
Multiple-item indices are designed in different ways. Some multiple-item indices contain several stand-alone, single-item indices that also can be summated to achieve an overall assessment score. The Lung Cancer Symptom Scale (LCSS), for example, includes 9 distinct questions that pertain to lung cancer.16, 17 A total LCSS score can be obtained by summing the 9 individual scores. The Symptom Distress Scale (SDS) also is a multiple-item index that is a compilation of 12 separate Uniscale questions.18, 19 Other multiple-item indices have ≥1 constructs where each construct is comprised of several questions: for example, the Functional Assessment of Cancer Therapy for Lung Cancer (FACT-L).20 The FACT-L includes 4 separate constructs of well-being, namely, FACT-L Physical Well-Being, Social/Family Well-Being (SWB), Emotional Well-Being (EWB), and Functional Well-Being, as well as a fifth construct, Additional Concerns, that deals solely with tumor-related symptoms. The questions in each construct can be summated to obtain a single score.
AEs are recorded routinely in most oncology clinical trials by using a standardized taxonomy, such as the Common Toxicity Criteria (CTC) for AE developed by the National Cancer Institute.21 Despite standardization, it is known that AEs are collected only if the clinician actively asks the patient about a particular AE or if the patient spontaneously volunteers information that an AE has occurred. Furthermore, the classification system only records AEs on an ordinal scale of none, mild, moderate, and severe or indicates the presence or absence of an AE. The degree of distress that an AE imparts to a patient is not collected routinely. The attribution of an AE to the treatment agent is another complicating factor that adds further measurement error to the estimation of the degree and amount of toxicity truly experienced by patients in oncology clinical trials.22 Hence, the clinical significance of these data is in question, especially from the perspective of patients who are undergoing toxic anticancer treatments.
The objective of this pooled analysis across a series of 6 North Central Cancer Treatment Group (NCCTG) lung cancer clinical trials was to explore the relations between single-item and multiple-item QOL measures as well as their relation to AEs. Specifically, we attempted to answer the following questions: 1) How strongly related are the results of a single-item assessment and a multiple-item, summated scale? 2) Are clinically significant changes detected more readily by either approach? 3) How do these 2 measures relate to AE data? In summary, we explored the relative differences in information gleaned from a single-item QOL measure and a multiple-item, summated scale in addition to their correlation with AEs.
MATERIALS AND METHODS
Six NCCTG lung cancer clinical trials were selected for this pooled analysis based on the criterion that the study included the Spitzer Uniscale and 1 multiple-item assessment (LCSS, FACT-L, or SDS). Table 1 shows that these studies spanned a broad spectrum of lung cancers (nonsmall cell lung cancer [NSCLC], small cell lung cancer [SCLC], and pleural mesothelioma) and treatment phases (pilot, Phase II, and Phase III). The population for the pooled analysis included all eligible patients who completed both the Uniscale and a multiple-item assessment at baseline and completed ≥1 assessment postbaseline. Any reference made to the Uniscale below refers to the Spitzer Uniscale. Note that, in recent studies, the Uniscale has been modified from placing an X in a box to circling a number from 0 to 10 without loss of validity or reliability.23
Table 1. Description of North Central Cancer Treatment Group Lung Cancer Trials*
Description | Sample size (No. evaluable) | QOL instruments | Assessment schedule
Pilot study of high-dose thoracic RT with concomitant cisplatin/etoposide in limited-stage SCLC | | | Baseline, prior to irradiation, prior to last cycle, and at 3-mo, 1-y, and 2-y follow-up visits
Phase II trial of edatrexate combined with vinblastine, doxorubicin, cisplatin, and filgrastim in patients with advanced NSCLC | | Uniscale, FACT-L v3 | Baseline and prior to each treatment cycle
Phase III randomized, double-blind study of CAI and placebo in advanced NSCLC | | Uniscale, FACT-L v4 | Baseline and monthly during course of treatment
Randomized Phase II study of docetaxel and gemcitabine for stage IIIB/IV NSCLC | | | Baseline and prior to each treatment cycle
Phase II study of gemcitabine and epirubicin for the treatment of mesothelioma | | | Baseline, at each evaluation, and at 3-mo and 1-y follow-up visits
Oral vinorelbine for the treatment of metastatic NSCLC in patients aged ≥65 y: Phase II trial of efficacy, toxicity, and patients' perceived preference for oral therapy | | | Baseline and immediately after completion of second cycle of chemotherapy
RT indicates radiotherapy; SCLC, small cell lung cancer; LCSS, Lung Cancer Symptom Scale; NSCLC, nonsmall cell lung cancer; FACT-L v3, Functional Assessment of Cancer Therapy for Lung Cancer, version 3; CAI, carboxyaminoimidazole; FACT-L v4, Functional Assessment of Cancer Therapy for Lung Cancer, version 4; SDS, Symptom Distress Scale.
*See Bonner et al, 200134; Colon-Otero et al, 200135; Johnson et al, 200536; Okuno et al, 200537; and Kanard et al, 2004.38
The multiple-item indices were summated in the usual manner and then converted to a percentage of theoretical range from 0 to 100 to facilitate direct comparisons in which 100 was interpreted as the best possible scenario. A clinically significant decline (CSD) in QOL was defined as a 10-point decline from baseline on the scale from 0 to 100.23–26 The Uniscale and multiple-item indices were summarized by using descriptive statistics in the overall cohort as well as within 3 subgroups divided in terms of assessment completion: 1) LCSS and Uniscale, 2) FACT-L and Uniscale, and 3) SDS and Uniscale.
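The rescaling and the decline rule described above can be illustrated with a minimal Python sketch. The function names, the `higher_is_better` flag, and the example score range are illustrative assumptions and are not taken from the trials' scoring manuals; the original analysis was carried out in SAS.

```python
def percent_of_range(raw, lowest, highest, higher_is_better=True):
    """Rescale a summated raw score to 0-100, where 100 is the best outcome.

    `lowest` and `highest` are the theoretical limits of the instrument.
    Set higher_is_better=False for instruments on which a larger raw
    score means worse QOL, so that 100 still denotes the best scenario.
    """
    if highest <= lowest:
        raise ValueError("highest must exceed lowest")
    pct = 100.0 * (raw - lowest) / (highest - lowest)
    return pct if higher_is_better else 100.0 - pct


def has_csd(baseline, followups, threshold=10.0):
    """Return True if any postbaseline score is at least `threshold`
    points below the baseline score on the 0-100 scale."""
    return any(baseline - score >= threshold for score in followups)


# Hypothetical instrument with a theoretical raw range of 0 to 140.
baseline_pct = percent_of_range(84, 0, 140)
followup_pcts = [percent_of_range(r, 0, 140) for r in (80, 68)]
print(baseline_pct, has_csd(baseline_pct, followup_pcts))
```

Keeping the rescaling separate from the decline rule means the same 10-point threshold can be applied unchanged to the Uniscale, which is already on a scale from 0 to 100.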
Spearman correlations and the Bland-Altman approach were used to assess the strength of linear associations and agreement between the Uniscale and multiple-item assessments.27–29 The criteria published by Cohen were used for interpreting the size of a correlation; specifically, correlations from 0.10 to 0.29 were considered low, correlations from 0.30 to 0.49 were considered moderate, and correlations ≥0.50 were considered high.30
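For readers who want to see the mechanics, the following is a small standard-library sketch of a Spearman rank correlation (computed as the Pearson correlation of average ranks) together with a labeling helper based on the cutoffs above. The helper names are hypothetical; treating 0.50 itself as high and labeling values below 0.10 as "negligible" are assumptions made here for completeness.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cohen_label(rho):
    """Label a correlation by size (assumed cutoffs: 0.10, 0.30, 0.50)."""
    size = abs(rho)
    if size >= 0.50:
        return "high"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "low"
    return "negligible"  # label not used in the source; added for completeness
```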
Graphs using intrapatient scores were created for visual representation. A Bland-Altman plot was used to assess the differences between 2 scales across the entire range of possible scores. When 2 indices are correlated highly with little variability, a 1-to-1 linear relation appears. When more variability exists between 2 indices, a more scattered arrangement of scores appears. A bug-plot was used to show side-by-side comparisons over time for 2 instruments, with the advantage of presenting all patient data in 1 simple display.
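The quantities behind a Bland-Altman plot are simple to compute: for each patient, the difference between the 2 scores is paired with their average. The sketch below also reports the mean difference (bias) and the conventional 95% limits of agreement (bias ± 1.96 standard deviations); the source does not state whether such limits were calculated, so that part is an assumption, as are the function name and the example scores.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return the per-patient averages and differences (a minus b),
    the mean difference, and the conventional 95% limits of agreement."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    averages = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)  # sample standard deviation
    return {
        "averages": averages,        # x-axis of the plot
        "differences": differences,  # y-axis of the plot
        "bias": bias,
        "lower": bias - spread,
        "upper": bias + spread,
    }


# Made-up scores on the 0-100 scale for 4 patients.
single_item = [50, 60, 70, 80]
multi_item = [55, 60, 65, 70]
result = bland_altman(single_item, multi_item)
print(result["bias"], result["lower"], result["upper"])
```

A pattern in which the differences rise with the averages, as in this made-up example, is the kind of trend such a plot is meant to expose.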
AE monitoring and reporting were based on the CTC (version 2.0). Because attribution was not captured routinely in all trials, for the current analysis, we used all AEs regardless of attribution. A severe AE was defined as an AE with a grade of 3, 4, or 5. The AEs that were selected for the current analysis were those reported postbaseline by ≥2% of patients according to the CTC that also could be assessed through a QOL instrument. They included alopecia, anorexia, constipation, diarrhea, dyspnea, fatigue, nausea, neurosensory, and vomiting. Although anemia, neutropenia, leukopenia, and thrombocytopenia met the defined 2% selection criterion, these AEs were excluded, because they could not be identified explicitly through a patient self-report. A CSD in AE was defined as a grade 0, 1, or 2 AE reported at baseline that subsequently changed to a grade 3, 4, or 5 AE at least once postbaseline. (The converse was also examined, but only 1 clinically significant improvement from baseline was found; thus, these results are not reported.)
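The 2 AE definitions above translate directly into code. The following sketch assumes that each patient's grades for a given AE category are available as integers from 0 to 5; the function and constant names are hypothetical.

```python
SEVERE_GRADE = 3  # grades 3, 4, and 5 count as severe


def has_severe_ae(postbaseline_grades):
    """True if any postbaseline grade is 3 or higher."""
    return any(grade >= SEVERE_GRADE for grade in postbaseline_grades)


def has_csd_in_ae(baseline_grade, postbaseline_grades):
    """True if the AE was grade 0-2 at baseline and reached grade 3-5
    at least once postbaseline."""
    return baseline_grade < SEVERE_GRADE and has_severe_ae(postbaseline_grades)


# A patient with grade 1 fatigue at baseline who later reports grade 3.
print(has_csd_in_ae(1, [1, 2, 3, 2]))
```

Under this definition, a patient who already has a grade 3 AE at baseline cannot register a CSD in that AE, even though the patient still counts as having a severe AE postbaseline if the grade persists.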
For comparisons between the QOL indices and AEs, the percent agreement was calculated. A time to first event analysis was performed by using the Kaplan-Meier method for both QOL indices and AE data. The time to first event analysis of individual QOL questions on fatigue, anorexia, dyspnea, and nausea was limited to the LCSS and SDS. All analyses were carried out using SAS (version 8), a statistical programming tool, on the data available for the respective comparisons.
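The text does not give a formula for percent agreement. One common reading, assumed in the sketch below, is the percentage of patients for whom 2 yes/no indicators coincide (both present or both absent), for example a severe AE and a CSD in QOL. The function name and the example flags are illustrative only.

```python
def percent_agreement(flags_a, flags_b):
    """Percentage of patients whose two binary indicators match.

    Assumes agreement means both True or both False for the same patient.
    """
    if len(flags_a) != len(flags_b) or not flags_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for a, b in zip(flags_a, flags_b) if bool(a) == bool(b))
    return 100.0 * matches / len(flags_a)


# Made-up indicators for 4 patients.
severe_ae = [True, True, False, False]
csd_in_qol = [True, False, False, True]
print(percent_agreement(severe_ae, csd_in_qol))
```

Because this measure counts joint absences as agreement, it is not adjusted for agreement expected by chance.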
Of all 512 eligible patients, 468 patients had completed both the Uniscale and a multiple-item assessment at baseline. In total, 386 patients completed ≥1 Uniscale and/or multiple-item assessment postbaseline. Therefore, the current pooled analysis included 358 eligible patients who completed both the Uniscale and a multiple-item assessment at baseline and at least once postbaseline.
Table 2 shows the overall and individual patient characteristics. The median patient age was 66 years, 91% of patients were Caucasian, and 61% of patients were men (with some variability across individual trials). Trial eligibility criteria required an Eastern Cooperative Oncology Group performance status ≤2. The majority of patients (90%) had died at the time of this analysis.
Table 2. Patient Characteristics
No. of patients (%) by trial: 95-20-53 (n = 52); 95-24-51 (n = 26); 97-24-51 (n = 122); 98-24-52 (n = 81); N0021 (n = 46); N0022 (n = 31); Total (N = 358)
Baseline QOL scores are summarized in Table 3. The overall median baseline scores across the 6 trials were 82 for the Uniscale and 77 for the multiple-item assessment. Overall, the Uniscale showed a greater range of scores than the multiple-item assessment (0–100 vs 28–100). The FACT-L and the LCSS, with their corresponding Uniscale assessments, demonstrated comparable mean scores (76 vs 75 and 73 vs 76, respectively). The mean SDS percentage of theoretical range score and the mean Uniscale score differed by approximately 10 points (77 vs 66, respectively), suggesting that the 2 assessments may be different at baseline. Correlations between the Uniscale and the multiple-item assessments (FACT-L, LCSS, and SDS) were moderate (correlation coefficient [ρ] = 0.48, ρ = 0.42, and ρ = 0.40, respectively). The FACT-L SWB and EWB subscales had a low correlation with the Uniscale at baseline (ρ = 0.16 and ρ = 0.19, respectively).
Table 3. Summary of Baseline Quality of Life
Subgroups: FACT-L and Uniscale (n = 148); LCSS and Uniscale (n = 164); SDS and Uniscale (n = 46); Total (N = 358)
FACT-L indicates Functional Assessment of Cancer Therapy for Lung Cancer; LCSS, Lung Cancer Symptom Scale; SDS, Symptom Distress Scale; SD, standard deviation.
The median number of assessments completed for both the Uniscale and the multiple-item assessments was 2 (range, 1–24 assessments). The overall median QOL scores were 75 for the Uniscale and 76 for the multiple-item assessments. There were differences in the range of QOL scores among patients who completed both the Uniscale and the FACT-L (0–97 vs 31–99, respectively) and patients who completed both the Uniscale and the SDS (4–97 vs 39–96, respectively). Correlations between the Uniscale and the FACT-L, the LCSS, and the SDS were 0.66, 0.57, and 0.49, respectively.
Change in QOL From Baseline
The Uniscale was more likely to detect a CSD in QOL from baseline than the multiple-item assessments (58% vs 39%). Agreement between the Uniscale and the FACT-L, LCSS, and SDS in detecting a CSD was 56%, 59%, and 71%, respectively (Table 4). The Uniscale and multiple-item assessments were similar in detecting clinically significant increases (27% vs 26%, respectively).
Table 4. Clinically Significant Declines in Quality-of-Life Assessment Scores from Baseline
No. of patients (%) by subgroup: FACT-L and Uniscale (n = 120); LCSS and Uniscale (n = 152); SDS and Uniscale (n = 45); Total (N = 317)
FACT-L indicates Functional Assessment of Cancer Therapy for Lung Cancer; LCSS, Lung Cancer Symptom Scale; SDS, Symptom Distress Scale.
The Bland-Altman plot displays the difference of the 2 indices, Uniscale and LCSS, plotted against the average of the scores from the 2 indices (Fig. 1). This plot reiterates the greater variability in the Uniscale scores. In particular, when QOL is high, the Uniscale score is higher than the LCSS score; and, when QOL is low, the Uniscale score is lower than the LCSS score. Similar trends in the Bland-Altman plots were noted for the Uniscale and FACT-L assessments and for the Uniscale and SDS assessments (data not shown).
Figure 2 shows individual patient data for the Uniscale and the LCSS over time, with the 2 scales displayed in a mirrored fashion. The nonsymmetrical appearance of the figure demonstrates the greater variability in the Uniscale scores relative to the LCSS scores. Similar patterns were identified for comparisons of the Uniscale with the FACT-L and the SDS (data not shown).
AE Categories and QOL
Ninety-three percent of the patients experienced ≥1 of the 9 selected AE categories postbaseline. However, only 35% of this population reportedly had ≥1 severe AE. The percent agreement between experiencing a severe AE and CSD in Uniscale scores was 48% (range, 46–51%). For severe AEs and CSD in multiple-item assessment scores, the overall agreement was 59% (range, 53–64%; for more details, see Table 5).
Table 5. Severe (Grade ≥3) Adverse Events and Clinically Significant Declines in Quality of Life
No. of patients (%) by subgroup: FACT-L and Uniscale; LCSS and Uniscale; SDS and Uniscale
Rows: CSD in Uniscale; CSD in multiple items
FACT-L indicates Functional Assessment of Cancer Therapy for Lung Cancer; LCSS, Lung Cancer Symptom Scale; SDS, Symptom Distress Scale; AE, adverse event; CSD, clinically significant decline.
The agreement between the CSD in AEs and the CSD in QOL assessment scores is shown in Table 6. There seemed to be greater agreement between the multiple-item assessments and the individual AEs relative to the single-item assessments. The percent agreements ranged from 37% to 52% for the Uniscale and from 44% to 74% for the multiple-item assessments. The AE that showed the least agreement with both the Uniscale and the multiple-item assessments was anorexia (37% and 44%, respectively). The AE that had the greatest agreement with the QOL indices was constipation (52% and 74%, respectively).
Table 6. Clinically Significant Decline in Adverse Events and Overall Quality of Life
The median times to the first CSD in QOL for the Uniscale and multiple-item assessments were 67 days and 142 days, respectively, and the median time to the first occurrence of a severe AE was 304 days (Fig. 3). A similar analysis of the time to first CSD on the LCSS for anorexia, dyspnea, and fatigue yielded median times of 140 days, 142 days, and 81 days, respectively. The median times to first CSD on the SDS for anorexia, dyspnea, fatigue, and nausea were 84 days, 199 days, 52 days, and 87 days, respectively. Kaplan-Meier curves for the time to the first occurrence of severe fatigue, as measured by the CTC and LCSS and by the CTC and SDS, are shown in Figure 4. The Kaplan-Meier curves for anorexia, dyspnea, and nausea were similar (data not shown). These results suggest that the QOL indices detected a CSD earlier than the CTC AE reporting.
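Median times of this kind are read from a Kaplan-Meier curve as the earliest time at which the estimated event-free proportion falls to 50% or below. The sketch below is a minimal standard-library illustration of that calculation, not the SAS procedure used in the analysis; the function names and the example data are made up.

```python
def kaplan_meier(times, events):
    """Return [(time, event_free_probability)] at each distinct event time.

    times  : days from baseline to the first event or to censoring
    events : True if the event (e.g., first CSD) occurred, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        occurred = 0
        removed = 0
        # Group all patients sharing this time; patients censored at t
        # are still counted as at risk at t.
        while i < len(data) and data[i][0] == t:
            occurred += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if occurred:
            survival *= 1 - occurred / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_time(curve):
    """Earliest time with event-free probability <= 0.5, or None if the
    curve never reaches 50% (too few events)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Made-up data: 5 patients, 2 of them censored.
curve = kaplan_meier([30, 60, 60, 90, 120], [True, True, False, True, False])
print(median_time(curve))
```

Censored patients contribute to the number at risk until they leave follow-up, which is why a median can differ from the simple median of the observed event times.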
The current pooled analysis of NCCTG lung cancer trials, which involved a variety of lung cancer types, examined the relation between single-item and multiple-item QOL measures and assessed the agreement between CSDs in QOL and patient-reported AEs. If the single-item and multiple-item assessments are strongly related, then it may not be necessary to have a patient complete a longer questionnaire when a single item would suffice. However, there is a trade-off between the additional details and the patient burden that needs careful consideration.
Baseline QOL data suggest that, although the mean and median scores are similar across the 4 assessments, the Uniscale displays greater variability in the range of scores, particularly on the lower end of the scale. Longitudinal comparisons provide further evidence to support a recurring theme observed in such head-to-head comparisons that the Uniscale demonstrates greater variability in the overall range of scores than the multiple-item indices.2, 14, 23 More noteworthy in terms of clinical relevance, the pooled analysis suggests that the Uniscale detects a CSD better over time. This probably occurs in part because the Uniscale demonstrates greater variability in score range. Whether this greater variability is caused by increased sensitivity or measurement error (ie, noise) remains a subject of debate. It is noteworthy, but probably not surprising, that there is greater agreement between multiple-item and CTC AE assessments compared with single-item assessments. This potentially is explained by the finding that multiple-item assessments generally are symptom-specific and are aligned more closely to the nature of the CTC, whereas the Uniscale is a global measure of QOL.
Results from many NCCTG investigations bring us to the conclusion that CTC AE data clearly are measuring something different than what is measured by QOL assessments. In a recent NCCTG study, it was demonstrated that relying on AE data to detect peripheral neuropathy resulted in missing the incidence in >50% of patients.31 Thus, the clinical significance of AEs is brought into question. Despite the routine nature of the CTC AE reporting system, under-reporting is an ongoing problem, as evidenced by the median time to first occurrence of a severe AE in this analysis. There also is some indication that QOL assessments detect occurrences of severe AE earlier than those reported by CTC. Even among those who had an AE report, single-item QOL assessments detected a patient-perceived problem in peripheral neuropathy >6 weeks earlier than CTC criteria.31
QOL data have the advantage of including the patient's perspective on AE assessments. They also ensure that every patient is assessed for every potential treatment- or disease-related complication. It is important that clinicians respond to a patient-reported, clinically significant change in QOL and not rely entirely on the CTC AE categories. However, to be fair, the CTC AE reporting system was not designed to report patient-perceived problems. Rather, it was intended to identify incidences of severe AE observed by clinicians that would indicate a need for immediate medical intervention.21 Hence, it is not surprising that QOL assessments that are intended to detect problems from the patient's perspective would be more sensitive. The CTC criteria serve a vital purpose in detecting extreme AE incidences that may cause clinical trials to be shut down. The fact that a patient does not experience a physician-based, AE-reportable incident, however, is not the same thing as detecting a problem that the patient perceives as needing intervention. QOL assessments can serve a supportive and supplementary role to ensure that any problems experienced by patients on clinical trials are detected and treated as soon as possible.
A few limitations of this pooled analysis are worth noting. This analysis included only 1 disease site (ie, lung), and the majority of patients (91%) reported their race as Caucasian; therefore, the current results may not be representative of other disease sites and patient populations. Also, the 6 studies were selected from a large database of NCCTG lung cancer clinical trials based on QOL instrumentation and, thus, have the potential for biased results. Finally, because of the retrospective nature of this analysis, the timing of assessments and the collection of AE data were not standardized across trials. Prospective assessment and validation in large, Phase III trials are needed to overcome these limitations and to establish routine clinical use.
More research is needed into the real-time feedback of clinically meaningful changes in patient-reported QOL and a delineation of the clinical pathways that need to be followed when such indications manifest. The use of modern technologic advances, such as personal data-acquisition devices and interactive voice-response systems, can make the real-time collection of QOL assessment data feasible and practical.32 The evidence to date, however, indicates that paper-and-pen assessments can be just as accurate as computer-based assessments.33 Further work needs to be done so that we can understand and optimize the application of such technology.
The current pooled analysis confirmed the hypothesis that the single-item assessment and the multiple-item indices indeed are related, but not to the point of pseudoredundancy. These results were consistent with the literature, indicating that a single-item index will capture a clinically significant change in QOL more readily than a multiple-item index. Our analysis also showed that measuring QOL differs quantifiably from gathering AE data. However, the primary motivating factor behind using a single-item index, a multiple-item index, or both always should be the research question under investigation.