Evaluating Health-Related Quality of Life in Cancer Clinical Trials: The National Cancer Institute of Canada Clinical Trials Group Experience
David Osoba, QOL Consulting, 4939 Edendale Court, West Vancouver, BC, Canada V7W 3H7. E-mail: firstname.lastname@example.org
Introduction: The National Cancer Institute of Canada (NCIC) Clinical Trials Group (CTG) Quality of Life (QOL) Committee was initiated in 1986.
Purpose: The purpose of this review is to describe the evolution of the Committee's work and to highlight key developments such as the formulation of a policy regarding health-related quality-of-life (HRQOL) assessment, the provision of guidelines to ensure completion of HRQOL data within the protocol requirements, the rationale behind the choice of HRQOL instruments, the timing of assessments and the development of data analytic methods. These developments are illustrated with examples from CTG studies.
Recommendations: There is a lack of concordance between conventional toxicity data and HRQOL data and comparative studies designed to elucidate these differences are to be encouraged. Also, more studies are required to compare different analytic strategies and to determine how much missing data is acceptable, particularly in oncology studies where attrition is inevitable.
The Quality of Life Committee—Description
The experience of the National Cancer Institute of Canada (NCIC) Clinical Trials Group (CTG) in health-related quality-of-life (HRQOL) assessment began in 1986 with the formation of a Quality of Life (QOL) working group. The role of the working group was to provide educational opportunities to the CTG clinical investigators and other CTG personnel. A standing QOL Committee was established in 1987. Subsequently, HRQOL assessment in the CTG has evolved gradually in keeping with developing worldwide knowledge. In several ways, the CTG has been instrumental in introducing new information during this evolution. Some of these advances are detailed in this article, which gives particular emphasis to how our policy and procedures have led to useful methods and results, and how this experience may be useful for developing Food and Drug Administration guidance for labeling claims.
The Committee developed a policy pertaining to HRQOL assessment in the context of clinical trials. The policy stated that “there should be a statement about the anticipated impact on quality of life with every proposed Phase III clinical trial and whether QOL measures will be incorporated in the protocol.”
As a result of this policy, almost all of the 71 trials initiated by the CTG since its adoption in 1987 include HRQOL components. This provides the CTG with extensive experience to determine which procedures have worked best in given situations, and the results have provided new information. Between 1992 and 2006, the QOL Committee published 45 full articles, 69 abstracts, and 6 miscellaneous publications. The presence of a CTG policy, however, does not always lead to the inclusion of HRQOL assessments in intergroup trials if the lead group does not feel it necessary to include such assessment.
The QOL Committee provides writing guidelines for the inclusion of HRQOL assessment in clinical trials protocols [1,2]. HRQOL components should be a part of the main protocol and not an add-on. Inclusion of HRQOL components in the main protocol is desirable because adding on separate protocols creates the impression that HRQOL assessment is either not very important or that it is an afterthought.
The writing guidelines provide explicit instructions for the sections of the protocol dealing with the introduction, rationale and hypothesis, objectives of HRQOL assessment, eligibility criteria, study design, sample size, instrument description, instructions for administration of the instruments, timing of assessments, analysis, and wording of the consent form.
Instructions to Clinical Research Associates (CRAs)
At the outset, the QOL Committee developed instructions for the integration of HRQOL measurement into Phase III clinical trials . These instructions are used by CRAs, office data managers, and others who will be involved in collecting the data. They include the rationale for the HRQOL component, how to present HRQOL questionnaires to patients, how to collect them after completion, and how to transmit the completed questionnaires to the central CTG office. As a result, all personnel involved with an HRQOL component in a trial understand the need for measuring it and the procedure for standardized collection of the data. Because central office staff who conduct the main part of the clinical trial are the same individuals who receive the HRQOL data, there is integration of all components of the trial by one group of personnel. We believe that this is an advantage over having separate personnel dealing with the HRQOL component and with the primary component of the trial (i.e., the primary objective).
A particularly important instruction to the CRAs is that they should obtain a completed HRQOL questionnaire on a patient before calling the central office for randomization instructions. This is intended to make certain that questionnaire completion rates at baseline are high. Nevertheless, if a patient is not able to answer a questionnaire because of language barriers, but meets all the other inclusion criteria, the patient is not excluded from participation in the study.
Liaison with Disease Site Groups (DSGs)
The QOL Committee members provide liaison to DSGs and educational resources to members of the CTG. Members of the QOL Committee act as liaison members between the QOL Committee and specific DSGs so that the HRQOL component of a proposed trial can be incorporated into the protocol at an early stage of the planning process. In addition, feedback to the DSGs on the success of collection of the HRQOL data can be provided as the trial progresses. To ensure that DSGs are well acquainted with the need to collect QOL data and what this entails, the QOL liaison person is a member of the executive committee for each DSG.
Completion Rates of HRQOL Instruments (Compliance)
The above policy and procedural instructions are intended to produce high questionnaire completion rates so as to keep avoidable (random) missing data to a minimum. Completion rates are determined from three quantities: the number of questionnaires received with sufficient items answered to be deemed complete; the number of questionnaires received over the number of patients enrolled in the trial; and the number of questionnaires received over the number expected (the number of patients still on study and required to complete questionnaires according to the protocol). To date, almost all trials with HRQOL components have baseline completion rates higher than 97% [5–7]. On-study completion rates are lower, but the number completed over the number that can be expected to be completed is usually more than 80% (unpublished data) [5–7]. These completion rates are among the best in the world, and we believe that they are a consequence of the diligent efforts of our CRAs and central office personnel.
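The two ratios described above can be sketched in a few lines of Python. This is only an illustration: the class, the field names, and the counts in the example are hypothetical and do not come from any CTG trial.

```python
from dataclasses import dataclass


@dataclass
class AssessmentCounts:
    """Counts for one assessment time point (all field names are illustrative)."""
    enrolled: int   # patients enrolled in the trial
    expected: int   # patients still on study and required by protocol to complete
    received: int   # questionnaires received and deemed complete


def completion_rates(counts: AssessmentCounts) -> dict:
    """Return received/enrolled and received/expected as fractions."""
    if counts.enrolled <= 0 or counts.expected <= 0:
        raise ValueError("enrolled and expected must be positive")
    return {
        "received_over_enrolled": counts.received / counts.enrolled,
        "received_over_expected": counts.received / counts.expected,
    }


# Hypothetical on-study time point: 200 enrolled, 150 still expected, 126 received.
rates = completion_rates(AssessmentCounts(enrolled=200, expected=150, received=126))
print(rates)
```

Reporting both ratios side by side shows how much of the shortfall reflects attrition (patients no longer expected to complete questionnaires) and how much reflects questionnaires that were expected but not received.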
Choice of HRQOL Instruments
When the QOL Committee was formed, there were only a small number of HRQOL instruments from which to choose. Only two or three had been developed for use in patients with cancer. One of the members of the QOL Committee became a liaison member from the CTG to the European Organization for Research and Treatment of Cancer (EORTC) Quality of Life Study Group and was made aware of the properties of the Quality of Life Questionnaire (QLQ) being developed by that group . The details of the questionnaire's reliability and validity were reviewed by the CTG QOL Committee, and it was decided to adopt the use of this questionnaire for CTG purposes. The decision was made to use this questionnaire as our standard questionnaire in clinical trials for a number of reasons. First, it provides separate domain and single-item scores rather than a single aggregated or summary score. Having separate scores allows a detailed picture of how different domains of HRQOL are affected by a disease and its treatment. Although a single aggregated score is appealing because it is simpler to deal with, it hides changes in HRQOL domains that move in opposite directions or that may differ according to the treatment a patient receives. Second, using the same instrument in a variety of disease sites allows us to become very familiar with its properties and behavior in a variety of cancer populations. Third, it allows us to make comparisons across clinical trials at different disease sites if we wish to do so. Finally, the use of one questionnaire allows for simplicity of administration at the clinic level and in data management at the central office. Eventually, we began to use other questionnaires if we were participating in a trial initiated by another cooperative clinical trials group or if it was felt that another instrument had some properties that made it desirable for use in a particular trial. 
We have used 24 different instruments, but the EORTC QLQ-C30 (the core instrument with 30 items) has been used in 51 of 71 (71.8%) trials.
Development of Study-Specific Checklists
Generic and general QOL instruments need to be supplemented with disease-specific modules or with study-specific items to provide a more complete collection of the necessary HRQOL data. Almost no disease-specific modules/checklists existed for use in cancer populations by the early 1990s. By study-specific items, we refer to items that are specific to a particular trial and that ask questions not covered in the core questionnaire or a disease-specific module. Examples include symptoms caused by the disease or its treatment (e.g., difficulty passing urine in prostate cancer; bleeding from the vagina, the bladder, or rectum after radiation therapy (RT) for cervical cancer or for prostate cancer; or fever after treatment with interferon or from infection after chemotherapy). Study-specific checklists are supplementary to disease-specific modules and general or generic instruments. Because they consist of single items, they cannot be tested for reliability or validity in the usual ways, but they can be assessed for understandability and acceptability to the patient.
Several study-specific checklists were developed by the QOL Committee. As an example, one of these has been used in the MA.14 breast cancer trial of tamoxifen versus tamoxifen plus octreotide in postmenopausal women with primary breast cancer who have undergone definitive surgical management . The analysis of the MA.14 HRQOL checklist results shows that, in the short term, the selective checklist item scores discriminate between patients differing in clinical status and are responsive to change over time as well as to treatment-induced differences. The selective checklist item scores correlate statistically significantly with the patient's global QOL scores and provide additional information to that provided by the EORTC QLQ-C30 core questionnaire and the EORTC breast cancer module QLQ-BR23 (http://www.cancer.gov/clinicaltrials/EORTC-15931).
In this section, the results of several CTG studies are presented to illustrate how HRQOL assessment helps to better understand the effects of cancer and its treatment on patients' lives, to determine which treatment produces better HRQOL in randomized trials, and to provide recommendations that may help in the formulation of a guidance document. Many of our findings are based on large numbers of patients pooled together, in some cases, from several trials. It was possible to do this because of our early decision to use the same instrument (the QLQ-C30) in most of our trials. Had we used different instruments in each trial, the numbers could not have been combined and would have been too small to provide reliable data.
Validation and Reliability Studies
Further validation (reliability and validity) of the QLQ-C30 was established in ovarian, breast, and lung cancer. In addition, CTG studies helped to modify the original version of the QLQ-C30 into the second and third versions of the questionnaire [13,14].
Nausea and Vomiting after Chemotherapy
Studies of chemotherapy-induced nausea and vomiting in 832 chemotherapy-naive patients, enrolled in studies of either highly or moderately emetogenic chemotherapy, elucidated the deleterious effects of nausea and vomiting together and of each independently on physical, social, and cognitive functioning as well as on fatigue, appetite loss, and global QOL [15,16]. A study of risk factors associated with postchemotherapy nausea and vomiting in these patients revealed that low social functioning, prechemotherapy nausea, being female, increased fatigue, and a lower performance status were associated with a higher risk of either vomiting or nausea after highly emetogenic chemotherapy . After moderately emetogenic chemotherapy, however, some prognostic factors seemed to be inconsistent, and the probability of postchemotherapy nausea and vomiting was more strongly influenced by the type of chemotherapy given and the type of antiemetic used rather than by patient or by environmental factors .
It was also established in a study of moderately emetogenic chemotherapy that patients could be relied upon to pay attention to the time frame of the questions, because their answers were consistent with the expectation that postchemotherapy nausea and vomiting would have a greater impact on HRQOL during the first 3 days after chemotherapy than during the subsequent 4 days. Very short time frames (e.g., 1 day) provide inconsistent data, and therefore 3- to 7-day time frames are recommended to provide the most reliable data.
In 2390 patients with a variety of cancers, greater fatigue severity was found to be associated with being female, metastatic disease, and poorer performance status. The oldest patients and patients with breast cancer reported less fatigue, while patients with ovarian cancer or lung cancer had greater fatigue . Patients in antiemetic trials whose nausea and vomiting were well controlled showed less fatigue than did those with poor control of nausea and vomiting. Nevertheless, patients who had complete control of nausea and vomiting after highly emetogenic chemotherapy still had greater fatigue than at baseline . Thus, not all postchemotherapy fatigue can be attributed to poor control of nausea and vomiting. These studies were important in establishing the concept that complete control of nausea and vomiting following chemotherapy is ideal .
Prognostic Factors for Survival and Treatment Effects
A study of prognostic factors for survival in a heterogeneous group of 474 patients who had either localized or metastatic disease and who had been entered into chemotherapy trials showed that lower global QOL scores on the QLQ-C30 at baseline were independently associated with shorter survival.
In other clinical trials, CTG studies have shown that pretreatment global QOL scores were a predictor of on-treatment global QOL in patients with malignant melanoma. In advanced ovarian cancer, patients treated with cyclophosphamide and cisplatin experienced a less deleterious impact on HRQOL during treatment than patients treated with paclitaxel and cisplatin. The latter patients reported more difficulty with myalgia and neurosensory problems than did the former, but by a year after treatment, both groups were similar.
An intergroup trial between an Italian group and the CTG in Stage II and IV nonsmall-cell lung cancer (NSCLC) found that although patients treated with gemcitabine plus vinorelbine reported fewer problems with appetite, vomiting, alopecia, and ototoxicity than patients receiving cisplatin-based chemotherapy, their lung cancer symptoms were not as well controlled and there was no overall difference in global QOL between the cisplatin and non–cisplatin-based chemotherapy . QOL was the primary study end point in that study, and provided clinically useful information that was not previously known. A large study of the targeted agent, erlotinib, versus placebo as second- or third-line treatment for Stage IV NSCLC revealed a clinically and statistically significantly longer median time to deterioration of predefined index symptoms for patients on the erlotinib arm (e.g., 4.7 vs. 2.9 months for dyspnea) as well as improvements in global QOL and physical function. These findings confirmed that the observed survival benefit is truly of clinical benefit as it provides better palliation and QOL .
Another study in extensive-stage small-cell lung cancer, however, showed how HRQOL can be adversely affected by a new, dose-intensive therapy as compared with standard therapy . This trial was stopped before completion of accrual because of an excess of early deaths, but the HRQOL data also showed a clear worsening of physical functioning, fatigue, and global QOL in the dose-intensive arm. A randomized trial of preoperative versus postoperative RT in 190 patients with extremity soft tissue sarcomas showed that those who received postoperative RT had better limb function and less pain at 6 weeks after surgery than did patients with preoperative RT . There were no differences at later time points. The conclusion was that timing of RT had a minimal impact on function in the first year after surgery.
Observer Ratings versus Direct Patient-Reported Outcomes
A study using 12 simulated patients asked seven experienced clinical data managers to score toxicity grades using the NCIC CTG and the World Health Organization (WHO) toxicity scales . Modest levels of inter-rater reliability were demonstrated with kappa values that ranged from 0.50 to 1.00 in laboratory-based categories and from −0.04 to 0.82 for clinically based categories. Proportions of agreement for clinical categories ranged from 0.52 to 0.98. Condensing the toxicity grades improved statistics of agreement, but substantial lack of agreement remained (kappa range −0.04 to 0.82; proportions of agreement range 0.67–0.98). Thus, experienced data managers, when interviewing patients, draw varying conclusions regarding toxic effects experienced by such patients. Neither the NCIC CTG-expanded toxicity scale nor the WHO standard toxicity scale demonstrated a clear superiority in reliability, although the breadth of toxic effects recorded differed.
In addition, a low correlation exists between patient-reported toxicity and toxicity grade by observers . Between 1988 and 1990, the NCIC CTG conducted a multicenter Phase III trial that compared two immunomodulating agents as long-term adjuvant therapy in resectable malignant melanoma . Incidence and severity data on 11 symptoms of particular interest were collected by each of three methods: the case report flow sheet (FS) completed by study personnel, the Symptom Diary (SD), and the QLQ-C30 (the latter two patient-reported). Both the FS and SD included a preset list so that each toxicity required a graded response (0 = none).
Ten of the 89 available cases were randomly selected for an initial analysis. In four patients, all three methods produced either identical or only slightly differing records of the expected toxicities. In five patients there were discrepancies. In one patient, each method documented different toxicities. In four cases, the QLQ-C30 picked up more symptoms than the SD or FS. The SD in turn picked up more symptoms than did the FS. Finally, one case was not evaluable as treatment was stopped after 2 weeks, but the QLQ-C30 was not done until 16 weeks. Preliminary results thus indicate that reporting of treatment-related symptoms can differ substantially among these three methods of data collection.
A more recent study involving 303 patients with advanced breast cancer compared the level of agreement between patients' self-evaluations and health-care personnel's accounts of patients' symptoms . Fifteen “matched” symptoms from the QLQ-C30 and the NCIC CTG-expanded Common Toxicity Criteria were evaluated over seven common time points. Agreement was only fair to slight (kappas from 0.012 to 0.378) between patients and health care personnel and worsened over time. In general, patients reported many more symptoms than did health care personnel. Therefore, the use of only toxicity grading by health-care personnel may result in under-reporting of toxicity, thus altering study results. Further studies are required to explain these discrepancies.
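For readers unfamiliar with the kappa statistic, the following minimal sketch computes Cohen's kappa for two sources rating the same patients. The text does not state which kappa variant the cited studies used, so the two-rater form is an assumption made here for illustration, and the symptom data are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 1 = symptom reported, 0 = not reported, for ten patients.
patient_report = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
staff_record = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
kappa = cohen_kappa(patient_report, staff_record)
print(round(kappa, 3))  # about 0.31
```

In this invented example the two sources agree on 6 of 10 patients, yet kappa is only about 0.31 because 42% agreement would be expected by chance alone. Kappa can also fall below zero when agreement is worse than chance, as in the lower end of the ranges quoted above.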
Analysis of HRQOL Data
The analysis and reporting of HRQOL data has undergone a gradual evolution. Initially, NCIC investigators sought to determine whether there was a statistically significant difference in mean domain scores between the treatment arms of a study. Those with higher scores may be healthier than those with lower scores and may remain on study longer, so using only group mean scores may result in a survivor bias.
Subsequently, several studies have compared the means of the changes in HRQOL scores from baseline between treatment groups [16–18]. In this approach, the change in a patient's scores is calculated by subtracting the baseline score from subsequent on-study scores. Thus, only scores of patients who are still on study are used. The advantage is that a spurious increase in the group mean scores is eliminated because an individual patient's on-study scores are compared only with that patient's own baseline scores. The disadvantage, however, is that analysis of this subset of patients may limit the generalizability of the results to the whole sample. Also, when the individual differences are taken together, the mean change in scores from baseline between groups does not inform us of the clinical significance of the change in scores. For example, as alluded to earlier, is a difference of 5% in the mean change scores clinically meaningful? Knowledge of the change for the entire group also does not inform us of the differences reported by individual patients, some of whom may have experienced larger changes in scores while others experienced smaller changes.
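The change-from-baseline calculation described above can be sketched as follows. The function names, patient identifiers, and scores are hypothetical, and scores are assumed to lie on a 0 to 100 scale.

```python
from statistics import mean


def change_scores(baseline, on_study):
    """Per-patient change: on-study score minus that patient's own baseline.

    Both arguments map a patient identifier to a score. Only patients with
    both a baseline and an on-study score contribute a change score.
    """
    return {pid: on_study[pid] - baseline[pid] for pid in on_study if pid in baseline}


def mean_change(baseline, on_study):
    """Group mean of the individual change scores."""
    changes = change_scores(baseline, on_study)
    if not changes:
        raise ValueError("no patient has both baseline and on-study scores")
    return mean(changes.values())


# Hypothetical scores; patient p4 has left the study before this assessment.
baseline = {"p1": 50, "p2": 75, "p3": 25, "p4": 60}
week_12 = {"p1": 60, "p2": 70, "p3": 45}
changes = change_scores(baseline, week_12)
print(changes)
print(mean_change(baseline, week_12))
```

The example makes the limitation noted above concrete: p4 contributes nothing to the mean change, so the result describes only the patients who remained on study.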
Measuring Clinically Meaningful Change
Is a change of 3 or 5 points on a scale of 0–100 important from a clinical perspective? The CTG chose to study this question by building on an approach used to determine the minimal important difference (MID) as suggested by the studies of a group at McMaster University [33,34]. A Subjective Significance Questionnaire was used to determine the smallest amount of change on four QLQ-C30 domains that was perceptible to patients with advanced breast cancer and small-cell lung cancer as a change from the previous time that they had completed the QLQ-C30. It was found that the smallest subjectively significant change perceptible to patients is between 5% and 10% of the breadth of the QLQ-C30 for physical, social, and emotional functioning and global QOL. A change of 11% to 20% is moderate and changes of more than 20% are large. On average, these changes correspond to effect sizes of less than 0.5, 0.5 to 0.8, and more than 0.8, respectively.
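The bands just described can be written as a small classification function. This is a sketch under stated assumptions: the label for changes under 5%, the treatment of values falling between 10% and 11%, and the function name are choices made here and are not specified in the text.

```python
def classify_change(change, scale_breadth=100.0):
    """Label the size of a change score as a percentage of the scale breadth.

    Bands follow the text: 5% to 10% small, 11% to 20% moderate, more than
    20% large. Changes under 5% are labeled as below the perceptible range,
    and values between 10% and 11% are treated as moderate (both assumptions).
    """
    if scale_breadth <= 0:
        raise ValueError("scale_breadth must be positive")
    pct = abs(change) / scale_breadth * 100.0
    if pct < 5.0:
        return "below perceptible range"
    if pct <= 10.0:
        return "small"
    if pct <= 20.0:
        return "moderate"
    return "large"


for delta in (3, 8, -15, 25):
    print(delta, classify_change(delta))
```

The absolute value is used so that improvements and deteriorations of the same size receive the same label; the direction of change has to be interpreted separately for each scale.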
Several other studies in patients with different diseases, completing different questionnaires, have shown that a change of about 7% to 8% of the scale breadth is perceived as a change from baseline [33–36]. By using other external anchors to determine a MID, a similar difference has been determined [37,38]. The magnitude of this difference is about 0.4 of a standard deviation or of an effect size [39,40]. Thus, this difference can be used as a cut point to distinguish patients who have experienced an HRQOL benefit from those who have not experienced a benefit. Using a cut point right at the MID, however, may include “false positives” as having experienced a true benefit. Therefore, a higher cut point (i.e., 10%) to distinguish those who have had a true HRQOL benefit from those who have not is recommended [41–47]. A 10% cut point is about 0.5 of a standard deviation or of an effect size [35,39,40]. In addition to being used for determining the proportions of patients with an HRQOL benefit, the cut point can also be used to compare the duration of benefit within and between groups of patients. Also, the magnitude of these changes can be helpful for calculating sample sizes required to detect a specified change in clinical trials.
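The relation between a change in points and a fraction of a standard deviation can be illustrated with a one-line calculation. The standard deviation of 20 points used below is an assumption chosen so that the example reproduces the approximate figures quoted above; actual standard deviations vary by scale and population.

```python
def standardized_change(change_points, sd_points):
    """Express a change score as a fraction of the standard deviation."""
    if sd_points <= 0:
        raise ValueError("sd_points must be positive")
    return change_points / sd_points


ASSUMED_SD = 20.0  # illustrative value only, in points on a 0-100 scale

mid_es = standardized_change(8.0, ASSUMED_SD)    # 8-point change
cut_es = standardized_change(10.0, ASSUMED_SD)   # 10-point cut point
print(mid_es, cut_es)
```

Under this assumed standard deviation, an 8-point change corresponds to 0.4 of a standard deviation and the 10-point cut point to 0.5.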
NCIC CTG Recommendations for Analysis
The NCIC CTG recommends a relatively simple, clinically practical analysis that consists of four steps. After calculation of the questionnaire completion rates and the baseline scores, the individual change scores from baseline are determined. Then, the proportions of patients who have reported a predetermined clinically significant change score (usually more than 10%) are calculated, and the differences in the number of patients who have benefited in each treatment group are tested for statistical significance. An advantage of this approach of calculating the proportion of patients with an “HRQOL response” is that patients with missing data may still be included in an intention-to-treat analysis.
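The last two steps can be sketched as below. Several details are assumptions made for illustration and are not taken from the CTG protocol: higher scores are taken to mean better HRQOL, a change equal to the cut point counts as a response, patients without a change score are counted as non-responders in the denominator, and a pooled two-proportion z-test stands in for whatever significance test a given trial specifies.

```python
import math


def response_proportion(changes, n_randomized, cut_point=10.0):
    """Count responders and divide by all randomized patients.

    Patients with no change score are absent from `changes` but still count
    in the denominator, so they are treated as non-responders.
    """
    if n_randomized <= 0 or len(changes) > n_randomized:
        raise ValueError("n_randomized must be positive and cover all change scores")
    responders = sum(1 for c in changes if c >= cut_point)
    return responders, responders / n_randomized


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("test undefined when the pooled proportion is 0 or 1")
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical arms of 10 randomized patients each; some have no change score.
resp_a, prop_a = response_proportion([12, 15, -5, 10, 0, 20], n_randomized=10)
resp_b, prop_b = response_proportion([5, -10, 11, 0, 3], n_randomized=10)
z, p = two_proportion_z_test(resp_a, 10, resp_b, 10)
print(prop_a, prop_b, round(z, 3), round(p, 3))
```

With arms this small an exact test would normally be preferred over the normal approximation; the numbers are kept small only so that the arithmetic can be followed by hand.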
A recent study compared the NCIC CTG QOL standard protocol with two other methods, namely summary measure and summary statistic-based approaches, and linear mixed model-based methods, in analyzing HRQOL data from a randomized clinical trial in patients with advanced breast cancer. The different statistical approaches have both advantages and disadvantages. The CTG standard protocol is easy to implement and takes clinical importance into account in the analysis of the QOL data. Other summary measures and statistics-based approaches are also simple to interpret. The time effects cannot be assessed with these approaches. Model-based methods take the correlation between repeated measurements into account and could test the treatment effect over time. For a comprehensive exploratory analysis, model-based methods would seem to be essential. The assumptions underlying the model-based approaches, however, are difficult to verify and have a strong impact on the validity of the analysis results.
This study did not identify a method that is better than all other approaches, and our suggestion is that, in the analysis of QOL data, different methods should be explored to assess the robustness of the results. Other analyses, such as growth curves [28,49] or pattern mixture models [50,51], may be applied. An additional useful analysis is to calculate the proportion of patients reporting an “HRQOL response” within a treatment group. This is useful not only in nonrandomized Phase II trials, where there is no comparison group, but also in Phase III randomized studies [44,46].
The strategies for and the approach to assessing HRQOL in clinical trials have evolved over time within the CTG. Some of the strategies that were set in place at the outset, such as developing a policy, producing writing guidelines and explicit instructions to CRAs and other personnel, and assigning liaison members to DSGs, have proved useful over time. These strategies have produced high questionnaire completion rates with a minimum of randomly missing data due to administrative error. A rationale has been presented for our decision to use a single questionnaire, the QLQ-C30, in most of our trials. The QOL Committee of the CTG has produced simple guidelines for the analysis of HRQOL data that culminate in a determination of the proportions of patients who benefit from treatment. Whether our simple response-based analysis is sufficiently robust for most purposes is still uncertain, and it needs to be tested by comparison with more complicated analytic approaches.
Key points and recommendations from the NCIC CTG experience
1. It is possible to collect quality data, even when HRQOL outcomes are complex
   a. Institute a clear policy mandating HRQOL assessments in appropriate Phase III trials
   b. Provide guidelines on the required HRQOL content in clinical trial protocols
   c. Establish methods to minimize randomly missing HRQOL scores
2. Conventional toxicity data assessment has limited reliability for subjective end points
   a. Validity and reliability of subjective (non–laboratory-based) data are poor
   b. There is a lack of concordance between toxicity data and HRQOL data
   c. Encourage comparative studies designed to elucidate these differences
3. QOL scores have repeatedly been validated and are subject to design detail
   a. Validated as independent measures of outcome
   b. Validated as independent predictors of outcome
   c. Sensitive to time frame of interest
4. Simple analyses are preferred
   a. Encourage comparative studies of different analysis strategies
   b. Encourage studies to explore how much missing data is “acceptable”
5. Single-item outcome assessments should not be ignored and may have validity
   a. Within a validated questionnaire
   b. As single-item checklists
   c. Encourage studies to determine the value of single items