Do physicians correctly assess patient symptom severity in gastro-oesophageal reflux disease?
Dr C. A. Fallone, McGill University Health Center, Royal Victoria Hospital, 687 Pine Avenue West, Room R2.28, Montreal, QC, Canada, H3A 1A1.
Background : The accuracy of physicians’ assessment of the severity of gastro-oesophageal reflux disease is unclear.
Aim : To correlate physician and patient assessment of gastro-oesophageal reflux disease severity and its response to treatment.
Methods : Adult uninvestigated gastro-oesophageal reflux disease patients (n = 217) completed symptom and health-related quality of life questionnaires at baseline and after treatment with esomeprazole 40 mg p.o. daily. Pearson coefficients quantified correlations between physician assessments and patient responses.
Results : At baseline, the strongest correlations were heartburn severity (0.31), overall symptom severity (0.44) and a domain of the quality of life in reflux and dyspepsia questionnaire (0.31) (P < 0.001). Correlations of change with treatment were greater than baseline correlations: heartburn (0.39), overall symptoms (0.50) and global rating of change – stomach problems (0.72, all P < 0.001). The mean difference between the physicians’ assessment of change and the patients’ global rating of change was 0.20 (95% confidence interval: 0.10–0.29) with physicians overestimating benefit.
Conclusions : Correlations were often significant, although weak to moderate and better with symptom severity than with health-related quality of life instruments as well as with change after therapy than at baseline. Increasing attention to health-related quality of life may help physicians better understand patients’ experience. In clinical trials, treatment success should be assessed by the patient as well as the physician.
Symptoms of gastro-oesophageal reflux disease (GERD) are common, affecting 25–40% of the population.1 Patients experience varying degrees of heartburn and regurgitation, and some have atypical manifestations such as chronic cough and asthma. Health-related quality of life (HRQL) is reduced in these patients to a level at times lower than that associated with untreated duodenal ulcer, angina, mild heart failure, diabetes or hypertension.2–5 Efficacious treatment exists for controlling symptoms, but response is not uniform,6 and may differ when rated by patients or physicians. Physicians’ expectations of response to therapy for GERD symptoms and the occurrence of a suboptimal outcome6 may lead to discordant opinions between the physician and the patient and to patient frustration.
Many observers suggest that a physician's ability to accurately assess patient symptom severity and HRQL is limited. The Rome II Working Party on Design of Clinical Trials, for example, recommends that assessment of symptom severity be recorded by the patient and not the physician.7 Discordance between physician and patient assessments has been clearly demonstrated in cancer patients.8–10 Physicians were unlikely to accurately determine the Karnofsky performance scale, the Spitzer quality of life evaluation and linear analogue self assessment scales.8 Several studies have found that physicians underestimate the severity of the cancer patients’ symptoms and the impact on HRQL,11–14 but overestimate patient anxiety and emotional distress.15 Clinician–patient discordance has also been demonstrated with musculoskeletal conditions,16, 17 where physicians again rated their patients’ health status higher than the patients themselves.18 In one study assessing outcome after total hip arthroplasty, the discrepancy increased when the patient was not satisfied with the outcome.19
As part of a multicentre trial whose main objective was to investigate the use of marker states in HRQL assessment,20 we addressed physician–patient discordance in GERD by correlating physician assessment of symptom severity with patient assessment and with validated HRQL questionnaires, both at baseline and after a 4-week course of a proton pump inhibitor (PPI), in 217 GERD patients.
Uninvestigated adult out-patients presenting to 13 specialist gastroenterology practices and four general practices with a clinical diagnosis of GERD and identifying their main symptom as ‘a burning feeling rising from the epigastrium or lower part of the chest up towards the neck’21 entered the study between March 2002 and March 2003. A gastroscopy was not required for entry into this study, could not be performed within the 2 weeks prior to the study and was not performed during the study. Inclusion criteria were age 18 or over, symptoms of 3 months duration or greater and heartburn occurrence on at least four of the last 7 days prior to the first visit. In addition, when asked to think of their symptoms or limitations in the activities over the last week, patients had to report at least moderate problems on a validated 7-point scale ranging from no problem to very severe problems. Patients were ineligible if they had any alarm features, a history of intolerance or failure to respond to PPIs, a peptic ulcer within the last 10 years or required continuous concurrent therapy with non-steroidal anti-inflammatory drugs including aspirin (>325 mg/day) or with medication known to interfere with esomeprazole. Regular use of acid suppressive medication or prokinetics was not permitted for the 2 weeks prior to enrollment and other than the study medication, these agents as well as misoprostol, antibiotics or bismuth compounds were not permitted during the study.
Study design and therapeutic intervention
Following baseline assessments, patients were treated with esomeprazole 40 mg p.o. daily (Nexium, AstraZeneca, Mississauga, Canada) until follow-up 28 days (4–6 weeks) thereafter. Patients rated their symptoms and HRQL over the preceding week both at baseline and follow-up.
At baseline, patients completed the following questionnaires: the overall symptom severity scale (7-point scale ranging from no problem to very severe problem), the four symptoms scale (evaluates stomach ache, heartburn, belching and acid reflux on a 7-point Likert scale with one being no discomfort and seven being severe discomfort), the feeling thermometer, a standard gamble (SG) questionnaire, the Health Utility Index, the medical outcomes short form 36 (SF-36) questionnaire and the health-related quality of life in reflux and dyspepsia (QoLRAD) questionnaire (all described below). After PPI administration, subjects were again required to complete the overall symptom severity, four symptoms, feeling thermometer, standard gamble, QoLRAD and the Global rate of change questionnaires.
An experienced research coordinator trained all site interviewers in a day-long session. Because two centres enrolled French-speaking patients, all instruments not previously translated into Canadian–French were translated into Canadian–French and back-translated to English until the method centre investigators were satisfied with the back translation.
The treating physician assessed the patient's disease severity at baseline as none, mild, moderate or severe. At follow-up the treating physician performed an overall rating of the patient's change in symptom status on a 7-point scale ranging from very much improved to very much worsened.
Ethics review boards at all study sites approved the study protocol and all patients signed an informed consent form prior to enrollment in the study.
Feeling thermometer (FT). The FT is a visual analogue scale shown as a thermometer, in which the best state is full health (equal to a score of 100) and the worst state is death (a score of 0).22 A self-administered version of the FT was used. As this study was conducted as part of another,20 addressing the use of marker states with the FT, the subjects were randomized to receive the FT with or without three previously described marker states (mild, moderate and severe GERD).20, 23
Standard gamble. This instrument offers patients two options to choose from. Choice A is the certain outcome that the patient will stay in their own health state for t years until death. t was varied depending on the patient's age as previously described.20 The alternative (choice B) is a hypothetical treatment with two possible outcomes: (i) returning to full health (probability P) for t years, at the end of which the patient dies or (ii) immediate death (probability 1 − P). Interviewers varied the probability P in steps of 0.05 to obtain the value, P*, where the patient considered choice A equal to choice B. This indifference probability, P*, is the utility value for the health state of the patient's own health in choice A in the interval from death (=0) to full health (=1). The greater a respondent's willingness to accept the risk of a worse outcome (e.g. death) to avoid the health state in choice A, the lower the utility of the state in choice A.24 As with the FT, the SG was administered either with or without the three same marker states prior to rating their own health state.
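The titration described above can be sketched in code. This is a minimal illustration rather than the interview script used in the study: the source states only that P was varied in steps of 0.05 until choices A and B were judged equal, so the descending search order, the function names `elicit_indifference_probability` and `make_respondent`, and the simulated respondent are all assumptions introduced here.

```python
from typing import Callable


def elicit_indifference_probability(prefers_gamble: Callable[[float], bool],
                                    step: float = 0.05) -> float:
    """Lower P from 1.0 in fixed steps while the gamble (choice B) is still accepted.

    Returns the lowest P at which the respondent still accepts the gamble,
    taken here as the indifference probability P* (assumed search order).
    """
    n_steps = round(1.0 / step)
    p_star = 1.0
    for k in range(n_steps, -1, -1):
        p = round(k * step, 10)  # avoid accumulating floating-point drift
        if prefers_gamble(p):
            p_star = p
        else:
            break  # first refusal ends the titration
    return p_star


def make_respondent(hidden_utility: float) -> Callable[[float], bool]:
    """Hypothetical respondent who accepts the gamble when its expected
    utility, P * 1 + (1 - P) * 0 = P, is at least their own-state utility."""
    return lambda p: p >= hidden_utility


# A respondent whose true utility is 0.78 is recovered only to the 0.05 grid.
utility = elicit_indifference_probability(make_respondent(0.78))
print(utility)
```

Because P moves in steps of 0.05, the elicited utility is resolved only to the nearest grid point at or above the respondent's underlying value in this sketch.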
Health utilities index mark 2 and 3 (HUI2 and HUI3). This is a 15-item self-administered questionnaire designed to quantify HRQL.25 Each item has four to six response options. There are eight attributes in the HUI3 classification system: vision, hearing, speech, ambulation, dexterity, emotion, cognition and pain. There are seven attributes in the HUI2: sensation, mobility, emotion, cognition, self-care, pain and fertility.
Medical outcomes short-form 36. The SF-36 contains 36 items that measure eight dimensions: physical functioning, role limitation because of physical health problems, bodily pain, general health perceptions, vitality, social functioning, role limitations because of emotional problems and general mental health. This questionnaire has been extensively tested for validity and reliability.26 Each domain is scored on a 0 to 100 scale where higher scores indicate better HRQL. Scores on the SF-36 can also be expressed as two summary measures, the physical component score and the mental component score, which provide a measure of the overall effect of physical and mental impairment on HRQL.
Quality of life in reflux and dyspepsia. The QoLRAD consists of 25 items across five dimensions: emotional distress, sleep dysfunction, vitality, food/drink problems and physical/social functioning.27, 28 Respondents provide answers on a 7-point Likert-type scale. The lower the value, the more severe is the impact on daily functioning. The QoLRAD is reliable, valid and responsive.27, 28
Global rating of change. Patients are asked to indicate if they are better, the same or worse compared with the start of treatment. Patients reporting improvement or worsening would then proceed to estimate the magnitude of this change on a 7-point scale, thus, essentially constituting a 15-point scale. The global rating of change has been previously used to establish the minimally important difference in evaluative studies.29
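To make the '15-point scale' concrete, the two-step answer (direction, then magnitude) can be coded as a signed integer. The sketch below is an assumption made for illustration: the signed coding from −7 to +7 and the name `global_rating_score` are not taken from the study, which does not report how responses were coded.

```python
def global_rating_score(direction: str, magnitude: int = 0) -> int:
    """Map a (direction, magnitude) answer onto a signed score from -7 to +7.

    'same' maps to 0; 'better' and 'worse' carry a magnitude of 1 to 7.
    This coding is hypothetical and chosen only to show the 15 categories.
    """
    if direction == "same":
        return 0
    if direction not in ("better", "worse"):
        raise ValueError("direction must be 'better', 'same' or 'worse'")
    if not 1 <= magnitude <= 7:
        raise ValueError("magnitude must be between 1 and 7")
    return magnitude if direction == "better" else -magnitude


# Enumerate every possible answer: 7 worse + 1 same + 7 better = 15 values.
ALL_SCORES = sorted(
    {global_rating_score("same")}
    | {global_rating_score(d, m) for d in ("better", "worse") for m in range(1, 8)}
)
print(len(ALL_SCORES))
```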
Likert-type scales were treated as continuous data in order to calculate Pearson correlation coefficients between physician assessment at baseline and the patient assessment/questionnaires at baseline. Similarly, coefficients were calculated between physician assessment of change with therapy and the change in patient assessment/questionnaires. All scales to be compared were reordered such that they went in the same direction for improvement and worsening. Because of the uncertainty of applying parametric tests to categorical data, non-parametric Spearman rank correlations were also determined. Given that these results did not differ from the Pearson correlations, they are not shown. Calculated P-values for the correlation coefficients are based on the t-distribution. To determine whether physicians systematically underestimated or overestimated patient response to treatment, the mean physician score for change was compared, using a paired t-test, with the mean patient-assessed global rating of change for stomach problems after the latter was converted from a 15-point scale to a 7-point scale. For all comparisons, P ≤ 0.01 was considered significant.
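The correlation steps described above can be sketched as follows. This is an illustrative standard-library implementation, not the software used in the study; the helper names (`reverse_score`, `pearson_r`, `average_ranks`, `spearman_rho`, `t_statistic`) are assumptions introduced here. The t statistic is the usual transformation of a correlation coefficient, referred to a t-distribution with n − 2 degrees of freedom.

```python
import math
from typing import List, Sequence


def reverse_score(x: float, lowest: float, highest: float) -> float:
    """Flip a scale so that it runs in the opposite direction."""
    return lowest + highest - x


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Reported baseline coefficient for overall symptom severity (R = 0.44, n = 217).
print(round(t_statistic(0.44, 217), 2))
```

For the reported baseline coefficient of 0.44 with n = 217, this gives t of roughly 7.2 on 215 degrees of freedom, in line with the reported P < 0.001.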
Of the 237 patients who met all the initial screening criteria and were enrolled, 217 completed the study and were included in the present analyses. Table 1 shows the baseline characteristics of the patients.
Table 1. Demographic information and baseline characteristics
|Age (years)||49.7 (13.7)|
|Gender (% females)||52.5|
|Time since diagnosis (months)||86.3 (99.4)|
|Smoking history (% non-smoker)§||43.6|
|Race/ethnicity (% Caucasian)||88.0|
|Severity of GERD*,† (%)|
| Mild problem||3.7|
| Moderate problem||77.9|
| Severe problem||18.4|
|Overall symptom severity‡||4.65 (0.77)|
| Heartburn||4.49 (1.24)|
| Acid reflux¶||4.06 (1.56)|
| Stomach pain||3.86 (1.47)|
| Belching§||3.56 (1.59)|
|Feeling thermometer‡||0.670 (0.192)|
|Standard gamble‡||0.775 (0.194)|
|HUI 2 Utility score¶,‡||0.782 (0.158)|
|HUI 3 Utility score‡||0.795 (0.211)|
|SF-36 Physical component score**,‡||45.1 (8.7)|
|SF-36 Mental component score**,‡||47.6 (11.0)|
| Emotional distress||4.48 (1.42)|
| Sleep disturbance||4.48 (1.43)|
| Food/drink problems||3.83 (1.25)|
| Physical/social||5.45 (1.39)|
| Vitality||4.31 (1.32)|
All physician–patient correlations are shown in Table 2. At baseline, the only correlation coefficients with a value of >0.30 (arbitrarily chosen a posteriori as a threshold) were heartburn severity (R = 0.31, P < 0.001), overall symptom severity (R = 0.44, P < 0.001) and the food and drink dimension of QoLRAD (R = 0.31, P < 0.001).
Table 2. Correlations between patient and physician assessments
|Measure||n||Baseline correlation (R)||n||Correlation of change (R)|
|Overall symptom severity||217||0.44‡||217||0.50‡|
| Acid reflux||215||0.22‡||212||0.24‡|
| Stomach pain||217||0.19†||217||0.25‡|
|Health utility index|
| HUI 2 Utility score||215||0.23‡||–||–|
| HUI 3 Utility score||217||0.09||–||–|
| Physical functioning||213||0.16*||–||–|
| Bodily pain||217||0.16*||–||–|
| General health||216||0.15*||–||–|
| Social functioning||217||0.24‡||–||–|
| Mental health||217||0.09||–||–|
| Emotional distress||217||0.26‡||217||0.27‡|
| Sleep disturbance||217||0.23‡||217||0.28‡|
| Food/drink problems||217||0.31‡||217||0.36‡|
|Global ratings of change in:|
| Stomach problems||–||–||215||0.72‡|
| Limitations in activities||–||–||216||0.36‡|
With 4 weeks of esomeprazole, there were marked improvements in patient-assessed heartburn severity, which went from a mean of 4.49 pre-treatment to 1.59 (mean change score of −2.90, 95% CI: −3.11, −2.69, P < 0.0001), acid reflux from 4.06 to 1.63 (mean change score −2.41, 95% CI: −2.63, −2.19, P < 0.0001) and overall symptom severity from 4.65 to 1.88 (mean change score −2.76, 95% CI: −2.94, −2.58, P < 0.0001). All measures showed statistically significant improvement in symptom or HRQL (Table 3). The patient-assessed global rating of change responses are shown in Table 4. Physicians assessed change in symptom severity as much or very much improved in 88.5% and no change or worse in only 3.7%.
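The confidence intervals for the change scores can be checked from the summary statistics in Table 3. The sketch below assumes a normal approximation (critical value 1.96) and sample sizes of 217 for heartburn and 212 for acid reflux; these sample sizes and the function name `ci_from_summary` are assumptions for illustration, since the exact n and the method behind each interval are not stated in the text.

```python
import math
from typing import Tuple


def ci_from_summary(mean: float, sd: float, n: int,
                    critical: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% CI for a mean from its SD and sample size."""
    half_width = critical * sd / math.sqrt(n)
    return round(mean - half_width, 2), round(mean + half_width, 2)


# Heartburn change score: mean -2.90, SD 1.59 (Table 3), n assumed to be 217.
heartburn_ci = ci_from_summary(-2.90, 1.59, 217)

# Acid reflux change score: mean -2.41, SD 1.62 (Table 3), n assumed to be 212.
acid_reflux_ci = ci_from_summary(-2.41, 1.62, 212)

print(heartburn_ci, acid_reflux_ci)
```

Under these assumptions the intervals come out at (−3.11, −2.69) for heartburn and (−2.63, −2.19) for acid reflux, matching Table 3.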
Table 3. Post-treatment and change scores with esomeprazole
|Measure||Post-treatment score||Change score||95% CI of change||P-value|
|Overall symptom severity||1.88 (1.13)||−2.76 (1.32)||(−2.94, −2.58)||<0.0001|
| Heartburn*||1.59 (1.03)||−2.90 (1.59)||(−3.11, −2.69)||<0.0001|
| Acid reflux†||1.63 (1.14)||−2.41 (1.62)||(−2.63, −2.19)||<0.0001|
| Stomach pain||1.95 (1.28)||−1.91 (1.79)||(−2.15, −1.67)||<0.0001|
| Belching*||1.99 (1.26)||−1.57 (1.62)||(−1.79, −1.35)||<0.0001|
|Feeling thermometer||0.85 (0.13)||0.18 (0.20)||(0.16, 0.21)||<0.0001|
|Standard gamble||0.84 (0.17)||0.07 (0.20)||(0.04, 0.10)||<0.0001|
| Emotional distress||6.52 (0.89)||2.04 (1.35)||(1.85, 2.22)||<0.0001|
| Sleep disturbance||6.55 (0.82)||2.07 (1.46)||(1.87, 2.27)||<0.0001|
| Food/drink problems||6.30 (1.02)||2.47 (1.30)||(2.29, 2.64)||<0.0001|
| Physical/social||6.69 (0.65)||1.25 (1.28)||(1.08, 1.42)||<0.0001|
| Vitality||6.40 (0.97)||2.09 (1.24)||(1.93, 2.26)||<0.0001|
Table 4. Patient-assessed global rating of change responses
|Stomach problems||19/215 (8.8%)||20/215 (9.3%)||62/215 (28.8%)||114/215 (53.0%)|
|Limitations in activity||69/216 (31.9%)||27/216 (12.5%)||43/216 (19.9%)||77/216 (35.7%)|
|Emotional||107/216 (49.5%)||15/216 (6.9%)||44/216 (20.4%)||50/216 (23.2%)|
Physician–patient correlations of change with treatment were generally greater than those at baseline (Table 2). Those with a value of >0.30 were heartburn (R = 0.39, P < 0.001), QoLRAD vitality (R = 0.32, P < 0.001), QoLRAD food and drink problems (R = 0.36, P < 0.001) and global rating of change in limitation in activities (R = 0.36, P < 0.001). In addition, overall symptom severity had a correlation coefficient of 0.50 (P < 0.001). The strongest physician–patient correlation was with global rating of change in stomach problems with R = 0.72 (P < 0.001). Physicians overestimated patient response: the mean physician assessment of change score (1.59 ± 0.81 on a 7-point Likert-like scale where one represents ‘very much improved’ and seven, ‘very much worsened’) suggested more improvement than the patients’ global rating of change in stomach problems (1.78 ± 1.04). The mean difference between these two assessments was 0.20 (95% CI: 0.10–0.29, P < 0.001).
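The paired comparison reported above can be illustrated with a short sketch. The ratings below are made-up toy data on the same 7-point scale (1 = very much improved), not study data, and the function name `paired_t` is an assumption; the code shows only the arithmetic of a paired t-test, in which each patient's rating is compared with the rating given by the physician for that same patient.

```python
import math
from typing import Sequence, Tuple


def paired_t(first: Sequence[float],
             second: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, standard error, t statistic) for second - first."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    std_error = math.sqrt(variance / n)
    return mean_diff, std_error, mean_diff / std_error


# Hypothetical ratings for eight patients (lower = more improvement).
physician = [1, 2, 1, 2, 1, 3, 2, 1]
patient = [2, 2, 1, 3, 2, 3, 2, 2]

mean_diff, std_error, t = paired_t(physician, patient)
print(mean_diff, round(std_error, 3), round(t, 3))
```

A positive mean difference here means that patients gave higher (less improved) ratings than physicians, which is the direction of the 0.20 difference reported in the study. The t statistic is referred to a t-distribution with n − 1 degrees of freedom.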
Estimating the severity of a patient's symptoms and, just as importantly, the impact of the underlying disease on that patient's HRQL is an essential component of providing proper medical care. Physicians have traditionally relied upon objective markers of disease severity such as oesophagitis grade, although conditions that lack demonstrable findings, such as endoscopy negative GERD, can be just as debilitating. Underestimating the degree of the patient's symptoms or the impact on that individual's HRQL can lead to frustration, impairment of the physician–patient relationship and disruption in the healing process. As declared by Carr and Donovan,17 the issue is not that the physician's assessment is right or wrong. Rather, both the patient and physician opinions may be valid, but failure to acknowledge that they may be different can result in less effective treatment.
Although our study demonstrated statistically significant associations between physician assessment and patient parameters, the correlation coefficients were not very strong with values rarely above 0.30 (Table 2). Global scales such as overall symptom severity (R = 0.44) fared better than individual symptoms or HRQL scales and perhaps these are fairer comparisons given that the physician assessments were global assessments. In addition, associations were stronger with the changes with treatment than with baseline assessments in almost all symptoms or instrument scores (Table 2). Here too, global scales fared best (R = 0.72 for global rating of change in stomach problems and R = 0.50 for overall symptom severity). This observation is reassuring for clinical practice. Questions like ‘Overall, how is the severity of your symptoms?’ or ‘Overall, how much is the treatment helping you?’ would, thus, fare quite well in assessing treatment response. The magnitude of the difference between patient- and physician-reported outcomes was small (0.2 on a 7-point scale), but the difference was statistically significant. This finding suggests the potential existence of a physician ‘expectation bias’ for a successful treatment and supports the recommendation of the Rome II Working Party on Design of Clinical Trials that assessment of treatment response should be performed by the patient.7
Associations were stronger with symptom scores compared with HRQL scores. For example, at baseline, overall symptom severity had a coefficient of 0.44, compared with only 0.05 with SG (Table 2). These findings are consistent with those from other conditions such as musculoskeletal diseases, where intraclass coefficients were 0.42 for pain, 0.11 for FT and 0.04 for standard gamble.18 Among the HRQL measures, we found correlations were higher with the disease-specific QoLRAD (0.31 for food/drink domain) than the generic instruments. The correlations of change in symptom scores with global ratings of change (stomach problems 0.72; overall symptom severity 0.50) were also greater than correlations with HRQL change scores (SG: R = 0.13, FT: R = 0.27).
The modest correlation between physicians and patients is in keeping with reports of similar analysis in other diseases including cancer,8–14 and musculoskeletal conditions.16, 18 We do not believe that this is because of an inability of a particular instrument to assess patient HRQL as all instruments were previously validated for this purpose. Rather, reasons for this discrepancy may be the result of the difference in importance given to certain aspects of health status between physicians and patients. Physicians have a tendency to weigh their assessment of disease severity on symptoms rather than HRQL. Indeed, the best correlation with physician assessment was the patient-assessed global rate of change for stomach problems (R = 0.72).
In a survey among inflammatory bowel disease paediatric patients and their physicians, only two of the top 10 most important items of concern identified by the patients were ranked in the top 10 by physicians.30 Physicians over-estimated the importance of physical symptoms, whereas they underestimated the importance of HRQL issues such as ‘bothered by having to take medicines’, ‘worries about future health problems’ and ‘worries about weight’.30 In multiple sclerosis, patients and physicians also disagreed on which domains of health status were more important.31 More important to physicians were physical role limitations and physical function whereas, for the patient, emotional role limitation and mental health were more important.31
Another possibility is that the patient is trying to please the physician and does not divulge how ill he or she is actually feeling. Other possibilities include physicians’ failure to inquire in detail or patients’ difficulty explaining their experience in an articulate or easily understood fashion. In addition, in this study, the physicians were asked to generally rate the severity of the disease rather than to comment on HRQL. It is possible that correlation with HRQL would be better if they were specifically asked to incorporate HRQL in their assessment. Similarly, correlations could conceivably have been greater if the patients and physicians had answered the exact same questions. This would have been a better comparison, but regardless, our results demonstrate that physicians may not sufficiently incorporate patients’ HRQL in their assessment of disease severity.
For total hip arthroplasty, it has been suggested that the following factors may contribute to discordance: (i) physicians and patients may have different expectations with regard to the results of the procedure; (ii) they may have a different definition of an excellent outcome; (iii) patients may not state their problems clearly for fear of disappointing the physician; (iv) the physician may not comprehend the true nature of the pain and the patient's level of satisfaction and (v) the patient's assessment may be influenced by the quality of the patient–physician relationship.19, 32 Certainly, all these factors can apply to GERD patients as well. The emotional impact of GERD in a trapeze artist or a carpenter who needs to hammer in a bent-over position for much of the day may not be immediately apparent to the physician, unless he delves into the daily activities in some detail. Otherwise, patient expectation, assessment of outcome and satisfaction may be quite different from that of the physician. In addition, patients may take into account the effort and emotional effects related to the difficulty in performing a function, whereas the physician may only rate the ability.16
The QoLRAD instrument had the highest correlation with physician assessment compared with the other HRQL instruments, both at baseline and post-treatment (Table 2). This is particularly true for the food/drink domain, again perhaps because physicians weigh their assessment of GERD severity predominantly on gastrointestinal symptoms. This is not surprising, in that disease-specific instruments focus on the problems related to the underlying condition.
Although the before–after design of this study, in contrast to a randomized trial, permits only weak inferences about treatment effects, our results are consistent with previous studies that demonstrate an excellent therapeutic effect of esomeprazole on GERD symptoms.33–35 We observed significant improvements in overall symptom severity, stomach pain, heartburn, acid reflux and belching (Table 3) in our patient population of moderate-to-severe GERD patients (Table 1). The majority of patients rated the improvement in stomach problems or activity limitations as a large change (Table 4) and HRQL improved significantly as measured with SG, FT and QoLRAD (Table 3).
There are some limitations to our study. As mentioned above, correlation could have been greater if the patients and physicians had answered the exact same questions. In addition, at baseline, physicians utilized a 4-point scale (rather than the 7-point scale used post-treatment), to assess disease severity. This scale was chosen in order to be consistent with other esomeprazole studies, which had used the same scale. The 4-point scale could reduce differences and perhaps a 7-point scale would have been more discriminating and possibly produced better correlation coefficients at baseline. In addition, the results are subject to the inherent limitations of correlation coefficients in assessing discordance between opinions. A low correlation coefficient does not exclude a non-linear relationship, although one would not expect such a relationship here. One also cannot extrapolate these results to patients outside the range of moderate-to-severe GERD, as these were the ones examined in this study (Table 1).
Hence, our results show statistically significant, although not particularly strong, correlations between physician assessment and patient questionnaires. The stronger correlations were with symptom severity scores rather than HRQL instruments and correlations of change after therapy were generally stronger than those at baseline, although physicians slightly overestimated patient response. Perhaps physicians should dwell more on the effects of GERD on the patient's HRQL when assessing the patient in order to better understand the patient's view and rating of the disease. These data reinforce the notion that in clinical trials, the assessment of impact of treatment should be performed by the patient as well as by the physician. Further studies comparing the way physicians and patients assess treatment success are desirable.
This work was supported by a grant from AstraZeneca Pharmaceuticals Inc. Participating investigators and affiliation: Dr Iain Murray, Quest Clinical Trials, Markham, ON, Canada; Dr Daniel Sadowski, Hys Medical Centre, Edmonton, AB, Canada; Dr Alan Barkun and Dr Serge Mayrand, McGill University Health Centre, Montreal General Hospital Site, Montreal, QC, Canada; Dr Ford Bursey, St John General Hospital, St John's, NF, Canada; Dr Naoki Chiba, Surrey GI Research/Clinic, Guelph, ON, Canada; Dr Lawrence Cohen, Sunnybrook and Women's College, Toronto, ON, Canada; Dr Carlo Fallone, McGill University Health Centre, Royal Victoria Hospital Site, Montreal, QC, Canada; Dr Francis Joanes, Port Arthur Clinic, Thunder Bay, ON, Canada; Dr Marc Bradette, L'Hotel-Dieu de Quebec, QC, Canada; Dr David Morgan and Dr David Armstrong, Hamilton Health Sciences, Hamilton, ON, Canada; Dr Sander Veldhuyzen van Zanten, Queen Elizabeth II Health Sciences Centre, Halifax, NS, Canada; Dr Pierre Pare, Hospital St Sacrement, Quebec, QC, Canada; Dr W. Olsheski, Albany Medical Clinic, Toronto, ON, Canada; Dr Ivor Teitelbaum, Yorkview Medical Centre, North York, ON, Canada; Dr Subodh Kanani, Lakeshore West Medical Professional Centre, Toronto, ON, Canada; Dr Paul Braude, Markham Research, Thornhill, ON, Canada.