Assessing Utility Where Short Measures Are Required: Development of the Short Assessment of Quality of Life-8 (AQoL-8) Instrument

Authors


Graeme Hawthorne, Department of Psychiatry, The University of Melbourne, Level 1 North, Royal Melbourne Hospital, Grattan Street, Parkville, Vic. 3050, Australia. E-mail: graemeeh@unimelb.edu.au

ABSTRACT

Objectives:  As researchers seek to capture clinical outcomes and participants' health-related quality of life (HRQoL) while also meeting the demands of economic evaluation, they must collect disparate outcome data in settings where parsimony is imperative. This study addressed this problem by constructing a short HRQoL measure, the Assessment of Quality of Life (AQoL)-8, from the original AQoL.

Methods:  Data from the AQoL validation database (N = 996) were reanalyzed using item response theory (IRT) to identify the least fitting items, which were removed. The standard AQoL scoring algorithm and weights were applied. Validity, reliability, and sensitivity tests were carried out using the 2004 South Australian Health Omnibus Survey (N = 3015), including direct comparisons with other short utility measures, the EQ5D and SF6D.

Results:  The IRT analysis showed that the AQoL was a weak scale (Loevinger H = 0.36) but reliable (Mokken ρ = 0.84). Removal of the four weakest items led to an 8-item instrument with two items per subscale, the AQoL-8. The AQoL-8 Loevinger H = 0.38 and Mokken ρ = 0.80 suggested similar psychometric properties to the AQoL. It correlated (intraclass correlation coefficient) 0.95 (or 90% of shared variance) with the AQoL. The AQoL-8 was as sensitive to six common health conditions as the AQoL, EQ5D, and SF6D.

Conclusions:  The utility scores fall on the same life–death scale as those of the AQoL. Where parsimony is imperative, researchers may consider use of the AQoL-8 to collect participant self-report HRQoL data that is suitable for use either as reported outcomes or for the calculation of quality-adjusted life-years for cost-utility analysis.

Introduction

Worldwide there are a small number of multiattribute utility (MAU) instruments that are suitable for use at both the evaluation of health-related quality of life (HRQoL) and economic analysis (cost-utility analyses (CUA)) levels. These instruments include the Assessment of Quality of Life (AQoL; N = 15 items) [1], the EQ5D (formerly the EuroQol; N = 5 items) [2], the Health Utility Index Version 3 (HUI-3; N = 15) [3,4], the 15D (N = 15) [5], the Rosser/Kind index (N = 12 levels) [6], the QWB (Quality of Well-Being Scale; N = 27) [7], and the SF6D (which uses 10 items from the 36 in the host instrument, the SF-36) [8].

After Mill [9], the standard practice that has evolved is that researchers will include in their evaluation instrument battery only measures that are essential to the study's primary outcome. The inclusion of secondary outcome instruments is a luxury subject to the constraints of research funding, the willingness of clinicians and researchers to administer longer instrument batteries and the capacity of participants to complete clinical measures and/or questionnaires. Although inclusion of generic HRQoL instruments has become widespread, because MAUs are usually associated with secondary outcomes, their inclusion in instrument batteries may pose difficulties.

One difficulty is that different MAUs provide different estimates of HRQoL and utility [10–18]. A second follows from the first: researchers are increasingly using multiple instruments, a practice that requires parsimony because each additional instrument adds to the research burden.

Parsimony (i.e., instrument length) is well understood in relation to the number of instrument items: the longer each instrument, the longer the questionnaire. Long questionnaires may lower participation rates and increase response resistance, causing poorer quality data (e.g., satisficing or missing data) [19–26]. For example, Dorman et al. [27] reported higher completion rates for the EQ5D than for the SF-36, as did Holland et al. [28] when comparing the EQ5D with the AQoL. Parsimony is also required to meet the measurement axiom that instrument scores should be isomorphic with the underlying construct; the longer an instrument, the greater the difficulty in meeting this requirement [22,29,30]. In the broad field of life satisfaction and health, these were the reasons for developing the WHOQOL-Brèf (26 items) from the WHOQOL-100 (100 items) [23], and the SF-12 (12 items) from the SF-36 (36 items) [24].

Parsimony was a motivation behind the brevity of the EQ5D (designed to be used alongside other more comprehensive measures of quality of life) [2] and also the SF6D (thereby gaining two measures—the SF-36 health function profile and a utility estimate from administering the same set of items) [8]. Not surprisingly, these two measures are now the most widely reported MAU instruments: a search of PubMed showed that in 2007, there were 137 studies reporting use of the EQ5D and 33 the SF6D.

Despite this high and increasing use, both suffer serious defects. The EQ5D has been criticized for its simple descriptive system, the gaps in utilities (brought about by the N3-term which assigns an extra automatic disutility of 0.27 to persons endorsing the worst item level on any item) and the high proportion of cases achieving ceiling scores, whereas the SF6D has been criticized because it cannot assess utility scores <0.30, i.e., it offers a truncated range on the life–death scale [10,12,15,31,32]. Despite these limitations, it is likely that the popularity of these two instruments is, at least in part, a function of their parsimony.

Given that HRQoL is increasingly included as an outcome measure in health research, that there is greater interest in CUAs being conducted alongside trials and evaluations, and that researchers often incorporate two MAUs into study protocols given the limitations of existing MAUs, a challenge for those designing MAU instruments is to provide parsimonious measures that meet the axioms of utility theory and that are compatible with increasingly complex, longer, and more demanding study questionnaires.

This study addressed this issue through the construction and testing of a short version of the AQoL measure, the AQoL-8, while retaining the measurement properties (including utility function) of the full AQoL.

Methods

Participants

Two existing data sets were used to construct and then test the AQoL-8.

For the construction of the AQoL-8, the original Victorian validation study (VVS) database for the AQoL was reanalyzed [11,12]. Participants were a stratified sample of Victorian (Australian) residents. The strata were 1) randomly selected community members weighted by socioeconomic status to achieve representativeness of the Australian population; 2) outpatients randomly sampled within selected time frames attending two of Melbourne's largest public hospitals; and 3) purposively sampled inpatients from three Melbourne hospitals, based on severity of condition. The sample comprised 996 adults. The response rates were 58% (n = 396) for the community sample, 43% (n = 334) for outpatients, and 68% (n = 266) for inpatients. The community sample comprised 46% of the study population, outpatients were 38%, and inpatients 16%. Fifty percent of respondents were male, the mean age was 52 (SD = 18) years, and 75% were Australian-born. Sixty percent of the sample was partnered (married, de facto), 18% were single, 11% were separated or divorced, and 12% were widowed. For education attainment, 64% had completed primary school or high school, 13% held a trade certificate, and 23% held a university degree. Forty-seven percent were in the labor force (working or studying).

To validate the AQoL-8, the 2004 South Australian Health Omnibus (SAHOS) database was reanalyzed. A full description of the SAHOS methodology can be found in Wilson et al. [33]. Briefly, the SAHOS is an annual user-pays population-based survey for health organizations covering both metropolitan and rural areas (rural towns with >1000 inhabitants covering ∼80% of the rural population). The sample was based on the collectors' districts used by the Australian Bureau of Statistics in the 2001 census, and was based on probability of selection proportional to size for every fourth household from a random start household. One interview was conducted per household where the respondent was the person who last had a birthday. Interviews were conducted by trained and experienced interviewers. For reliability purposes, reinterviews were conducted on a random 10%. The response rate was 72% (N = 3015). Forty-nine percent of respondents were male, the mean age was 45 (SD = 19) years, and 74% were Australian-born. Sixty-two percent of the sample was partnered (married, de facto), 24% were single, 9% were separated or divorced, and 6% widowed. For education attainment, 51% had completed either primary or high school, 13% held a trade certificate, 23% a certificate or diploma, and 14% a university degree. Sixty-six percent were in the labor force.

Ethics approvals for these two studies were obtained from the relevant ethics committees and written consent was obtained from all participants.

Measures

Demographic measures in both data sets included country of birth (Australia/Other), gender (male/female), age, education attainment (primary/high/trade/certificate or diploma/degree), relationship status (never partnered/de facto relationship/married/separated or divorced/widowed), and labor force participation (working/other).

The Assessment of Quality of Life (AQoL) instrument is a MAU instrument [1,11]. Although it consists of 15 items, 12 items are used in the utility scoring algorithm. These form four dimensions with three items each: independent living (IL), social relationships (SR), physical senses (PS), and psychological well-being (PW). For scoring, individual item responses are replaced with community preference values which were obtained from a representative sample of the population using time trade-off (TTO). A multiplicative model combines these into the four-dimension scores, again weighted by community preferences obtained through TTO. The resulting four dimension scores are then combined into a single score which is reweighted (again from a community sample based on TTOs) and presented as a utility score on a life–death scale where the end points are –0.04 (worse than death HRQoL equivalent state), 0.00 (death equivalent HRQoL state) to 1.00 (best HRQoL).
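The way a multiplicative model combines weighted item disutilities can be sketched as follows. This is a generic multiplicative multiattribute form and not the published AQoL algorithm: the weights, the function names, and the bisection solver for the scaling constant k are all illustrative assumptions, and the final rescaling onto the −0.04 to 1.00 life–death scale is omitted.

```python
from math import prod


def solve_scaling_constant(weights, iterations=200):
    """Find k in (-1, 0) such that 1 + k == prod(1 + k * w).

    Assumes sum(weights) > 1 and every weight < 1 (hypothetical weights).
    """
    def f(k):
        return prod(1.0 + k * w for w in weights) - (1.0 + k)

    lo, hi = -1.0, -1e-9  # f(lo) > 0 and f(hi) < 0 under the assumptions above
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def multiplicative_disutility(weights, disutilities, k):
    """Combine item disutilities (each in [0, 1]) into one dimension disutility."""
    product = prod(1.0 + k * w * du for w, du in zip(weights, disutilities))
    return (product - 1.0) / k


# Hypothetical two-item dimension; the weights are NOT AQoL values.
weights = [0.6, 0.5]
k = solve_scaling_constant(weights)
worst = multiplicative_disutility(weights, [1.0, 1.0], k)  # both items at worst level
```

With this form, a dimension in which every item is at its worst level has a disutility of 1, and one in which every item is at its best level has a disutility of 0, whatever the weights.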

The criterion MAU for assessing the performance of the AQoL-8 against the full AQoL was the HUI-3 [34], an independent estimator of HRQoL. The reason for assessing the performance of the AQoL-8 and AQoL against an external MAU was that the analyses revealed that difficulties with the AQoL item No. 7 (Personal relationships) may have resulted in the AQoL producing biased estimates (see the Discussion for an elaboration of this issue). In a previous study, the Barnett coefficient between the HUI-3 and AQoL was 0.84 [35] and both instruments loaded >0.80 on a structural equation model of HRQoL [36]. The HUI-3 measures “within the skin” functional capacity and comprises 15 items which form eight attributes (vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain) from which a utility score is derived using a multiplicative function, similar to that used in the AQoL [37–39]. The upper boundary is 1.00, and the lower boundary is –0.36; scores below 0.00 are described as the “pits.” In this study, HUI-3 utility scores were deciled (≤0.00, 0.01–0.10, 0.11–0.20, etc.).

For researchers trying to decide which instruments to include in a study, it is important that there are head-to-head comparisons of scores so that informed decisions regarding the distribution of scores and the sensitivity of instruments can be made (where sensitivity refers to the ability of instruments to detect differences cross-sectionally between known groups). For this reason, scores on the AQoL-8 were compared with those on the two parsimonious MAUs, the EQ5D and SF6D. The EQ5D (formerly the EuroQoL) has five items, each with three response levels, measuring mobility, self-care, usual activities, pain/discomfort, and anxiety/depression [2,40]. Based on British weights, the utility function is computed using an econometric regression model [41–43]. The upper boundary is 1.00, and the lower −0.59. The SF6D [8] uses 10 items from the SF-36: three from the physical functioning scale, one from physical role limitation, one from emotional role limitation, one from social functioning, two bodily pain items, two mental health items, and one vitality item. Utility values are obtained through an additive econometric model. The end points are 1.00 for the best possible state, and 0.30 for the worst possible state.
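The effect of the EQ5D N3-term can be illustrated with a short sketch of a regression-based tariff. The coefficients below are the widely cited British time trade-off values as recalled here; they are an assumption for illustration and should be checked against the published value set [41–43] before any use. The function and variable names are hypothetical.

```python
# Decrements for levels 2 and 3 on each dimension (assumed UK TTO values).
DECREMENTS = {
    "mobility": {1: 0.0, 2: 0.069, 3: 0.314},
    "self_care": {1: 0.0, 2: 0.104, 3: 0.214},
    "usual_activities": {1: 0.0, 2: 0.036, 3: 0.094},
    "pain_discomfort": {1: 0.0, 2: 0.123, 3: 0.386},
    "anxiety_depression": {1: 0.0, 2: 0.071, 3: 0.236},
}
CONSTANT = 0.081  # applied to any state other than full health
N3 = 0.269        # extra decrement when any dimension is at level 3


def eq5d_utility(state):
    """state maps each dimension name to a level (1, 2, or 3)."""
    levels = [state[dim] for dim in DECREMENTS]
    if all(level == 1 for level in levels):
        return 1.0
    utility = 1.0 - CONSTANT
    utility -= sum(DECREMENTS[dim][state[dim]] for dim in DECREMENTS)
    if any(level == 3 for level in levels):
        utility -= N3
    return utility


full_health = {dim: 1 for dim in DECREMENTS}
worst_state = {dim: 3 for dim in DECREMENTS}
```

Under these assumed coefficients, the worst state scores about −0.59, matching the lower boundary given above, and moving any single dimension from level 2 to level 3 triggers the additional N3 decrement, which is the source of the gaps in the utility distribution.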

Procedures

To shorten the AQoL, two overriding principles guided the study. The first was to retain fidelity to the original AQoL model through preserving its structure, and the second was to retain the TTO weights used to score the AQoL. The purpose was to have a final model with scores directly comparable to those of the AQoL.

For this study, a critical feature of the AQoL was the factorial structure of its descriptive system. The AQoL was constructed using factor analysis with a varimax rotation to ensure 1) all items loaded on a single principal component (>0.30), and that 2) all items within each dimension formed a unidimensional scale [1]. To test the robustness of the AQoL, during construction, items were iteratively removed and the stability of the model reexamined. The rationale for this procedure was that during use in real-world studies, an instrument was needed where missing data would not invalidate the utility score. To this end, the standard AQoL scoring algorithm made provision for the horizontal imputation of missing data [44,45] where one item was missing from each of the dimensions [46]; this implies that the AQoL utility score could be obtained from just eight items.

This property of the AQoL—a property that is unique among the MAU instruments—was exploited. It implied that it would be possible to identify the poorest (least contributing) item from each dimension and to remove it, then impute its value from the other remaining AQoL items while still retaining the overall AQoL structure, TTO weights, and scoring algorithm.

The unidimensionality of the descriptive system (i.e., scale homogeneity) was assessed using Mokken item response theory (IRT) analysis [47,48]. Items which “fit” a homogenous scale are those obtaining a Loevinger's Hi coefficient of homogeneity [49] >0.30 if the item is to be minimally acceptable and >0.40 for an item to be definitely included in the scale. Scale homogeneity is assessed through computation of Loevinger's H coefficient of homogeneity for the full scale. This is the weighted average of the Hi coefficients [49]. The accepted values for H are <0.4 indicating a weak scale, 0.40 to 0.50 a medium scale, and >0.50 a strong scale [50]. Items which did not fit the homogenous scale were identified as candidates for deletion.
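As an illustration of these coefficients, the sketch below computes Loevinger's Hi and H for polytomous items as ratios of observed covariances to the maximum covariances attainable given each item's marginal distribution. It is not the MSPWin implementation; the function names are hypothetical, items are assumed to be scored in the same direction, and no item may be constant.

```python
import numpy as np


def _cov(x, y):
    return np.mean(x * y) - np.mean(x) * np.mean(y)


def _cov_max(x, y):
    # The largest covariance possible with these marginals is obtained
    # by pairing the sorted scores of the two items.
    return _cov(np.sort(x), np.sort(y))


def loevinger_h(data):
    """Return (item Hi array, scale H) for a persons-by-items score matrix."""
    scores = np.asarray(data, dtype=float)
    n_items = scores.shape[1]
    cov = np.zeros((n_items, n_items))
    cov_max = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i != j:
                cov[i, j] = _cov(scores[:, i], scores[:, j])
                cov_max[i, j] = _cov_max(scores[:, i], scores[:, j])
    h_items = cov.sum(axis=1) / cov_max.sum(axis=1)
    h_scale = cov.sum() / cov_max.sum()
    return h_items, h_scale


def classify_scale(h):
    """Apply the thresholds quoted in the text."""
    if h < 0.40:
        return "weak"
    if h <= 0.50:
        return "medium"
    return "strong"
```

For a perfect Guttman pattern every observed covariance equals its maximum, so all Hi and H equal 1; the AQoL values reported in this study (e.g., H = 0.36) fall in the weak band.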

In addition, items were examined using Rasch partial credit modeling which specifies that the probability of item endorsement at a particular level is a function of the underlying status of the person (the person location, expressed in logits) and the expected item response [51,52]. When a person selects an item response category, the probability of endorsing that category in preference to the next category can be calculated. Response category thresholds refer to the point where the probability of endorsing one response category is equal to the probability of endorsing the next. Good fitting models are where there is a graded monotonic relationship between the person location and the item response categories such that persons with low abilities (e.g., being unable to wash or toilet themselves) endorse low response categories (e.g., unable to carry out personal care tasks). Disordered probability thresholds occur where this monotonic relationship does not exist (e.g., persons who are unable to wash themselves may endorse responses indicating that they can perform all personal care tasks by themselves). Items with disordered thresholds are generally considered poor and should be either recoded or removed from scales.
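The partial credit model and the notion of ordered thresholds can be made concrete with a minimal sketch. It is not the RUMM2020 implementation; the function names and the example threshold values are hypothetical.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta: person location in logits.
    thresholds: threshold locations delta_1..delta_m (m + 1 categories).
    """
    # Category 0 has exponent 0; category x adds (theta - delta_k) for k <= x.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    largest = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds):
    """True when each threshold lies above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))
```

At a person location equal to a threshold, the two adjacent categories are equally probable, which is the definition of a threshold given above. An item whose estimated thresholds fail `thresholds_ordered` would be flagged as disordered.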

A further criterion was that where all items within a subscale met these two criteria, the worst fitting item was identified as a candidate for deletion. After identification of candidate items for removal, regression models were constructed. For each candidate item, the raw (unweighted) noncandidate items remaining in the model were used as predictors in a regression analysis where the original candidate item was the dependent variable. The adjusted β-weights were examined and predictor items where βadj > 0.10 were retained. These were used as the predictors for proration of the missing values on the excluded items.
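The predictor selection step can be sketched with ordinary least squares. How the adjusted β-weights were computed in SPSS is not detailed here, so this sketch assumes they are the partial coefficients from a multivariable linear regression with an intercept; the function name and item labels are hypothetical.

```python
import numpy as np


def select_predictors(predictors, candidate, names, cutoff=0.10):
    """Regress the candidate item on the other items and keep strong predictors.

    Returns (constant, {name: coefficient}) for coefficients above the cutoff.
    """
    x = np.asarray(predictors, dtype=float)
    y = np.asarray(candidate, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])  # intercept column first
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    constant, betas = coefs[0], coefs[1:]
    kept = {name: float(b) for name, b in zip(names, betas) if b > cutoff}
    return float(constant), kept
```

In a full analysis the model would presumably be refitted with only the retained predictors before the coefficients are used for proration; that step is left out of the sketch.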

To select which items were retained in the AQoL-8 descriptive system, after proration, the imputed values for these items were iteratively entered into (i.e., replaced the true item values) and withdrawn (i.e., the true item values returned to the model) from the AQoL model. The start model was one in which the worst fitting item from each dimension was removed and the descriptive system examined. Each successive model had a prorated or original item reentered. This iterative process was continued until all possible combinations were examined. The final model accepted was the one that most closely approximated the original AQoL descriptive system and provided utility scores as close as possible to those of the original AQoL, judged on four criteria: item and scale Mokken analysis results, correlation with the full AQoL, a mean and standard deviation as close as possible to those of the full AQoL, and a similar shape and trajectory over the full AQoL utility range. The final AQoL-8 descriptive system contained just eight items which need to be asked of participants, two representing each of the underlying four dimensions from the original AQoL descriptive system. During scoring of the AQoL-8, based on participants' responses to the eight items in the descriptive system, the values of four other items are imputed.

The standard AQoL scoring algorithm, which incorporated the TTO utility weights for items and composite health states, was then applied. This ensured that the life–death scale range for the AQoL-8 remained identical with that of the full AQoL, from −0.04 to 1.00.

The resulting AQoL-8 was then directly compared with the AQoL using the SAHOS data set. Three tests were carried out comparing the performance of the AQoL-8 with the AQoL: comparisons of general characteristics, tests of correlation using the intraclass correlation coefficient (ICC), and score distribution with the HUI-3 deciles as the “gold standard.”

The sensitivity of the AQoL-8 was compared with that of the AQoL and the two parsimonious MAUs, the EQ5D and SF6D, for a range of common health conditions, viz., depression, arthritis, diabetes, urinary incontinence, vision impairment, and obesity.

Multivariable linear regression (nontransformed data because this was used only to identify predictors) was used to identify the predictors of AQoL items identified as candidates for deletion, the paired t test was used to compare paired means, and analysis of variance (ANOVA) was used to compare mean scores between known groups. Data analyses were carried out in SPSS V16.0 [53], MSPWin 5.0 [54], and RUMM2020 4.0 [55].

Results

Construction and Initial Measurement Properties of the AQoL-8 from the VVS Database

Table 1 shows all 12 scoring items from the AQoL with their measurement properties, analyzed at both the overall scale level and also by each contributing dimension. Regarding the overall AQoL scale, the results suggest that it is a weak scale (Loevinger H = 0.36) but one that is reliable (Mokken ρ = 0.84). If dimensionality is ignored, the candidate items for deletion would be Nos. 7, 8, 10, 11, and 12, as none of these meets the requirement of Hi > 0.30. These findings are not surprising because the dimensions were designed to be orthogonal to each other.

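Table 1.  AQoL item properties (scoring items only), Mokken IRT analysis

| No. | Item | Instrument Hi* | Instrument H† | Instrument ρ‡ | Instrument status§ | Dimension Hi* | Dimension H† | Dimension ρ‡ | Dimension status§ |
|---|---|---|---|---|---|---|---|---|---|
| | Independent living | | | | | | | | |
| 4 | Self-care | 0.43 | | | | 0.76 | | | Candidate |
| 5 | Activities of daily living | 0.43 | | | | 0.79 | | | |
| 6 | Mobility | 0.44 | | | | 0.77 | 0.78 | 0.89 | |
| | Social relationships | | | | | | | | |
| 7 | Personal relationships | 0.22 | | | Candidate | 0.36 | | | Candidate |
| 8 | Social relationships | 0.30 | | | Candidate | 0.42 | | | |
| 9 | Family role | 0.44 | | | | 0.33 | 0.37 | 0.60 | Candidate |
| | Physical senses | | | | | | | | |
| 10 | Vision | 0.28 | | | Candidate | 0.32 | | | Candidate |
| 11 | Hearing | 0.21 | | | Candidate | 0.40 | | | |
| 12 | Communication | 0.27 | | | Candidate | 0.41 | 0.38 | 0.57 | |
| | Psychological well-being | | | | | | | | |
| 13 | Sleep/rest | 0.37 | | | | 0.44 | | | Candidate |
| 14 | Anxiety | 0.38 | | | | 0.45 | | | |
| 15 | Pain | 0.43 | 0.36 | 0.84 | | 0.44 | 0.44 | 0.67 | |

  • * Hi, item coefficient of scalability.
  • † H, scale homogeneity.
  • ‡ Mokken ρ, reliability.
  • § Candidate item for deletion (i.e., worst-performing items across the AQoL or within each dimension).
  • AQoL, Assessment of Quality of Life; IRT, item response theory.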

Regarding the dimensions, the IL dimension would be considered a strong homogenous scale, and the PW dimension a medium scale. The other two scales (SR and PS) just failed to meet the criteria for a medium scale and would be regarded as weak scales. All items on the IL and PW dimensions met the criteria for definite inclusion, whereas the items on the SR and PS scales met the criteria for possible inclusion. The worst performing items were noted as candidates for deletion.

For the IL dimension, all three items met the criteria for inclusion both at the scale and instrument levels. The item fitting the model less well was No. 4 (Self-care). For the SR dimension, at the scale level, the item with the poorest fit was No. 9 (Family role) and then No. 7 (Personal relationships). At the instrument level, No. 7 performed poorly, as did No. 8 (Social relationships). Because it was a poorly fitting item at the scale and instrument level, No. 7 was the stronger initial candidate for deletion. All three items on the PS scale provided a poor fit at the instrument level analysis, whereas at the scale level analysis, No. 10 (Vision) was the obvious initial poorly fitting candidate item. Finally, for the PW scale, the item with the poorest fit to the model at the instrument level was No. 13 (Sleep); it was also the poorest fitting item at the scale analysis level.

Proration scores were prepared for all candidate items reported in Table 1. The regression analyses for all candidate items showed that the predictor items (βadj > 0.10) were:

  1. No. 4 (Self-care): No. 5 (Activities of daily living, βadj = 0.35), No. 6 (Mobility, βadj = 0.38), No. 9 (Family role, βadj = 0.11) (βconstant: 0.08, R² = 0.59, F = 127.47, P < 0.01);
  2. No. 7 (Personal relationships): No. 8 (Social relationships, βadj = 0.26), No. 9 (Family role, βadj = 0.15), No. 15 (Pain, βadj = 0.11) (βconstant: 0.42, R² = 0.20, F = 22.06, P < 0.01);
  3. No. 8 (Social relationships): No. 7 (Personal relationships, βadj = 0.23), No. 9 (Family role, βadj = 0.14), No. 14 (Anxiety, βadj = 0.33) (βconstant: 0.37, R² = 0.30, F = 37.07, P < 0.01);
  4. No. 9 (Family role): No. 4 (Self-care, βadj = 0.14), No. 5 (Activities of daily living, βadj = 0.19), No. 6 (Mobility, βadj = 0.18), No. 8 (Social relationships, βadj = 0.11), No. 14 (Anxiety, βadj = 0.12), and No. 15 (Pain, βadj = 0.12) (βconstant: −0.07, R² = 0.45, F = 70.49, P < 0.01);
  5. No. 10 (Vision): No. 5 (Activities of daily living, βadj = 0.13), No. 6 (Mobility, βadj = 0.15), No. 11 (Hearing, βadj = 0.16) (βconstant: 0.60, R² = 0.16, F = 16.82, P < 0.01);
  6. No. 11 (Hearing): No. 10 (Vision, βadj = 0.16), No. 12 (Communication, βadj = 0.33) (βconstant: 0.48, R² = 0.19, F = 20.45, P < 0.01);
  7. No. 12 (Communication): No. 11 (Hearing, βadj = 0.32) (βconstant: 0.43, R² = 0.20, F = 22.22, P < 0.01);
  8. No. 13 (Sleep): No. 5 (Activities of daily living, βadj = 0.16), No. 14 (Anxiety, βadj = 0.22), and No. 15 (Pain, βadj = 0.17) (βconstant: 0.24, R² = 0.28, F = 34.13, P < 0.01).
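Proration from these models amounts to evaluating a linear predictor. The sketch below uses the constants and coefficients reported above for the four items that were eventually discarded (Nos. 4, 7, 10, and 13). The numeric coding of the item responses, and whether the prorated value is rounded to a response level before scoring, are not specified in this section, so both are left as assumptions; the dictionary keys are hypothetical labels.

```python
# Constants and coefficients as reported in the regression list above.
PRORATION_MODELS = {
    "item4_self_care": (0.08, {"item5": 0.35, "item6": 0.38, "item9": 0.11}),
    "item7_personal_relationships": (0.42, {"item8": 0.26, "item9": 0.15, "item15": 0.11}),
    "item10_vision": (0.60, {"item5": 0.13, "item6": 0.15, "item11": 0.16}),
    "item13_sleep": (0.24, {"item5": 0.16, "item14": 0.22, "item15": 0.17}),
}


def prorate(target, responses):
    """Impute a discarded item from the retained items' raw responses.

    responses maps labels such as "item5" to the raw item response
    (coding assumed; no rounding is applied here).
    """
    constant, coefs = PRORATION_MODELS[target]
    return constant + sum(b * responses[name] for name, b in coefs.items())


example = {"item5": 1, "item6": 1, "item8": 1, "item9": 1,
           "item11": 1, "item14": 1, "item15": 1}
imputed_self_care = prorate("item4_self_care", example)
```

Every predictor used for these four items is one of the eight items retained in the AQoL-8, which is what allows the full 12-item scoring algorithm to be applied to eight observed responses.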

After proration, the iterative process described in the Methods section was performed. The final model, the AQoL-8, is given in Table 2, which can be directly compared with the original AQoL model in Table 1. The key difference is that the AQoL-8 has eight scoring items compared with the 12 in the full AQoL. The similarities are that from each of the original AQoL dimensions, two items have been retained, that the overall scalability of the model is similar to the AQoL, as is its reliability.

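Table 2.  AQoL-8 final model item properties, Mokken IRT analysis

| Item No. A* | Item No. B* | Item | Instrument Hi† | Instrument H‡ | Instrument ρ§ | Dimension Hi† | Dimension H‡ | Dimension ρ§ |
|---|---|---|---|---|---|---|---|---|
| | | Independent living | | | | | | |
| 5 | 1 | Activities of daily living | 0.43 | | | 0.80 | | |
| 6 | 2 | Mobility | 0.44 | | | 0.80 | 0.80 | 0.86 |
| | | Social relationships | | | | | | |
| 8 | 3 | Social relationships | 0.30 | | | 0.27 | | |
| 9 | 4 | Family role | 0.47 | | | 0.27 | 0.27 | 0.42 |
| | | Physical senses | | | | | | |
| 11 | 5 | Hearing | 0.22 | | | 0.49 | | |
| 12 | 6 | Communication | 0.27 | | | 0.49 | 0.49 | 0.60 |
| | | Psychological well-being | | | | | | |
| 14 | 7 | Anxiety | 0.38 | | | 0.44 | | |
| 15 | 8 | Pain | 0.44 | 0.38 | 0.80 | 0.44 | 0.44 | 0.61 |

  • * A, original AQoL item number; B, AQoL-8 item number.
  • † Loevinger Hi item coefficient of scalability.
  • ‡ Loevinger H scale homogeneity.
  • § Mokken ρ reliability.
  • AQoL, Assessment of Quality of Life; IRT, item response theory.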

The four discarded items were Nos. 4, 7, 10, and 13. Items Nos. 4, 10, and 13 were discarded because they were the poorest fitting items from the IL, PS, and PW scales, respectively (Table 1); they were not intrinsically poor items. No. 7 (Personal relationships), on the other hand, was a poor item—it was the only AQoL item with disordered thresholds. Figure 1 shows that the probability threshold line 2 is dominated by all other threshold lines. It would have required recoding of the item responses to remove this disordering, but this would have violated the utility weights from the TTOs during scoring.

Figure 1.  Disordered thresholds for No. 7 (Personal relationships). HRQoL, health-related quality of life.

Regarding individual dimensions, these are given in Table 2 for comparison with Table 1.

After imputation of raw scores at the item level, the standard AQoL scoring system—which incorporates TTO values for the raw health states—was applied. The left-hand panel of Table 3, based on the VVS database, compares utility scores from the AQoL-8 with those of the AQoL on the criteria outlined above. Generally, there was little difference in the properties between the two measures. The slight loss in reliability reflects that the AQoL-8 is shorter than the AQoL.

Table 3.  Comparison of the AQoL-8 with the full AQoL, utility scores

| | Construction sample*: AQoL-8 | Construction sample*: AQoL | Validation sample†: AQoL-8 | Validation sample†: AQoL |
|---|---|---|---|---|
| Score range | −0.04–1.00 | −0.04–1.00 | −0.04–1.00 | −0.04–1.00 |
| % of score range used | 100% | 100% | 100% | 100% |
| Mean | 0.66 | 0.65 | 0.82 | 0.81 |
| SD | 0.28 | 0.29 | 0.21 | 0.20 |
| Median | 0.72 | 0.73 | 0.90 | 0.87 |
| IQR | 0.42 | 0.40 | 0.29 | 0.25 |
| Explained variance‡ | 76.60% | 65.10% | 71.47% | 59.06% |
| Scale homogeneity (Loevinger H) | 0.38§ | 0.36 | 0.29§ | 0.29 |
| Reliability (Mokken ρ) | 0.80§ | 0.84 | 0.80§ | 0.78 |
| ICC (AQoL-8 with AQoL) | 0.95 | | 0.94 | |

  • * Data from the Victorian validation study database for the AQoL.
  • † Data from the South Australian Health Omnibus.
  • ‡ The proportion of variance in AQoL-8 and AQoL scores explained by the AQoL-8 and AQoL constructs.
  • § Excludes prorated items. See the text for a discussion.
  • AQoL, Assessment of Quality of Life; ICC, intraclass correlation coefficient; IQR, interquartile range.

The ICC between the two measures was r = 0.95 (Table 3), suggesting that 90% of the variance was shared. Nevertheless, the data distribution of the AQoL-8 is slightly different from that of the AQoL, and there were a small number of cases with considerable discrepancy in scores, as shown in Figure 2, which is a Bland and Altman plot of the limits of agreement [56]. The mean difference was 0.012 (95% confidence interval (CI): 0.01–0.02). Ninety-seven percent of the cases obtained difference scores that were within the ±2SD limits of agreement. One area where the AQoL-8 and AQoL differed was the proportion of cases obtaining ceiling scores (1.00). This varied by participant type. For the AQoL-8 (AQoL), the proportions were 21.2% (11.9%) for the community sample, 7.9% (4.7%) for outpatients, and 6.5% (3.0%) for inpatients.

Figure 2.  Bland and Altman plot. AQoL, Assessment of Quality of Life.

Examination of the 3% of cases outside the ±2SD limit showed that for 26/31, they were participants who, on the AQoL No. 7, endorsed a level of “4” (indicating that they had no close relationships); this item was excluded from the AQoL-8 because of its disordered thresholds (Table 1, Fig. 1). Given the measurement difficulties with this item, it is problematic to know whether the lower AQoL scores are underestimates or the AQoL-8 scores overestimates.
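The agreement statistics used in this comparison can be sketched briefly. The code computes the mean difference, the ±2SD limits of agreement, and the indices of cases falling outside them for two paired score vectors. The function name and the toy data are hypothetical, and the sample standard deviation is assumed.

```python
import statistics


def limits_of_agreement(scores_a, scores_b, k=2.0):
    """Return (mean difference, (lower, upper) limits, indices outside limits)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the paired differences
    lower, upper = mean_diff - k * sd_diff, mean_diff + k * sd_diff
    outside = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return mean_diff, (lower, upper), outside


# Toy data: nine identical pairs and one highly discrepant pair.
short_form = [0.5] * 9 + [1.0]
full_form = [0.5] * 9 + [0.0]
mean_diff, limits, outside = limits_of_agreement(short_form, full_form)
```

Cases flagged in `outside` correspond to the kind of discrepant records examined above. The shared variance quoted for the ICC follows from squaring it (0.95² ≈ 0.90).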

Validation of the AQoL-8 in the SAHOS Database

Validation of the AQoL-8 was undertaken using the 2004 SAHOS database.

The ICC between AQoL-8 and AQoL scores was r = 0.94, and the mean scores were 0.82 (SD = 0.21) and 0.81 (SD = 0.20), respectively. The right-hand panel of Table 3 compares utility scores from the AQoL-8 with those of the AQoL on the study criteria. As shown, with the exception of scale homogeneity (Loevinger H), there was almost no difference in the scores between the AQoL-8 and AQoL. Possible reasons for the poor Loevinger H are presented in the Discussion.

Scores on both instruments, and on the EQ5D and SF6D, were examined by HUI-3 utility decile. The results are shown in Figure 3. The relationship between AQoL-8 and AQoL scores shows monotonically increasing mean scores for both over the entire HUI-3 score range. The only HUI-3 deciles in which AQoL-8 and AQoL scores differed significantly were 0.81–0.90 and 0.91–1.00 (paired t test, P < 0.05). The absolute differences, however, were small. The average difference in scores was 0.009 utilities (95% CI: 0.006–0.011), and the biggest difference in mean scores was 0.02 in the range 0.21–0.30.

Figure 3.  AQoL-8, AQoL, EQ5D, and SF6D scores by HUI-3 deciles. AQoL, Assessment of Quality of Life; CI, confidence interval; HUI-3, Health Utility Index Version 3.
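The comparison by HUI-3 decile can be sketched as a simple banding step followed by a mean per band. The sketch assumes that utilities are reported to two decimal places and that a score of exactly 0.00 belongs to the lowest band; the function names and band labels are hypothetical.

```python
from collections import defaultdict


def hui3_band(score):
    """Label the HUI-3 band (<=0.00, 0.01-0.10, ..., 0.91-1.00) for a score."""
    cents = round(score * 100)  # work in integer hundredths to avoid float edges
    if cents <= 0:
        return "<=0.00"
    upper = (cents - 1) // 10 + 1  # 1 for 0.01-0.10, ..., 10 for 0.91-1.00
    return f"{(upper - 1) / 10 + 0.01:.2f}-{upper / 10:.2f}"


def mean_by_band(hui3_scores, other_scores):
    """Mean of another instrument's scores within each HUI-3 band."""
    grouped = defaultdict(list)
    for hui3, other in zip(hui3_scores, other_scores):
        grouped[hui3_band(hui3)].append(other)
    return {band: sum(vals) / len(vals) for band, vals in grouped.items()}
```

Applying `mean_by_band` once per instrument gives the series of means of the kind plotted in Figure 3, and monotonicity can then be checked by stepping through the bands in order.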

Comparisons between the AQoL-8, EQ5D, and SF6D suggest some differences. There was similarity in mean scores in the region 0.61–1.00, but means may hide other differences. The score range for the AQoL-8 was −0.04 to 1.00, with 29.6% of cases obtaining ceiling scores (1.00; for the AQoL, the percentage was 19.9%); the EQ5D score range was −0.48 to 1.00 with 42.5% of cases obtaining ceiling scores, and the score range for the SF6D was 0.30 to 1.00 with 9.8% of cases at the ceiling. As HRQoL utility deteriorated (i.e., lower HUI-3 deciles), the three instruments provided very different estimates of utility. The most obvious finding was that mean scores on the SF6D did not fall appreciably below 0.50, regardless of the HUI-3 decile, whereas those on both the AQoL-8 and EQ5D fell to mean scores <0.20 for those obtaining utility scores in the pits (≤0.00) on the HUI-3. The second finding was that the EQ5D provided very steep increments in utility by HUI-3 decile in the region <0.00 to 0.30, which corresponds to the region where the N3-term systematically depresses EQ5D utility scores. The SF6D, by comparison, was nonmonotonic at the bottom of the range: its mean score was 0.50 in the lowest HUI-3 decile (≤0.00) but 0.49 in the next (0.01 to 0.10), suggesting invalidity. In contrast to these distributional difficulties on both the EQ5D and SF6D, the AQoL-8 provided consistent monotonic increments across the HUI-3 utility deciles.

The sensitivity of the AQoL-8 was examined against the health conditions shown in Table 4. For people with these conditions, the mean difference between the AQoL-8 and AQoL was 0.00 for obesity; 0.01 for arthritis, depression, and urinary incontinence; 0.02 for diabetes; and 0.04 for vision impairment. On five of the measures, the AQoL was slightly more efficient, and on one measure, the AQoL-8 was more efficient; importantly, there were no statistically significant differences between AQoL-8 and AQoL scores on any of these health conditions (ANOVA, P > 0.05), except for vision impairment (ANOVA, P < 0.01). Vision impairment was included in these sensitivity tests because the AQoL vision item (No. 10) was deleted from the AQoL-8. The findings suggest that this deletion made very little difference to the performance of the AQoL-8.

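Table 4.  Tests of sensitivity comparing the AQoL-8 and AQoL

| Condition | Group | N | AQoL-8 Mean | AQoL-8 SD | AQoL Mean | AQoL SD | EQ5D Mean | EQ5D SD | SF6D Mean | SF6D SD |
|---|---|---|---|---|---|---|---|---|---|---|
| Depression | None | 2521 | 0.86 | 0.18 | 0.85 | 0.17 | 0.86 | 0.17 | 0.84 | 0.12 |
| | Minor/Other | 253 | 0.69 | 0.23 | 0.68 | 0.22 | 0.72 | 0.24 | 0.70 | 0.14 |
| | Major | 241 | 0.54 | 0.24 | 0.53 | 0.25 | 0.53 | 0.32 | 0.60 | 0.12 |
| | Statistics | | F = 378.83* | | F = 406.77* | | F = 342.44* | | F = 507.88* | |
| Arthritis | No | 2728 | 0.83 | 0.20 | 0.82 | 0.20 | 0.84 | 0.20 | 0.81 | 0.14 |
| | Yes (all kinds) | 287 | 0.69 | 0.25 | 0.68 | 0.23 | 0.66 | 0.27 | 0.71 | 0.14 |
| | Statistics | | F = 125.11* | | F = 132.65* | | F = 199.79* | | F = 131.74* | |
| Diabetes | No | 2786 | 0.83 | 0.20 | 0.82 | 0.20 | 0.84 | 0.21 | 0.81 | 0.14 |
| | Yes | 229 | 0.73 | 0.25 | 0.71 | 0.24 | 0.73 | 0.24 | 0.76 | 0.14 |
| | Statistics | | F = 50.68* | | F = 61.02* | | F = 41.76* | | F = 27.65* | |
| Urinary incontinence | No | 2113 | 0.85 | 0.19 | 0.84 | 0.18 | 0.85 | 0.20 | 0.83 | 0.14 |
| | Slight/Mild | 714 | 0.78 | 0.20 | 0.77 | 0.20 | 0.79 | 0.20 | 0.77 | 0.14 |
| | Moderate | 119 | 0.64 | 0.27 | 0.63 | 0.26 | 0.67 | 0.28 | 0.71 | 0.14 |
| | Severe | 61 | 0.50 | 0.31 | 0.51 | 0.31 | 0.50 | 0.39 | 0.64 | 0.17 |
| | Statistics | | F = 111.06* | | F = 111.56* | | F = 93.18* | | F = 77.83* | |
| Vision impairment | Perfect/Slight/Moderate | 2881 | 0.83 | 0.20 | 0.82 | 0.19 | 0.83 | 0.21 | 0.81 | 0.14 |
| | Severe | 132 | 0.59 | 0.29 | 0.55 | 0.28 | 0.65 | 0.28 | 0.70 | 0.15 |
| | Statistics | | F = 180.99* | | F = 238.15* | | F = 90.68* | | F = 76.16* | |
| Obesity (BMI) | Normal/Overweight | 2225 | 0.83 | 0.20 | 0.82 | 0.19 | 0.84 | 0.21 | 0.81 | 0.14 |
| | Obese | 532 | 0.78 | 0.23 | 0.78 | 0.22 | 0.78 | 0.23 | 0.78 | 0.15 |
| | Statistics | | F = 21.64* | | F = 20.38* | | F = 26.33* | | F = 22.91* | |

*P < 0.0001.
Statistical test: ANOVA.
AQoL, Assessment of Quality of Life; BMI, body mass index.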

Table 4 also shows similar tests of sensitivity comparing the AQoL-8, EQ5D, and SF6D. All three instruments were sensitive to all conditions. Each instrument proved to be the most sensitive on at least one test (the AQoL-8 was the most sensitive to diabetes, urinary incontinence, and vision impairment; the EQ5D was the most sensitive to arthritis and obesity; and the SF6D to depression). The SF6D, however, assigned the highest utility score for those with each of the conditions (other than for obesity). Particularly noteworthy in this respect was the very substantial difference for the SF6D on incontinence. The AQoL-8 and EQ5D scores were similar across all conditions, with the exception of vision impairment where for those with severe impairment, the AQoL-8 assigned a score 0.06 utilities lower than the EQ5D.
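The F statistics in Table 4 can be roughly cross-checked from the published group sizes, means, and SDs alone. The sketch below is an illustration only, not the analysis code used in the study; the function name `anova_f_from_summary` is hypothetical, and because the tabled values are rounded to two decimals, the recomputed F can only approximate the reported one.

```python
def anova_f_from_summary(groups):
    """One-way ANOVA F statistic from (n, mean, sd) triples, one per group."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares, recovered from each group's SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Arthritis rows of Table 4, AQoL-8 columns: (N, mean, SD).
arthritis_aqol8 = [(2728, 0.83, 0.20), (287, 0.69, 0.25)]
f_value = anova_f_from_summary(arthritis_aqol8)
print(round(f_value, 1))
```

For the arthritis rows this gives an F of roughly 121, against the reported 125.11; a gap of that size is consistent with the two-decimal rounding of the published means and SDs.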

Discussion

Although there are seven MAUs suitable for use both as outcome measures and also in economic evaluation, where there are competing demands, the priority for researchers is to include in their evaluation instrument battery those measures that are essential to the study's primary outcome. The inclusion of secondary study outcome instruments is a function of available space in questionnaires and perceived respondent burden. Parsimony is thus important, and research decisions are often made to include an instrument in preference to another on this ground. The current study addressed this issue with respect to assessing participant self-report HRQoL through examining the descriptive system of the AQoL instrument with the express purpose of reducing its length. The result, the AQoL-8 instrument, has just eight items yet retains the structure and measurement properties of the full AQoL.

This study incorporated statistical modeling, based on reanalysis of two preexisting databases. It is possible that a better descriptive system could have been developed from the original AQoL item bank, which had 47 items in it [1], or from the development of new items. Similarly, revision of the existing AQoL items may also have provided a better model. Nevertheless, each of these options would have resulted in the need to collect new TTO weights for the descriptive system, a procedure that was beyond the current study. Furthermore, the AQoL is used in numerous studies across Australia and other countries, and a major change of the questionnaire would prevent cross-study comparison and benchmarking to Australian norms.

The AQoL-8 retains fidelity to the full AQoL utility scoring system; indeed, it uses the same multiplicative function and weights. The removal of items and prorated substitute values, however, implies that within each of the four dimensions, there may be an element of double-counting (i.e., counting a health state more than once). The empirical findings of the study in Figures 2 and 3 suggest that any such double-counting does not materially impact upon the computed utilities.

There is, however, an important corollary. The full AQoL instrument has three items per dimension; as such, it meets the requirement of three items for each dimension score to be separately reported as a reliable scale score [57]. The data presented in Tables 1 and 2 suggest that although the reliability of the full descriptive systems is appropriate, the homogeneity of both the AQoL-8 and AQoL is such that they form weak scales. The reason for this, including the low Loevinger H_i values for several items, is that the AQoL descriptive system was designed to capture representations across all dimensions of life contributing to HRQoL [1]. This design feature implies that there will always be some degree of independence among the items because of the breadth of measurement: it should not be expected that the descriptive system will be strongly homogeneous. The AQoL-8 retains this breadth of measurement and has two items per dimension, which falls short of the requirement for stable scale scores. Researchers need to be aware of this limitation in making the decision whether to report dimension scores. In general, this limitation suggests that it is preferable to report the utility score. If additional information is sought, then reporting of item scores would be appropriate.
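To make the notion of scale homogeneity concrete, the following sketch computes a scale-level Loevinger H as the ratio of summed inter-item covariances to their maximum possible values given the item marginals. It is a simplification under stated assumptions: items are treated as dichotomous (0/1), whereas the AQoL items are polytomous, so this is not the computation used in the study. The function names are hypothetical, and the labels follow the conventional Mokken cut-offs (below 0.3 unscalable, 0.3 to 0.4 weak, 0.4 to 0.5 medium, 0.5 and above strong).

```python
from itertools import combinations


def loevinger_h(responses):
    """Scale-level Loevinger H for dichotomous (0/1) items.

    responses: list of respondents, each a list of 0/1 item scores.
    Assumes every item shows some variation, so the denominator is non-zero.
    """
    n = len(responses)
    n_items = len(responses[0])
    # Proportion of respondents endorsing each item.
    p = [sum(r[i] for r in responses) / n for i in range(n_items)]
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i, j in combinations(range(n_items), 2):
        p_both = sum(r[i] * r[j] for r in responses) / n
        cov_sum += p_both - p[i] * p[j]
        # Largest covariance attainable with these two marginals.
        cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
    return cov_sum / cov_max_sum


def mokken_label(h):
    """Conventional verbal label for a scale-level H coefficient."""
    if h < 0.3:
        return "unscalable"
    if h < 0.4:
        return "weak"
    if h < 0.5:
        return "medium"
    return "strong"


# Hypothetical data: a perfect Guttman pattern, then one inconsistent respondent.
guttman = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
noisy = guttman + [[0, 0, 1]]
h_guttman = loevinger_h(guttman)
h_noisy = loevinger_h(noisy)
print(mokken_label(0.36), mokken_label(0.38))
```

Under these cut-offs, the H values reported for the AQoL (0.36) and the AQoL-8 (0.38) both fall in the "weak" band, which matches the description given in the text.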

Subject to these caveats, the psychometric properties of the AQoL-8 are similar to those of the AQoL: both have weak homogeneity but high reliability, and both have four dimensions with modest homogeneity and reliability (Tables 1 and 2). Beyond these psychometric considerations, both provide utility scores that meet the requirements of MAU theory: the AQoL explicitly through its developmental processes [11] and the AQoL-8 because it directly uses the AQoL scoring procedures.

As shown in Table 3, the AQoL-8 delivers utility scores that are equivalent to those of the AQoL; the ICC between the two instruments was 0.95 in the VVS database and 0.94 in the SAHOS database, suggesting that they are interchangeable measures. The Bland and Altman limits of agreement plot (Fig. 3) shows no particular pattern across the score range, suggesting that the similarities and differences in AQoL-8 and AQoL scores are not a function of HRQoL level. The mean difference between the AQoL-8 and AQoL scores (0.01 for both substudies) is well under the previously reported minimum important difference in AQoL scores (0.06) [58]. The higher explained variance and increased scale homogeneity of the AQoL-8 are a function of its greater simplicity and restricted measurement when compared with the broader measurement of the full AQoL.
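As an illustration of the agreement analysis described above, the sketch below computes the mean difference (bias) and the conventional 95% limits of agreement (bias ± 1.96 SD of the paired differences) for two sets of utility scores. The paired scores are hypothetical and the function name `bland_altman_limits` is an assumption; this is not the study's data or code.

```python
from statistics import mean, stdev


def bland_altman_limits(scores_a, scores_b):
    """Return (bias, lower limit, upper limit) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    # 95% limits of agreement use the sample SD of the differences.
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical paired utilities for the same six respondents.
short_form = [0.85, 0.62, 0.91, 0.40, 0.77, 1.00]
full_form = [0.83, 0.60, 0.92, 0.37, 0.77, 0.98]
bias, lower, upper = bland_altman_limits(short_form, full_form)

# Compare the bias with the minimum important difference cited in the text.
MINIMUM_IMPORTANT_DIFFERENCE = 0.06
print(abs(bias) < MINIMUM_IMPORTANT_DIFFERENCE)
```

In a full analysis the differences would also be plotted against the pairwise means to check for any pattern across the score range, as in Fig. 3.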

The poor Loevinger H-values for both the AQoL-8 and AQoL in the SAHOS sample are caused by the items assessing hearing and communication in the AQoL-8, and by both these and vision in the AQoL. In the VVS database, just 2.5%, 3.6%, and 2.5% of the sample endorsed the worst two response categories (indicating moderate to severe impairment) for vision, hearing, and communication, respectively. When broken down by strata, the endorsements were 4.2%, 5.7%, and 5.3% for inpatients; 3.8%, 4.4%, and 2.5% for outpatients; and 0.3%, 1.5%, and 0.8% for the community sample. In the SAHOS database, the percentages were 0.10%, 2.5%, and 0.9%, respectively. Given that the population prevalence of low vision is 4% [59], and that of moderate hearing impairment is 7.6% in one ear and 2.8% in both ears [60], it would appear that the community samples in both the validation study and the SAHOS severely underrepresent sensory losses. The most probable reason is that participants had to be able to talk with an unexpected interviewer and to read in order to participate in the study, which implies that there would be few cases endorsing major vision, hearing, or communication problems. The poor Loevinger H-values reported in Table 3 for the SAHOS validation sample almost certainly reflect this situation, whereas the H-values in the VVS database are due to the presence of the outpatient and inpatient samples, where interviews were prearranged; importantly, these samples reported sensory losses similar to the known prevalences.

Where there were different scores on the AQoL-8 and AQoL (Figs. 2 and 3), this was primarily due to cases who, on the AQoL No. 7 (Personal relationships), endorsed a level of “4” (indicating that they have no close relationships). Detailed examination of this item, however, showed that it had disordered probability thresholds (Fig. 1) raising the distinct possibility of confounded measurement. It may be that this causes the AQoL utility scores for these cases to be somewhat underestimated. This information provided strong justification for the removal of this item (Table 1, Fig. 1). Difficulties were also observed with all three items on the PS scale. It is likely that the reason for these three items fitting poorly at the instrument level is related to the sampling strategy of the AQoL validation study as discussed above.

The results in Table 4 show that on six common health problems, the AQoL-8 delivered utility scores that were, for all practical purposes, equivalent to those of the AQoL. This is an important finding because it provides evidence that shortening the AQoL descriptive system has not impacted upon its sensitivity. That the scores are so similar also suggests that where the AQoL-8 is used to compute quality-adjusted life-years (QALYs) for use in CUA, the results will be very similar to those of the full AQoL.
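To show how utility scores feed into QALYs, the sketch below integrates utilities over time with the trapezoidal rule, a common area-under-the-curve approach that is assumed here rather than taken from the study. The follow-up times, utilities, and the function name `qalys_trapezoid` are all hypothetical.

```python
def qalys_trapezoid(times_in_years, utilities):
    """QALYs as the area under the utility curve (trapezoidal rule)."""
    if len(times_in_years) != len(utilities):
        raise ValueError("times and utilities must have the same length")
    total = 0.0
    for k in range(1, len(utilities)):
        interval = times_in_years[k] - times_in_years[k - 1]
        total += interval * (utilities[k] + utilities[k - 1]) / 2
    return total


# Hypothetical participant assessed at baseline, 6 months, and 12 months.
times = [0.0, 0.5, 1.0]
utilities_full = [0.60, 0.70, 0.80]
# Same participant with every score shifted by 0.01, the mean
# AQoL-8 versus AQoL difference reported in the text.
utilities_short = [u + 0.01 for u in utilities_full]

qalys_full = qalys_trapezoid(times, utilities_full)
qalys_short = qalys_trapezoid(times, utilities_short)
print(round(qalys_short - qalys_full, 4))
```

Over a one-year horizon, a uniform 0.01 shift in utility moves the total by 0.01 QALYs, which illustrates why score differences of this size would be expected to have little effect on a cost-utility result.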

The AQoL-8 was also directly compared with two previously published short MAUs, the EQ5D and SF6D. The findings (Table 4) suggest that the AQoL-8 performs at least as well as these two short measures but without their known limitations [10,12,15,31,32], limitations that were evident in this study. One point of interest is the proportion of cases obtaining ceiling scores. These varied considerably between 9.8% (SF6D) and 42.5% (EQ5D). Both the AQoL-8 and AQoL instruments sat in the middle of this range, at 29.9% and 19.9%, respectively. Readers should note, however, that these ceiling scores were obtained on a population sample. Lower proportions of cases with ceiling scores can be expected from patient samples as reported above, where for inpatients, the proportions on the AQoL-8 and AQoL were 6.5% and 3.0%, respectively, both of which were well under the 15% standard for patient samples proposed by McHorney et al. [61]. The higher proportion of ceiling cases on the AQoL-8 than on the AQoL reflects its simpler descriptive system.
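A ceiling effect of the kind discussed above can be checked with a few lines of code. The sketch below counts the share of respondents scoring at the maximum utility (1.00) and compares it with the 15% standard cited in the text for patient samples; the sample scores and function names are hypothetical.

```python
def ceiling_percentage(scores, ceiling=1.0, tolerance=1e-9):
    """Percentage of scores at the ceiling value."""
    at_ceiling = sum(1 for s in scores if s >= ceiling - tolerance)
    return 100.0 * at_ceiling / len(scores)


def exceeds_ceiling_standard(scores, threshold_pct=15.0):
    """True if the ceiling share is above the chosen standard."""
    return ceiling_percentage(scores) > threshold_pct


# Hypothetical utility scores for ten respondents.
sample_scores = [1.0, 0.83, 1.0, 0.56, 0.92, 0.71, 0.40, 0.88, 0.95, 0.67]
pct = ceiling_percentage(sample_scores)
print(pct, exceeds_ceiling_standard(sample_scores))
```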

The value of short outcome instruments, such as the AQoL-8, is that they impose lower demands on study participants. This is particularly important where study populations include patients with severe health conditions or those with limited cognitive capacity. Short measures also enable the collection of a broader range of outcome data (in the current context, the collection of utility data for economic evaluation alongside clinical outcomes), and they reduce research costs. The AQoL-8 was developed in response to requests from hospitals for a shorter measure than the full AQoL to reduce the administrative burden on clinicians and to lower the response burden on patients.

Conclusion

This study addresses difficulties that can arise for public health, health services, and clinical researchers when deciding on a set of parsimonious outcome measures. The development of the AQoL-8 from the AQoL using IRT analysis, statistical modeling, and comparison with external standards has resulted in a shorter but robust measure. The results indicate that the reliability of the AQoL-8 is high (0.80), and that the AQoL-8 provides utility scores that are, for all practical purposes, interchangeable with those from the full AQoL. Additionally, tests of sensitivity showed that the AQoL-8 was as sensitive to six common health conditions as the AQoL and also as the popular EQ5D and SF6D. The AQoL-8 utility scores fall on the same life–death scale as those of the AQoL, and the instrument may be used either to describe participants' HRQoL or in the calculation of QALYs for CUA.

The AQoL-8 meets the requirements for a short HRQoL MAU-instrument, and may be considered by researchers who are confronted with the dilemma of trying to collect outcome data in health studies where parsimony is imperative.

This study was supported by a grant from Northern Health, Victoria, Australia. The AQoL validation study was funded by a grant from the Victorian Health Promotion Foundation, and the 2004 SAHOS survey by a grant from the Australian Commonwealth Department of Health and Ageing. None of these organizations participated in the design of the study, the analysis of the data, or in the preparation of the article.

I would like to thank A/Professor Richard Osborne for his valuable comments on the article, and also the extremely insightful comments of the reviewers.

