Deriving an Algorithm to Convert the Eight Mean SF-36 Dimension Scores into a Mean EQ-5D Preference-Based Score from Published Studies (Where Patient Level Data Are Not Available)

Authors


Roberta Ara, Health Economics and Decision Science, ScHARR, The University of Sheffield, 30 Regent Street, Sheffield S1 4DA UK. E-mail: r.m.ara@sheffield.ac.uk

ABSTRACT

Objective:  The objective of the study was to derive a method to predict a mean cohort EQ-5D preference-based index score using published mean statistics of the eight dimension scores describing the SF-36 health profile.

Methods:  Ordinary least squares regression models were derived using patient level data (n = 6350) collected during 12 clinical studies. The models were compared for goodness of fit using standard techniques such as variance explained, the magnitude of errors in predicted values, and the proportion of values within the minimal important difference of the EQ-5D. Predictive abilities were also compared using summary statistics from both within-sample subgroups and published studies.

Results:  The models obtained explained more than 56% of the variance in the EQ-5D scores. The mean predicted EQ-5D score was correct to within two decimal places for all models and the mean absolute error for the individual predicted values was approximately 0.13. Using summary statistics to predict within-sample subgroup mean EQ-5D scores, the mean errors (mean absolute errors) ranged from 0.021 to 0.077 (0.045–0.083). These statistics for the out-of-sample published data sets ranged from 0.048 to 0.099 (0.064–0.100).

Conclusions:  The models provided researchers with a mechanism to estimate EQ-5D utility data from published mean dimension scores. This research is unique in that it uses mean statistics from published studies to validate the results. While further research is required to validate the results in additional health conditions, the algorithms can be used to derive additional preference-based measures for use in economic analyses.

Introduction

The quality-adjusted life-year provides a common metric by which the length of survival is combined with quality of life (QoL) [1]. It is advocated as the preferred measure of benefit for economic evaluations in health care because the results can be used by decision-makers to prioritize and allocate scarce health-care resources [2]. To facilitate comparisons across economic evaluations, policy decision-makers such as the National Institute for Health and Clinical Excellence (NICE) recommend a generic preference-based measure to value health-related quality of life (HRQoL), and NICE suggests the EQ-5D [2]. Nevertheless, there are three pivotal problems associated with the current recommendations. First, although the HRQoL literature continues to grow, there are still a large number of health conditions which cannot be informed by current evidence. Second, a great deal of the published data describes HRQoL in terms of health profiles described by either disease-specific questionnaires or nonpreference-based instruments. Third, the published preference-based evidence may be derived from different instruments.

The underlying concepts of the preference-based indexes may be similar in that a score of one represents perfect health; zero represents death; and negative values represent health states considered worse than death. Nevertheless, fundamental differences, such as possible range, sensitivity to changes, and different preference-base weights, can produce wide variances in the scores produced by the different instruments for the same health state [3]. The choice of instrument used could have a substantial impact on policy decisions.

The most frequently used generic HRQoL questionnaires are the SF-36, EQ-5D, and Health Utility Index (HUI) [4–6]. While all have algorithms, which can be used to derive general population weighted preference-based utility indexes, these require access to individual patient level data [7–9]. In addition, as the number of preference-based instruments grows, it becomes increasingly important to be able to translate from one index to another. Current literature provides a limited volume of evidence which can be used to map from generic questionnaires to alternative preference-based utility indexes.

Older studies [10–13] concentrated on mapping the eight SF-36 dimension scores onto the preference-based HUI2 [7] or the Quality of Well-Being scale [14]. The majority of recent literature presents relationships between the two SF-12 summary scores [15] and the HUI3 [16–18] or the EQ-5D [17,19–22]. To our knowledge, there is no evidence that can be used to translate the SF-36 dimension scores onto the preference-based EQ-5D utility index.

The objective of the current study was to derive a method to predict a mean cohort EQ-5D preference-based index score using published mean statistics from the eight dimension scores of the SF-36 health profile. The results will expand the volume of published HRQoL evidence that can be used in economic evaluations and will facilitate comparability of the cost-effectiveness ratios generated using different preference-based measures. This research is unique in that it is the first publication to examine the relationship between the full SF-36 health profile and the EQ-5D preference-based measure. It is also the first study to validate the results of the research using mean statistics obtained from the literature.

QoL Instruments

The SF-36

The SF-36 is a generic health questionnaire which contains 36 items [5]. The health profile generated from the responses covers eight dimensions: physical functioning (PF), role limitations due to physical problems (RP), role limitations due to emotional problems (RE), pain (BP), general health perception (GH), vitality (VT), mental health (MH), and social function (SF). With each dimension consisting of several items coded 1, 2, or 3, the responses are summed to produce a raw dimension score (10–30), which is transformed onto a 0 to 100 scale where 0 and 100 represent severe impairment and no impairment, respectively.
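The rescaling onto 0 to 100 described above is a linear transformation of the raw sum. The following is a minimal Python sketch, assuming the usual form (raw score minus lowest possible score, divided by the possible range, times 100); the function name is illustrative and the example uses the raw range quoted in the text.

```python
def transform_dimension_score(raw, raw_min, raw_max):
    """Rescale a raw SF-36 dimension sum onto 0-100 (illustrative helper)."""
    if not raw_min <= raw <= raw_max:
        raise ValueError("raw score outside the possible range")
    return (raw - raw_min) / (raw_max - raw_min) * 100.0

# Example with the raw range quoted in the text (10-30)
print(transform_dimension_score(20, 10, 30))  # 50.0
```

A raw score at the bottom of the range maps to 0 (severe impairment) and one at the top maps to 100 (no impairment).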

The EQ-5D

The weights for the EQ-5D preference-based index used in the current study were obtained from the UK general public using the time trade-off method [9]. The health states valued were sampled from the 243 possible health states derived from the three levels (no problems, some problems, extreme problems/unable) of each of the five dimensions (mobility, self-care, usual activities, pain or discomfort, and anxiety or depression) covered in the questionnaire. The health state valuations range from no problems on any dimension (EQ-5D = 1.0) to most severe impairment on all five (EQ-5D = −0.594).

Data Sets

Data Sets Used in the Regressions and Within-Sample Validation

Individual patient level data (n = 6350) pooled from 12 studies used in previous research in the School of Health and Related Research was used to explore the relationship between the dimension scores and the utility index. Collected during observational studies and randomized controlled trials, the data covers a wide range of health conditions including asthma [23], chest pain [24], healthy older women [25], chronic obstructive pulmonary disease [26], menopausal women [27], irritable bowel syndrome [28], trauma [29], lower back pain [30], leg reconstruction [31], leg ulcers [32], osteoarthritis [33], and varicose veins [34].

Out-of-Sample Data Sets Used to Validate the Models

Out-of-sample published data sets used to validate the models were identified by searching the MEDLINE database. Studies were retained if they reported the following statistics: 1) the eight mean dimension scores derived from the SF-36 questionnaire and 2) the mean EQ-5D preference-based utility score derived from the UK population weights.

Methods

While the objective is to estimate mean EQ-5D scores using published statistics of the SF-36 health profile, the regressions are constructed from individual patient level data.

Statistical Methods

The EQ-5D index is regressed onto the eight health dimension scores using ordinary least squares (OLS) regressions. The general model is defined as:

$$EQ_i = \alpha + \beta' x_i + \gamma' d_i + \delta' r_i + \varepsilon_i$$

whereby EQ represents the EQ-5D preference-based index, x represents the vector of main effects (PF, RP, BP, GH, VT, SF, RE, MH), d represents the vector of demographic variables (age and sex), r represents the vector of main effect squares, α is the intercept, β, γ, and δ are the corresponding coefficient vectors, i represents individual respondents, and ε represents the stochastic error term of the regression (the residual).
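The structure of the general model can be sketched in Python with NumPy. This is an illustration only: the data are synthetic, the "true" coefficients are invented to simulate an outcome, and the helper name `design_matrix` is an assumption; the study itself was fitted in STATA on patient level data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
dims = ["PF", "RP", "BP", "GH", "VT", "SF", "RE", "MH"]

# Synthetic stand-in data (NOT the study data): scores on 0-100, age, sex
x = rng.uniform(0, 100, size=(n, len(dims)))
age = rng.uniform(18, 90, size=n)
male = rng.integers(0, 2, size=n).astype(float)
d = np.column_stack([age, male])
r = x ** 2  # squares of the main effects

def design_matrix(x, d, r):
    """Columns: intercept, main effects x, demographics d, squares r."""
    return np.column_stack([np.ones(len(x)), x, d, r])

X = design_matrix(x, d, r)

# Invented coefficients, used only to simulate an outcome variable
true_coef = np.concatenate(
    [[0.05], np.full(8, 0.002), [-0.0005, 0.0], np.full(8, -0.00001)]
)
eq = X @ true_coef + rng.normal(0, 0.05, size=n)

# OLS estimate: minimise the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, eq, rcond=None)
resid = eq - X @ coef
r_squared = 1 - resid.var() / eq.var()
print(coef[0], r_squared)
```

Dropping columns from `r` or `d` gives the restricted variants (main effects only, main effects plus age, and so on) that are compared in the Results.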

The models were constructed in STATA (Release 10, StataCorp., College Station, TX) using backward and forward eliminations to select significant squared terms. White's robust standard errors were used to minimize the likelihood of incorrect inferences from the statistical tests [35]. Collinearity was assessed using the variance inflation factor (VIF). A suggested rule of thumb is that correlations must be stronger than |0.70| to be a problem [36]. Nevertheless, statistical theory suggests multicollinearity is only likely to become a problem if there is perfect correlation between independent variables (IVs) in conjunction with a small number of observations or a large variance in the IVs [37]. The Cook–Weisberg and Shapiro–Wilk tests were used to test for heteroscedasticity and normality of the residuals, respectively.
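The two diagnostics named above can be sketched in Python with NumPy. This is a hedged illustration on synthetic data, not the STATA routines used in the study: `white_robust_se` implements the basic White (HC0) sandwich estimator without the finite-sample correction that statistical packages often apply, and `vif` uses the identity that the VIFs are the diagonal of the inverse correlation matrix of the predictors. Both function names are assumptions.

```python
import numpy as np

def white_robust_se(X, y):
    """OLS coefficients with White (HC0) heteroscedasticity-robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = X.T @ (X * (e ** 2)[:, None])  # X' diag(e^2) X
    cov = XtX_inv @ meat @ XtX_inv        # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

def vif(Z):
    """VIF for each column of Z (predictors only, no intercept column)."""
    corr = np.corrcoef(Z, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example (NOT study data): two correlated predictors
rng = np.random.default_rng(2)
n = 1000
z1 = rng.normal(size=n)
z2 = 0.8 * z1 + 0.6 * rng.normal(size=n)
Z = np.column_stack([z1, z2])
X = np.column_stack([np.ones(n), Z])
# Noise whose spread grows with |z1|, i.e. heteroscedastic errors
y = 1.0 + 2.0 * z1 + 0.5 * z2 + rng.normal(size=n) * (0.5 + np.abs(z1))

beta, se = white_robust_se(X, y)
print(beta, se, vif(Z))
```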

Comparing Models

Goodness of fit for the models obtained was assessed by examining the scatter plots of the observed versus the predicted indexes, the range of predicted values, the mean error (ME), mean absolute error (MAE), root mean squared error (RMSE), and the number of errors smaller than 0.05 (0.025, 0.01) in absolute value. The R2 statistic quantified the explanatory power of the model, i.e., how much of the variability in the dependent variable is captured by the predictors used. The models' predictive abilities were also compared using standard descriptive statistics (mean, SD, max, min) for the individual estimates. The minimal important difference (MID), defined as the smallest difference in score which patients perceive as beneficial, was also used [38]. Because the objective of the study was to predict mean preference-based measures using summary statistics, cohort mean EQ-5D scores were generated for both within-sample subgroups and out-of-sample data sets, and the results were compared using the statistics described above.
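The error statistics listed above are straightforward to compute. The sketch below is an illustration in Python with NumPy; the function name and dictionary keys are assumptions, the error is taken as predicted minus actual (the sign convention implied by Table 4), and the example values are the actual and model EQ (1) predicted means transcribed from Table 4.

```python
import numpy as np

MID = 0.074  # EQ-5D minimal important difference, as quoted in the Results

def error_summary(actual, predicted, thresholds=(0.05, 0.025, 0.01)):
    """Summarise prediction errors; error = predicted - actual (assumed sign)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    abs_err = np.abs(err)
    summary = {
        "ME": err.mean(),
        "MAE": abs_err.mean(),
        "RMSE": np.sqrt((err ** 2).mean()),
        "within_MID": (abs_err < MID).mean(),
    }
    for t in thresholds:
        summary[f"within_{t}"] = (abs_err < t).mean()
    return summary

# Actual and model EQ (1) predicted means, transcribed from Table 4
# (Total row followed by the 12 health-condition subgroups)
actual = [0.713, 0.741, 0.786, 0.610, 0.537, 0.766, 0.752,
          0.573, 0.540, 0.498, 0.557, 0.363, 0.757]
predicted = [0.714, 0.722, 0.699, 0.645, 0.518, 0.786, 0.776,
             0.546, 0.592, 0.569, 0.609, 0.490, 0.805]

stats = error_summary(actual, predicted)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Run on these 13 rows, the function gives an ME of about 0.021 and an MAE of about 0.045, in line with the summary rows reported for model EQ (1) in Table 4.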

Results

Within-Sample Data

Of the total 6350 individual respondents, 40% were male and the mean age was 52 years. The EQ-5D preference-based utility score was negatively skewed (skew = −1.4) with mean 0.71 (SD = 0.28) and median 0.76. The data covered the maximum possible range (−0.59 to 1) with 1512 (24%) individuals scoring the maximum value (EQ-5D = 1) and 241 (4%) rating their HRQoL as worse than death (EQ-5D < 0). All eight dimension scores were significantly correlated (Table 1) with the EQ-5D, with PF having the strongest relationship (Pearson's correlation 0.65, P < 0.001).

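Table 1.  Summary statistics of the data sets used in the regressions and the within-sample subgroup validations

| Health condition/study subgroup | N | Age mean | M % | PF | SF | RP | RE | MH | VT | BP | GH | EQ-5D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 6350 | 51.9 | 40 | 67.6 | 68.2 | 57.6 | 67.0 | 68.3 | 51.5 | 63.5 | 58.7 | 0.71 |
| Asthma | 2730 | 47.7 | 40 | 68.0 | 64.0 | 60.1 | 65.8 | 67.9 | 49.6 | 68.4 | 53.0 | 0.74 |
| Chest pain | 621 | 50.3 | 62 | 73.5 | 71.7 | 49.8 | 63.8 | 66.2 | 50.4 | 50.4 | 57.9 | 0.79 |
| Healthy older women | 250 | 79.6 | 0 | 48.1 | 75.2 | 42.6 | 60.5 | 71.6 | 51.7 | 58.3 | 58.5 | 0.61 |
| Chronic obstructive pulmonary disease | 94 | 63.9 | 53 | 27.7 | 48.6 | 18.6 | 45.0 | 66.1 | 35.0 | 55.8 | 31.5 | 0.54 |
| Menopausal women | 681 | 53.5 | 0 | 79.7 | 81.4 | 71.5 | 73.8 | 67.6 | 51.8 | 67.7 | 65.6 | 0.77 |
| Irritable bowel syndrome | 340 | 46.5 | 15 | 79.9 | 78.4 | 66.8 | 69.7 | 68.1 | 51.8 | 65.0 | 63.7 | 0.75 |
| Trauma | 151 | 56.6 | 60 | 41.1 | 50.0 | 16.3 | 45.0 | 61.8 | 59.1 | 49.8 | 69.7 | 0.57 |
| Lower back pain | 126 | 42.3 | 37 | 58.0 | 62.5 | 25.0 | 64.3 | 68.1 | 46.3 | 31.2 | 62.8 | 0.54 |
| Leg reconstruction | 82 | 34.3 | 71 | 41.6 | 59.5 | 38.4 | 56.9 | 68.9 | 54.6 | 48.2 | 61.8 | 0.50 |
| Leg ulcer | 232 | 73.4 | 34 | 43.5 | 66.6 | 50.5 | 66.2 | 69.6 | 53.3 | 56.0 | 64.6 | 0.56 |
| Osteoarthritis | 194 | 66.8 | 40 | 24.4 | 52.4 | 12.2 | 41.2 | 62.7 | 58.4 | 53.2 | 50.8 | 0.36 |
| Varicose veins | 849 | 50.4 | 29 | 82.7 | 73.2 | 75.7 | 82.6 | 73.3 | 57.0 | 68.7 | 70.4 | 0.76 |
| Pearson correlation with EQ-5D using individual patient level data (P < 0.001) | | | | | | | | | | | | |
| Full data set | 6350 | | | 0.65 | 0.57 | 0.52 | 0.45 | 0.45 | 0.46 | 0.60 | 0.49 | 1 |
| EQ-5D ≥ 0.5 | 5502 | | | 0.60 | 0.47 | 0.47 | 0.37 | 0.40 | 0.46 | 0.57 | 0.43 | |
| EQ-5D < 0.5 | 848 | | | 0.41 | 0.34 | 0.38 | 0.24 | 0.18 | 0.25 | 0.44 | 0.38 | |
| Pearson correlation with EQ-5D using mean cohort values (P < 0.001) | | | | 0.92 | 0.74 | 0.85 | 0.77 | −0.05* | 0.41* | 0.58 | 0.32* | 1 |

  1. Values marked * are not statistically significant (P > 0.05).

  2. M, male; PF, physical functioning; SF, social function; RP, role limitations due to physical problems; RE, role limitations due to emotional problems; MH, mental health; VT, vitality; BP, pain; GH, general health perception.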

When grouped by condition (Table 1), the mean EQ-5D scores ranged from 0.36 for the cohort with osteoarthritis to 0.79 for the cohort suffering from chest pain. When using the subgroup mean scores, the correlations between the health dimensions MH, VT, GH, and the EQ-5D index were not statistically significant (P > 0.05).
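The cohort-level correlations treat each of the 12 subgroups as a single observation. As a hedged illustration, the Python sketch below recomputes the PF correlation from the subgroup means transcribed from Table 1 (which reports 0.92 for PF); the variable names are illustrative.

```python
import numpy as np

# Cohort means for the 12 within-sample subgroups, transcribed from Table 1
pf_mean = [68.0, 73.5, 48.1, 27.7, 79.7, 79.9, 41.1, 58.0, 41.6, 43.5, 24.4, 82.7]
eq5d_mean = [0.74, 0.79, 0.61, 0.54, 0.77, 0.75, 0.57, 0.54, 0.50, 0.56, 0.36, 0.76]

# Pearson correlation across cohorts (each subgroup is one observation)
r = np.corrcoef(pf_mean, eq5d_mean)[0, 1]
print(round(r, 2))
```

With only 12 observations, moderate correlations can fail to reach significance, which is consistent with the nonsignificant results noted for MH, VT, and GH.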

Out-of-Sample Data Sets

A total of 22 published studies (Table 2) provided 63 sets of data, which can be used to predict mean EQ-5D scores using the eight mean health dimension scores. Eleven studies provided data sets (n = 31) that can be used to compare incremental scores between two or more subgroups, and 11 studies provided data sets (n = 24) that can be used to predict changes over time. The studies covered a wide range of health conditions including Anderson Fabry disease, asthma, coronary heart disease (CHD), claudication, dialysis, depression, diabetes, dry eye, femoral neck fracture, focal dystonia, hemophilia, hip replacements, liver transplant, lower back pain, osteoporosis, pain, psoriasis, renal transplantation, stroke, and walking impairment. Further details of the studies are available from the authors on request.

Table 2.  Summary statistics of the out-of-sample data sets used in validation
Health conditionStudy armTimeNAge MeanM %Mean dimension score
PFSFRPREMHVTBPGHEQ-5D
  1. Values assumed when predicting EQ-5D values are 50% male and age = 50 years.

  2. M, male; n/a, not available; PF, physical functioning; SF, social function; RP, role limitations due to physical problems; RE, role limitations due to emotional problems; MH, mental health; VT, vitality; BP, pain; GH, general health perception; UA, unstable angina; MI, myocardial infarction; CHD, coronary heart disease; S1, study arm 1; S2, study arm 2; S3, study arm 3.

Renal transplantationS1Baseline3505259.773.178.861.873.863.371.673.760.90.80
Anderson FabryS1Baseline383710065.657.053.956.141.360.755.837.60.56
HemophiliaS1Baseline6638 53.870.458.174.955.072.957.746.80.66
S2Baseline10046 67.979.981.286.161.474.076.864.10.85
StrokeS1Baselinen/a725219.041.08.048.042.070.064.055.00.31
S16 months9867 43.062.034.069.052.078.072.059.00.62
UA/MIS14 months895n/an/a63.073.451.772.052.372.463.459.50.748
S112 months n/an/a62.576.957.074.952.974.365.259.40.752
S24 months915n/an/a59.369.744.967.147.770.661.754.40.714
S212 months n/an/a61.072.952.472.450.372.564.155.40.736
ClaudicationS1Baseline88587241.067.030.063.054.073.050.056.00.57
S11 months84  73.076.059.073.063.076.072.061.00.79
S13 months84  74.080.063.069.065.078.074.060.00.77
S112 months72  73.083.064.077.061.075.074.058.00.75
DepressionS1Baseline250442869.030.222.49.122.224.552.038.30.33
CHDS1Baseline7867n/a62.673.956.870.261.576.780.153.30.69
Lower back painS1Baseline37503537.343.614.545.238.459.130.148.20.38
S13 months  3535.153.112.550.038.761.128.647.60.332
S16 months  3541.554.620.448.141.962.937.048.60.56
MIS11.5 months229627560.046.614.742.842.966.066.455.50.683
S112 months229637562.963.146.163.450.271.370.655.20.718
S212 months2296310066.564.650.666.252.072.672.055.80.735
S312 months22965049.958.237.053.941.867.265.452.50.66
Walking impairmentS1Baseline27677867.10.840.278.232.474.151.776.00.66
S2Baseline33677867.10.842.576.943.274.745.374.10.61
DiabetesS1Baseline326615661.40.674.483.052.270.956.878.60.81
S2Baseline889665865.70.648.166.620.150.239.171.50.63
S3Baseline1215656165.10.645.054.013.034.035.564.60.52
S4Baseline 655865.50.623.838.74.317.926.754.00.25
Low back painS1Baseline393444943.90.549.157.225.358.946.664.90.48
S2Baseline389434942.80.545.353.423.049.742.162.00.44
S18 months   44.60.554.164.537.460.746.565.50.56
S28 months   43.50.550.458.532.955.340.662.30.53
S124 months   45.90.556.466.444.061.746.264.30.60
S224 months   44.80.552.861.838.255.842.762.90.54
Hip replacementS1Baseline 683867.60.422.055.711.051.755.173.50.35
S112 months 693868.60.460.779.255.368.768.479.40.76
Dry eyeS1Baseline 392739.20.382.391.381.689.766.483.60.87
S2Baseline 552155.20.280.185.372.382.156.475.50.82
S3Baseline 58958.30.169.567.241.774.944.976.40.74
Femoral neck fractureS1Baseline 791979.20.264.477.351.083.365.982.10.82
S2Baseline 811980.80.264.278.545.278.061.276.50.85
S14 months   79.20.249.067.620.754.660.076.40.73
S24 months   80.80.243.565.223.446.848.168.30.60
PainS1Baseline 51.85024.339.8036.429.145.816.540.40.08
S2Baseline 58.45025.540.9039.428.64814.5430.07
S3Baseline 51.25029.139.3033.229.643.313.146.80.03
S17 days   17.340.9036.432.349.122.4320.11
S27 days   27.752.3054.637.760.634.446.90.12
S37 days   43.280.7087.958.174.566.645.70.61
HemodialysisS1Baseline 61.85645.660.42456.542.771.159.341.60.60
S2Baseline2360.29133.446.215.239.133.362.851.4300.45
S3Baseline10562.24948.363.62660.344.87361.142.20.65
PsoriasisS1Baseline37496075.771.876.467.652.967.965.3630.72
Liver transplantS1Baseline160535144.1945.3413.3237.783062.0656.4431.380.53
S13 months160  51.6663.5624.2765.7947.169.8357.4557.390.62
Focal dystoniaS1Baseline5059.23063.667.95361.254.862.260.547.80.59
6 weeks50  70.877.46375.863.270.964.251.30.66
12 weeks50  59.668.44959.853.666.460.848.30.63
S2Baseline5059.23073.874.452.075.955.662.755.753.20.60
6 weeks50  74.683.064.078.564.874.674.252.30.76
12 weeks50  74.270.946.073.255.059.057.549.00.66
Liver transplantS1Baseline 513869785969557370660.75
Pearson correlation with EQ-5D (P < 0.001)     0.820.850.810.760.790.790.890.611

The mean EQ-5D scores ranged from 0.03 for a cohort (n = 11) with central neuropathic pain to 0.87 for a cohort (n = 48) with dry eye. The incremental difference between different cohorts within the same study ranged from 0.01 for individuals enrolled in randomized controlled trials (RCTs) to 0.56 for diabetics subgrouped using the self-assessed neuropathy total symptom score (six domains).

Goodness of Fit and Accuracy of Models

Table 3 shows the results of the individual regression analyses using the eight health dimension scores and their squares. The explanatory power of the model using just the eight main effects (EQ [1]) was 56%, and the MAE and RMSE for the individual predictions were 0.134 and 0.183, respectively. The eight main effects were all statistically significant (P < 0.05) predictors. While the majority of the coefficients are positive, demonstrating the positive relationship with HRQoL, the coefficients for the RP and VT variables are negative. Although this might suggest collinearity with other variables, the VIFs were small at 2.34 and 2.39.

Table 3.  Prediction models using the main effects with and without significant demographics and squared terms (using individual patient level data)

| beta (se) | Model EQ (1) | Model EQ (2) | Model EQ (3) | Model EQ (4) | Model EQ (5) | Model EQ (6) | Model EQ (7) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 0.03256 (0.0122) | 0.06527 (0.0153) | 0.07673 (0.0146) | −0.18105 (0.0272) | −0.18209 (0.0273) | −0.16978 (0.0271) | −0.16889 (0.02707) |
| PF | 0.00370 (0.0001) | 0.00352 (0.0002) | 0.00339 (0.0002) | 0.00781 (0.0005) | 0.00774 (0.0005) | 0.00771 (0.00048) | 0.00779 (0.00048) |
| SF | 0.00111 (0.0002) | 0.00117 (0.0002) | 0.00102 (0.0002) | 0.00213 (0.0006) | 0.00216 (0.0006) | 0.00214 (0.00059) | 0.00211 (0.00059) |
| RP | −0.00024 (0.0001) | −0.00027 (0.0001) | | | −0.00006 (0.0001) | −0.00006 (0.00008) | |
| RE | 0.00024 (0.0001) | 0.00023 (0.0001) | 0.00017 (0.0001) | 0.00022 (0.0001) | 0.00023 (0.0001) | 0.00023 (0.00009) | 0.00022 (0.00008) |
| MH | 0.00256 (0.0002) | 0.00260 (0.0002) | 0.00239 (0.0002) | 0.00599 (0.0008) | 0.00607 (0.0008) | 0.00609 (0.00077) | 0.00601 (0.00077) |
| VT | −0.00063 (0.0002) | −0.00060 (0.0002) | | | −0.00034 (0.0002) | −0.00034 (0.00018) | |
| BP | 0.00286 (0.0001) | 0.00286 (0.0001) | 0.00270 (0.0001) | 0.00472 (0.0006) | 0.00479 (0.0006) | 0.00488 (0.00055) | 0.00480 (0.00055) |
| GH | 0.00052 (0.0002) | 0.00056 (0.0002) | 0.00036 (0.0001) | 0.00064 (0.0001) | 0.00073 (0.0002) | 0.00076 (0.00016) | 0.00067 (0.00015) |
| Age | | −0.00058 (0.0002) | −0.00055 (0.0002) | −0.00069 (0.0002) | −0.00069 (0.0002) | −0.00100 (0.00020) | −0.00100 (0.00020) |
| PF*PF | | | | −0.00004 (0.0000) | −0.00004 (0.0000) | −0.00004 (0.00000) | −0.00004 (0.00000) |
| SF*SF | | | | −0.00001 (0.0000) | −0.00001 (0.0000) | −0.00001 (0.00000) | −0.00001 (0.00000) |
| MH*MH | | | | −0.00003 (0.0000) | −0.00003 (0.0000) | −0.00003 (0.00001) | −0.00003 (0.00001) |
| BP*BP | | | | −0.00001 (0.0000) | −0.00001 (0.0000) | −0.00002 (0.00000) | −0.00002 (0.00000) |
| Age*age | | | | | | 0.000001 (0.00000) | 0.000001 (0.00000) |
| R2 | 0.5630 | 0.5638 | 0.5619 | 0.5852 | 0.5856 | 0.5868 | 0.5864 |
| ME | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| MAE | 0.1338 | 0.1336 | 0.1340 | 0.1305 | 0.1305 | 0.1299 | 0.1299 |
| RMSE | 0.1832 | 0.1829 | 0.1833 | 0.1784 | 0.1783 | 0.1780 | 0.1781 |
| Abs. error < 0.05, n (%) | 1817 (29%) | 1808 (29%) | 1817 (29%) | 1575 (25%) | 1573 (25%) | 1569 (25%) | 1567 (25%) |
| Abs. error < 0.025, n (%) | 960 (15%) | 966 (15%) | 954 (15%) | 729 (12%) | 736 (12%) | 717 (11%) | 712 (11%) |
| Abs. error < 0.01, n (%) | 405 (6%) | 415 (7%) | 415 (7%) | 314 (5%) | 295 (5%) | 279 (4%) | 280 (4%) |

| EQ-5D | Actual | Predicted EQ (1) | Predicted EQ (2) | Predicted EQ (3) | Predicted EQ (4) | Predicted EQ (5) | Predicted EQ (6) | Predicted EQ (7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 0.713 | 0.714 | 0.715 | 0.715 | 0.713 | 0.713 | 0.715 | 0.715 |
| SD | 0.277 | 0.208 | 0.208 | 0.208 | 0.212 | 0.212 | 0.212 | 0.212 |
| Min | −0.594 | 0.033 | 0.040 | 0.053 | −0.211 | −0.212 | −0.210 | −0.209 |
| Max | 1.000 | 1.052 | 1.059 | 1.057 | 0.991 | 0.989 | 0.996 | 0.997 |
| Range | 1.594 | | | | | | | |

  1. Bold values not significant (P > 0.05).

  2. BP, pain; GH, general health perception; MAE, mean absolute error; ME, mean error; MH, mental health; PF, physical functioning; RE, role limitations due to emotional problems; RMSE, root mean squared error; RP, role limitations due to physical problems; SF, social function; VT, vitality.

Plotting the residuals, observed and predicted EQ-5D data (Fig. 1) shows that model EQ (1) underpredicts at higher levels of utility and overpredicts at lower levels, as is commonly seen in HRQoL regressions. The Cook–Weisberg test (χ2 = 2445, P < 0.001) suggests the residuals are heteroscedastic, while the larger errors at the lower end of the utility scale (Fig. 1) show that the residuals are not normally distributed (Shapiro–Wilk test z = 13, P < 0.0001). Looking at the observed maximum EQ-5D (24% = 1) data points, the ME in the predicted values was 0.10 (range −0.5 to 0.70) with 551 (31%) of values underpredicted by at least 0.1. Conversely, looking at the lower quartile (EQ-5D ≤ 0.62), the ME in the predicted values was −0.14 (range −1.16 to 0.43) with 834 (55%) of values overpredicted by at least 0.1.

Figure 1.

Residuals, observed and predicted EQ-5D values for model EQ (1).

While sex was not statistically significant (not shown), the inclusion of age as a predictor (model EQ [2]) has little impact on the explanatory power of the model (56%), the MEs, or the accuracy of the predicted values. Removing the variables RP and VT also has little impact on the explanatory power or accuracy of the predicted values (model EQ [3]). When including the statistically significant squared dimension scores (PF, SF, MH, and BP), the explanatory power increases slightly to 58.5% (models EQ [4] and EQ [5]). The MAE and RMSE for both models were 0.131 and 0.178, respectively. The inclusion of the variable age squared as a predictor in addition to significant squared dimension terms (models EQ [6] and EQ [7]) produced models with the highest explanatory power (>58.6%), and smallest MAEs and RMSEs (0.1299 and 0.178, respectively).

When including the significant squared dimension scores, models EQ (4) and EQ (5) predicted the total mean value for the individual level data correct to three decimal places as opposed to two decimal places for the other models. The range (min −0.21 to max 0.997) for the models (EQ [4] to EQ [7]) that include the squared terms was larger than the range (min 0.033 to max 1.059) for the models (EQ [1] to EQ [3]) that do not include the squared terms. The variance in the individual predicted scores was underestimated by all the models with standard deviations ranging from 0.208 (EQ [1] to EQ [3]) to 0.212 (EQ [4] to EQ [7]) in comparison to 0.277 in the observed values.

The relationships between the actual and predicted EQ-5D scores for models EQ (1), EQ (4) and EQ (7) are shown in Fig. 2. Models EQ (4) and EQ (7) generate more negative values than model EQ (1), so the variance in the errors is greater.

Figure 2.

Plot of the predicted (models EQ [1], EQ [4], and EQ [7]) against actual EQ-5D scores.

Accuracy Using Within-Sample Mean Statistics

The primary objective of the study was to predict the mean preference-based EQ-5D utility scores using reported mean dimension scores derived from the responses to the SF-36 questionnaire. Using the mean values from the eight dimension scores to predict the mean EQ-5D score for the full data set (Table 4), models EQ (1), EQ (2), and EQ (3) were much more accurate with MEs of 0.001, than models EQ (4) to EQ (7), which produce MEs greater than 0.067. When subgrouped by health condition, the ME (MAE and RMSE) in the 12 predicted values were approximately 0.021 (0.045 and 0.056) for models EQ (1) to EQ (3) compared with 0.076 (0.082 and 0.093) for models EQ (4) to EQ (7). These statistics show that the models that include the squared terms tend to overpredict the mean values. Models EQ (1) to EQ (3) predicted between 62% and 69% of the scores to within |0.05| as opposed to between 15% and 31% for models EQ (5) and EQ (7), respectively. Model EQ (3) has the largest proportion (54%) of predicted values with errors smaller than |0.025|. Model EQ (1) predicted 85% of values to within the MID of 0.074.
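One arithmetic reason a model with negative squared-term coefficients can overpredict when fed cohort means is that the model evaluated at the mean is not the mean of the individual predictions. This is offered as an illustration rather than a finding of the study. For a single term b·x + c·x², the gap equals −c times the variance of x. The Python sketch below uses the PF coefficients of model EQ (4) from Table 3 with synthetic PF scores; the function name and the simulated cohort are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# PF terms of model EQ (4) from Table 3: linear and squared coefficients
b_pf, c_pf = 0.00781, -0.00004

def pf_contribution(pf):
    """Contribution of PF and PF*PF to the predicted EQ-5D score."""
    return b_pf * pf + c_pf * pf ** 2

# Synthetic PF scores for a hypothetical cohort (NOT study data)
pf = rng.uniform(0, 100, size=5000)

at_mean = pf_contribution(pf.mean())             # plug the cohort mean into the model
mean_of_individual = pf_contribution(pf).mean()  # average the individual predictions

gap = at_mean - mean_of_individual
print(gap, -c_pf * pf.var())  # the two numbers agree
```

Because the gap grows with the spread of scores within a cohort, models without squared terms do not have this particular source of bias when applied to published means.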

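Table 4.  Actual and predicted mean EQ-5D scores using summary statistics for within-sample subgroups

Within sample subgrouped by health condition (n = 13)

Predicted EQ-5D utility

| Data set | Actual EQ-5D | EQ (1) | EQ (2) | EQ (3) | EQ (4) | EQ (5) | EQ (6) | EQ (7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0.713 | 0.714 | 0.715 | 0.715 | 0.781 | 0.779 | 0.781 | 0.782 |
| Asthma | 0.741 | 0.722 | 0.723 | 0.724 | 0.791 | 0.790 | 0.791 | 0.793 |
| Chest pain | 0.786 | 0.699 | 0.699 | 0.698 | 0.749 | 0.749 | 0.750 | 0.751 |
| Healthy older women | 0.610 | 0.645 | 0.634 | 0.633 | 0.695 | 0.694 | 0.693 | 0.694 |
| COPD | 0.537 | 0.518 | 0.516 | 0.513 | 0.554 | 0.555 | 0.555 | 0.554 |
| Menopausal women | 0.766 | 0.786 | 0.783 | 0.782 | 0.821 | 0.821 | 0.822 | 0.822 |
| Irritable bowel | 0.752 | 0.776 | 0.777 | 0.775 | 0.817 | 0.817 | 0.819 | 0.818 |
| Trauma | 0.573 | 0.546 | 0.549 | 0.551 | 0.622 | 0.618 | 0.620 | 0.624 |
| Lower back pain | 0.540 | 0.592 | 0.600 | 0.594 | 0.653 | 0.653 | 0.656 | 0.656 |
| Leg reconstruction | 0.498 | 0.569 | 0.584 | 0.587 | 0.656 | 0.653 | 0.659 | 0.662 |
| Leg ulcer | 0.557 | 0.609 | 0.601 | 0.604 | 0.671 | 0.669 | 0.668 | 0.670 |
| Osteoarthritis | 0.363 | 0.490 | 0.489 | 0.495 | 0.532 | 0.527 | 0.527 | 0.532 |
| Varicose veins | 0.757 | 0.805 | 0.804 | 0.804 | 0.843 | 0.842 | 0.843 | 0.844 |

Error

| Data set | EQ (1) | EQ (2) | EQ (3) | EQ (4) | EQ (5) | EQ (6) | EQ (7) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0.001 | 0.001 | 0.001 | 0.067 | 0.066 | 0.067 | 0.069 |
| Asthma | −0.019 | −0.018 | −0.017 | 0.050 | 0.049 | 0.050 | 0.052 |
| Chest pain | −0.087 | −0.087 | −0.088 | −0.036 | −0.037 | −0.036 | −0.035 |
| Healthy older women | 0.035 | 0.024 | 0.023 | 0.085 | 0.084 | 0.083 | 0.084 |
| COPD | −0.019 | −0.020 | −0.024 | 0.017 | 0.018 | 0.018 | 0.017 |
| Menopausal women | 0.020 | 0.018 | 0.016 | 0.055 | 0.055 | 0.056 | 0.056 |
| Irritable bowel | 0.024 | 0.025 | 0.023 | 0.065 | 0.065 | 0.067 | 0.066 |
| Trauma | −0.027 | −0.024 | −0.022 | 0.049 | 0.045 | 0.046 | 0.051 |
| Lower back pain | 0.052 | 0.060 | 0.054 | 0.112 | 0.112 | 0.116 | 0.115 |
| Leg reconstruction | 0.071 | 0.086 | 0.089 | 0.158 | 0.155 | 0.161 | 0.164 |
| Leg ulcer | 0.052 | 0.044 | 0.047 | 0.114 | 0.112 | 0.112 | 0.114 |
| Osteoarthritis | 0.126 | 0.126 | 0.132 | 0.168 | 0.163 | 0.164 | 0.169 |
| Varicose veins | 0.048 | 0.047 | 0.047 | 0.086 | 0.085 | 0.086 | 0.086 |
| ME | 0.021 | 0.022 | 0.022 | 0.076 | 0.075 | 0.076 | 0.077 |
| MAE | 0.045 | 0.045 | 0.045 | 0.082 | 0.081 | 0.082 | 0.083 |
| RMSE | 0.055 | 0.056 | 0.058 | 0.093 | 0.091 | 0.092 | 0.094 |
| Abs. error < 0.05 | 62% | 69% | 69% | 23% | 31% | 23% | 15% |
| Abs. error < 0.025 | 38% | 46% | 54% | 8% | 8% | 8% | 8% |
| Abs. error < 0.01 | 8% | 8% | 8% | 0% | 0% | 0% | 0% |
| Abs. error < MID | | | | | | | |

  1. COPD, chronic obstructive pulmonary disease; ME, mean error; MAE, mean absolute error; RMSE, root mean squared error.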

Accuracy Using Out-of-Sample Mean Statistics

When assessing the models' accuracy in predicting the out-of-sample mean EQ-5D scores using summary statistics (Table 5), of the 63 predicted scores, the models with the squared terms (models EQ [4] to EQ [7]) produced larger MEs (greater than 0.097) as opposed to below 0.05 for the models that do not include the squared terms. There is little to choose between the MAEs for models EQ (1), EQ (2), and EQ (3) at 0.0641, 0.0654, and 0.0654, respectively. Models EQ (4) to EQ (7) were less accurate with an MAE greater than 0.098. Approximately 62% (37%) of the scores for models EQ (1) to EQ (3) were correct to within |0.05| (|0.025|) compared with 30% (8%) of the scores for models EQ (4) to EQ (7) (Table 5 and Fig. 3). Model EQ (2) has the greatest proportion (78%) correct to within the MID.

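Table 5.  Errors in the predicted out-of-sample mean EQ-5D scores (n = 63)

| Model | ME | MAE | RMSE | Abs. error < 0.05 (%) | Abs. error < 0.025 (%) | Abs. error < 0.01 (%) | Abs. error < MID (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EQ (1) | 0.0498 | 0.0641 | 0.1011 | 63 | 37 | 14 | 76 |
| EQ (2) | 0.0495 | 0.0654 | 0.1024 | 62 | 37 | 11 | 78 |
| EQ (3) | 0.0476 | 0.0654 | 0.1016 | 62 | 35 | 11 | 75 |
| EQ (4) | 0.0980 | 0.0988 | 0.1234 | 30 | 10 | 2 | 44 |
| EQ (5) | 0.0972 | 0.0981 | 0.1228 | 29 | 8 | 2 | 44 |
| EQ (6) | 0.0979 | 0.0989 | 0.1236 | 30 | 8 | 2 | 44 |
| EQ (7) | 0.0987 | 0.0995 | 0.1241 | 30 | 8 | 2 | 44 |

  1. ME, mean error; MAE, mean absolute error; RMSE, root mean squared error; MID, minimal important difference.
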
Figure 3.

Distribution of the errors in the predicted out-of-sample mean EQ-5D scores.

A histogram (Fig. 3) of the distribution of the errors in the predicted out-of-sample mean EQ-5D scores for models EQ (1) and EQ (4) shows that model EQ (1) has a large proportion of values (29%) that are identical to the actual value. Model EQ (1) has a substantial proportion (48%) that are within 5% of the actual value.

When comparing how accurate the models are when predicting incremental differences between study arms (n = 31), all the models produced an ME smaller than 0.0023 (Table 6) and approximately 70% of values were accurate to within |0.05|. Almost 50% of the incremental values were also accurate to within |0.025| and more than 77% were accurate to within the MID. When looking at the incremental changes over time (Table 7), the models that include the squared terms (EQ [4] to EQ [7]) produced the smallest errors. The MEs for models EQ (4) to EQ (7) were approximately 0.027 compared with approximately 0.034 for models EQ (1) to EQ (3). Sixty-seven percent (46%) of incremental changes over time were correct to within |0.05| (|0.025|) for models EQ (4) to EQ (7) compared with 63% (<42%) for models EQ (1) to EQ (3). Models EQ (1) to EQ (3) predicted 83% of changes to within the MID.
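For a model with main effects only, the predicted difference between two cohorts depends only on the differences in their mean dimension scores, because the intercept cancels. The Python sketch below illustrates this with model EQ (1) coefficients from Table 3 and the two hemophilia cohorts at baseline. The cohort values are read from the flattened Table 2 rows, so their column assignment is an assumption, and the function name is illustrative.

```python
# Model EQ (1) slope coefficients transcribed from Table 3
EQ1_COEF = {
    "PF": 0.00370, "SF": 0.00111, "RP": -0.00024, "RE": 0.00024,
    "MH": 0.00256, "VT": -0.00063, "BP": 0.00286, "GH": 0.00052,
}

def predicted_increment(arm_a, arm_b):
    """Predicted difference in mean EQ-5D (arm_a minus arm_b) under model EQ (1)."""
    return sum(EQ1_COEF[k] * (arm_a[k] - arm_b[k]) for k in EQ1_COEF)

# Hemophilia cohorts at baseline, as read from Table 2 (assumed column order)
s1 = {"PF": 53.8, "SF": 70.4, "RP": 58.1, "RE": 74.9,
      "MH": 55.0, "VT": 72.9, "BP": 57.7, "GH": 46.8}
s2 = {"PF": 67.9, "SF": 79.9, "RP": 81.2, "RE": 86.1,
      "MH": 61.4, "VT": 74.0, "BP": 76.8, "GH": 64.1}

observed = 0.85 - 0.66  # reported mean EQ-5D scores for S2 and S1
predicted = predicted_increment(s2, s1)
print(round(predicted, 3), round(observed, 2))
```

In this single example the predicted increment is smaller than the observed one, which matches the tendency for errors to grow as the difference in scores increases.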

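Table 6.  Incremental differences of mean EQ-5D scores between study arms (n = 31)

| Model | ME | MAE | RMSE | Abs. error < 0.05 (%) | Abs. error < 0.025 (%) | Abs. error < 0.01 (%) | Abs. error < MID (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EQ (1) | −0.0007 | 0.0498 | 0.0774 | 71 | 45 | 26 | 81 |
| EQ (2) | 0.0002 | 0.0506 | 0.0774 | 71 | 42 | 19 | 77 |
| EQ (3) | 0.0014 | 0.0524 | 0.0798 | 71 | 45 | 19 | 77 |
| EQ (4) | −0.0019 | 0.0441 | 0.0685 | 68 | 48 | 29 | 81 |
| EQ (5) | −0.0023 | 0.0440 | 0.0685 | 71 | 45 | 26 | 81 |
| EQ (6) | −0.0021 | 0.0441 | 0.0685 | 71 | 45 | 26 | 81 |
| EQ (7) | −0.0016 | 0.0441 | 0.0684 | 71 | 45 | 29 | 81 |

  1. ME, mean error; MAE, mean absolute error; RMSE, root mean squared error; MID, minimal important difference.
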
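Table 7.  Incremental changes of mean EQ-5D scores over time (n = 24)

| Model | ME | MAE | RMSE | Abs. error < 0.05 (%) | Abs. error < 0.025 (%) | Abs. error < 0.01 (%) | Abs. error < MID (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EQ (1) | 0.0335 | 0.0527 | 0.0788 | 63 | 42 | 21 | 83 |
| EQ (2) | 0.0344 | 0.0531 | 0.0793 | 63 | 38 | 21 | 83 |
| EQ (3) | 0.0342 | 0.0532 | 0.0801 | 63 | 38 | 21 | 83 |
| EQ (4) | 0.0265 | 0.0480 | 0.0677 | 67 | 46 | 21 | 75 |
| EQ (5) | 0.0267 | 0.0484 | 0.0681 | 67 | 46 | 21 | 75 |
| EQ (6) | 0.0268 | 0.0486 | 0.0682 | 67 | 46 | 25 | 71 |
| EQ (7) | 0.0265 | 0.0482 | 0.0678 | 67 | 46 | 21 | 75 |

  1. ME, mean error; MAE, mean absolute error; RMSE, root mean squared error; MID, minimal important difference.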

Overall, when comparing the errors in the out-of-sample predicted mean EQ-5D scores, model EQ (1) was the most accurate. Looking at Fig. 4a, which shows the actual out-of-sample mean EQ-5D scores, the predicted values and errors for model EQ (1), it can be seen that the model overpredicts the lower EQ-5D scores and there is a trend for the errors to increase in magnitude as the mean EQ-5D score decreases. When assessing the accuracy in predicting incremental values, i.e., differences between study arms and changes over time, model EQ (4) was the most accurate overall. Looking at Fig. 4b, there is a tendency for the errors in the predicted incremental values to increase in magnitude as the difference in the scores increases.

Figure 4.

(a) Actual out-of-sample mean EQ-5D scores, predicted values and errors for model EQ (1). (b) Incremental out-of-sample EQ-5D scores, predicted incremental values and errors for model EQ (4).

Applying the Algorithm

An illustration of how the algorithms are applied is provided below using the summary statistics for the liver transplant cohort (Table 2) and model EQ (1).

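[Equations: worked calculation for model EQ (1) applied to the liver transplant cohort; equation images not available]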

This gives an estimated EQ-5D score of 0.706 compared with the actual value of 0.750.
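
The mechanics of applying such an algorithm to published summary statistics can be sketched in a few lines. The sketch assumes that the chosen model is a linear combination of an intercept and the supplied terms; the coefficient values below are placeholders and are not the published coefficients, the cohort means are hypothetical and are not the Table 2 values, and the function name is an assumption. Models that include age or squared terms would need those terms supplied as additional keys.

```python
# SF-36 dimensions: physical functioning, role physical, bodily pain,
# general health, vitality, social functioning, role emotional, mental health
DIMENSIONS = ("PF", "RP", "BP", "GH", "VT", "SF", "RE", "MH")


def predict_mean_eq5d(intercept, coefficients, mean_scores):
    """Return intercept + sum(coefficient * cohort mean) over the model terms."""
    missing = [term for term in coefficients if term not in mean_scores]
    if missing:
        raise KeyError(f"missing mean scores for: {missing}")
    return intercept + sum(coefficients[t] * mean_scores[t] for t in coefficients)


# PLACEHOLDER coefficients for illustration only; substitute the published
# values for the chosen model before use.
placeholder_intercept = 0.0
placeholder_coefficients = {d: 0.001 for d in DIMENSIONS}

# Hypothetical cohort means on the 0-100 SF-36 scale (not the Table 2 values)
cohort_means = {"PF": 70.0, "RP": 55.0, "BP": 65.0, "GH": 60.0,
                "VT": 50.0, "SF": 75.0, "RE": 70.0, "MH": 72.0}

estimate = predict_mean_eq5d(placeholder_intercept, placeholder_coefficients, cohort_means)
print(round(estimate, 3))
```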

Discussion

The SF-36 is one of the most widely used HRQoL instruments, but because the SF-6D is a comparatively new measure, a substantial proportion of the SF-36-based research is presented in terms of a health profile using the mean values of the eight dimension scores. Because the number of preference-based instruments continues to grow, if the results of economic evaluations are to be compared effectively, it will become increasingly important to be able to translate between the different QoL measures used. The models presented here offer two substantial benefits: first, they provide a mechanism to obtain preference-based utility scores from published nonpreference-based QoL data, and second, they provide analysts with a methodology to map between two of the most frequently used HRQoL instruments.

The OLS models presented here explain more than 56% of the variance in the EQ-5D scores. This is comparable with the results (63% [19], 58% [17], and 58% [20]) obtained when mapping the SF-12 summary scores onto the UK-based EQ-5D preference-based utility index using OLS regressions. Sullivan et al. reported values of 58% for an OLS model and 92% for a Tobit model involving 18 independent variables including the SF-12 summary scores and sociodemographic variables, such as ethnicity and education, when mapping onto the US preference-based EQ-5D index [21]. Nichol et al. used OLS regressions to derive a model that explained 51% of the variance in the HUI2 using age, sex, and the eight dimension scores obtained from the SF-36 responses (n = 6921) as the independent variables [10]. The reported variance explained in the models mapping from the SF-36 eight dimension scores to the HUI indexes ranges from 37% [11] to 53% [13].
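
For readers comparing these figures, the variance explained can be computed from paired actual and predicted scores as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is illustrative only: the function name and example values are assumptions, and it gives the unadjusted statistic; whether the cited studies report adjusted or unadjusted values is not stated here.

```python
def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted` (unadjusted)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    if ss_tot == 0:
        raise ValueError("actual values have zero variance")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not data from the study
print(r_squared([0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.6, 0.7]))
```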

The results of this study are also comparable with the published evidence when assessing the residuals in the predicted scores. The MAEs for our models ranged from 0.130 to 0.134 for the individual predictions, from 0.045 to 0.083 for the within-sample subgroup analyses, and from 0.0641 to 0.099 for the out-of-sample scores. Sullivan et al. reported MAEs for the individual US-weighted predictions ranging from 0.0726 for a censored least absolute deviations (CLAD) model to 0.0765 for a Tobit model [21]. Franks et al. reported an MAE ranging from 0.381 for EQ-5D scores smaller than 0 to 0.070 for EQ-5D scores greater than 0.9 [19]. Gray et al. reported an MAE of 0.108 for within-sample predictions and 0.115 for out-of-sample predictions when presenting the results of a study mapping responses in SF-12 data to EQ-5D responses [22].

While several of the published studies validated results using out-of-sample data sets, to our knowledge, this is the first study that uses published data sets to predict mean preference-based utility scores using mean cohort scores for the independent variables. The errors observed at the individual level still exist when applying the algorithms to summary data, but the magnitude of the errors is much smaller. When using model EQ (2), 49 (78%) of the 63 predicted out-of-sample mean EQ-5D scores were accurate to within |0.074|, the established MID for the EQ-5D [37]. We are also the first to report accuracy when predicting incremental values. Looking at the differences between the study cohorts, 25 (81%) of the 31 predicted values for model EQ (1) were within the MID. Looking at changes over time, 20 (83%) of the 24 predicted values for models EQ (1), (2), and (3) were within the MID.
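
The accuracy of incremental predictions can be assessed by comparing the predicted difference between two arms (or two time points) with the actual difference. The following is a minimal sketch under stated assumptions: the function names, the tuple layout, and the example values are hypothetical, and the error is again taken as predicted minus actual.

```python
MID = 0.074  # minimal important difference for the EQ-5D, as cited in the text


def incremental_errors(pairs):
    """pairs: iterable of (actual_a, actual_b, predicted_a, predicted_b)."""
    errors = []
    for actual_a, actual_b, predicted_a, predicted_b in pairs:
        actual_diff = actual_a - actual_b
        predicted_diff = predicted_a - predicted_b
        errors.append(predicted_diff - actual_diff)
    return errors


def share_within(errors, threshold):
    """Fraction of errors with absolute value below the threshold."""
    return sum(abs(e) < threshold for e in errors) / len(errors)


# Illustrative values only, not data from the study
example_pairs = [
    (0.70, 0.60, 0.72, 0.65),  # predicted difference 0.07 vs actual 0.10
    (0.50, 0.30, 0.55, 0.45),  # predicted difference 0.10 vs actual 0.20
]
errs = incremental_errors(example_pairs)
print(errs, share_within(errs, MID))
```

An error that is common to both arms cancels in the difference, so incremental errors can be smaller than the errors in the absolute predictions.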

The maximum range (1.206) in the predicted individual EQ-5D scores is obtained from model EQ (7), while the minimum range (1.004) is obtained from model EQ (3). These cover 76% and 64% of the actual EQ-5D index (−0.594 to 1.0), respectively. Using ages of 18 and 99 years, the largest possible range is 1.331 (−0.296 to 1.034) using model EQ (6), while the smallest range is 1.047 (0.023–1.070) using model EQ (3). When subgrouping the out-of-sample data sets by actual EQ-5D score, 15 (24%) of 63 have scores greater than 0.70 and smaller than 0.80, while 7 (11%) of 63 have scores greater than or equal to 0.80. The MEs (MAEs) in the predicted values for these subgroups when using model EQ (1) were −0.001 (0.021) and −0.022 (0.028), respectively. Between 87% and 100% of values for these subgroups were within the MID irrespective of the model used.

Looking at the out-of-sample data sets at the lower end of the EQ-5D scale, five cohorts have a mean EQ-5D score smaller than or equal to 0.175, and five cohorts have a mean EQ-5D score greater than 0.175 and smaller than 0.35. The MEs in the predicted values for these two subgroups when using model EQ (1) are 0.285 and 0.158, respectively. The magnitude of the errors in predicted values for the cohorts with very low EQ-5D scores was problematic. The models produced estimates that are regressed to the mean, and this is exacerbated because the distribution of the EQ-5D scores in the data set used in the regressions was skewed with a comparatively small proportion (848 of 6350) of individuals scoring below 0.5. In addition, the relationships between the EQ-5D values and the dimension scores were weaker for the data points at the lower end of the index (Table 1). Caution should be taken when applying the algorithms to data sets likely to have very low utility values. Further research involving a greater proportion of individuals with severely impaired health states is required to determine whether separate models for cohorts with very low HRQoL would be beneficial.
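
The subgroup analysis described above, in which cohorts are grouped by their actual mean EQ-5D score and the mean error is reported for each group, can be sketched as follows. The band boundaries follow the text; the function name, the data layout, and the example values are assumptions.

```python
# Bands defined on the actual mean EQ-5D score, with boundaries as in the text
BANDS = {
    "<= 0.175": lambda s: s <= 0.175,
    "> 0.175 and < 0.35": lambda s: 0.175 < s < 0.35,
    "> 0.70 and < 0.80": lambda s: 0.70 < s < 0.80,
    ">= 0.80": lambda s: s >= 0.80,
}


def mean_error_by_band(actual, predicted, bands=BANDS):
    """Return {band label: (number of cohorts, mean error or None if empty)}."""
    results = {}
    for label, in_band in bands.items():
        # Assumed sign convention: error = predicted - actual
        errors = [p - a for a, p in zip(actual, predicted) if in_band(a)]
        mean_error = sum(errors) / len(errors) if errors else None
        results[label] = (len(errors), mean_error)
    return results


# Illustrative values only, not data from the study
example_actual = [0.10, 0.30, 0.75, 0.85, 0.15]
example_predicted = [0.40, 0.45, 0.75, 0.82, 0.35]
for label, (n, me) in mean_error_by_band(example_actual, example_predicted).items():
    print(label, n, me)
```

A positive mean error concentrated in the lowest bands would be consistent with the regression to the mean described above.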

Literature describing the relationships between the EQ-5D index and the two SF-12 summary scores continues to grow. Nevertheless, to our knowledge, there is no published methodology that can be used to convert reported mean dimension scores from the full health profile of the SF-36 into an EQ-5D utility score. It is possible that the use of the eight dimension scores, as opposed to the two overall scores, could produce an algorithm with the ability to describe smaller differences and changes. Nevertheless, further research is required to confirm this.

There is no evidence that sex adds to the variance explained by the models. The results suggest that the dimension scores RP and VT add little to either the goodness of fit or the accuracy of the scores generated by the models. This probably reflects the fact that EQ-5D does not contain a VT domain. Likewise, while the inclusion of the significant squared terms increases the variance explained by the model slightly, the predictions for the out-of-sample values are less accurate. Overall, there is very little to choose between the goodness of fit and the accuracy of the predictions generated by the models presented here. Based on the out-of-sample validations, we advocate model EQ (1) as the first choice for predicting mean EQ-5D scores from mean dimension SF-36 scores when patient level data are not available. Nevertheless, when comparing incremental differences between study arms or changes over time, model EQ (4) is the preferred choice.

In conclusion, we have found that the results of a simple OLS regression can be used to transform the mean dimension scores from the SF-36 questionnaire into mean preference-based EQ-5D scores. The model is reasonably accurate when predicting out-of-sample data including absolute values, incremental changes between study arms, and incremental changes over time. The results suggest that the algorithm could be used to populate health states in economic models when EQ-5D data are not available. Nevertheless, mapping is always second best to either using the EQ-5D directly in clinical studies or obtaining preference weights for the nonpreference measure where the EQ-5D is not appropriate for the condition. Further research is required to refine and improve on the models presented here.

Source of financial support: None.
