Mapping between EQ‐5D‐3L and EQ‐5D‐5L: A survey experiment on the validity of multi‐instrument data

Abstract EQ‐5D is a 5‐item questionnaire instrument designed to measure health‐related quality of life. It is extremely important, since it is used to measure health benefits in many studies providing evidence for reimbursement decisions by the National Institute for Health and Care Excellence in England and similar policy bodies in other countries. EQ‐5D has been redesigned in a more detailed form (EQ‐5D‐5L), but much existing cost‐effectiveness evidence is based on the older version (EQ‐5D‐3L). Statistical mapping from one version to another is widely used, exploiting data from multi‐instrument surveys incorporating both variants. However, little is known about the robustness of data from such multi‐instrument surveys. We design a randomized experiment to investigate whether inclusion of both versions at different stages in a single interview gives a reliable picture of the relationship between health measures from the two instruments and embed it in individual interviews from the UK Understanding Society household panel. We find that sequencing of the two versions of EQ‐5D within an interview has a significant impact not only on the resulting data but also on the estimated mapping models. We illustrate the non‐negligible effects in two real‐world cost‐effectiveness examples and discuss the implications for future multi‐instrument survey design.

of a sample of the general population for the UK. The resulting set of values has an upper bound at one representing full health with zero being equivalent to being dead. Values below zero denote health states that are considered to be worse than being dead. Although simple, EQ-5D has become extremely important for policy purposes. Economic evaluations providing evidence for decisions on medical technologies designed to improve health such as those used by the National Institute for Health and Care Excellence (NICE) in England and similar policy bodies in other countries use the quality-adjusted life year (QALY) as their measure of benefit. The QALY reflects concerns for both the quality and length of life and it is based on patient reported outcome measures such as EQ-5D, the preferred health benefit measure of NICE for its technology appraisals (NICE, 2013).
EQ-5D has been redesigned (Herdman et al., 2011) in a more detailed form known as EQ-5D-5L (shortened to 5L henceforth), adding two more levels in between levels 1 and 2, and 2 and 3 of the original 3L version. Recent work by the NICE Decision Support Unit (Hernández-Alava, Pudney, Pennington et al., 2018) has revealed substantial differences between the way in which 3L and 5L estimate gains for health technologies in terms of QALYs and, in turn, their cost-effectiveness. These differences originate in the responses individuals give to the two descriptive systems and also in the valuations assigned to health states by the available utility scoring systems. These measurement conflicts create a real policy dilemma for health authorities in recommending one or other of the two versions of EQ-5D. Whichever is recommended, there is a stock of existing evidence generated in 3L terms and a growing body of 5L-based evidence that will need to be reconciled or combined in decision models for many years to come: a 5L value set will not make 3L evidence redundant. It is highly likely that new extensions and variants of measures will also be developed in the future, for which the same issues will apply. Statistical mapping from one version to another is widely used to negotiate these obstacles, exploiting multi-instrument (MI) surveys which incorporate both variants (see Pennington et al., 2018; van Hout & Shaw, 2021; van Hout et al., 2012 for examples of mapping between 3L and 5L). 1 Two existing datasets have been identified previously for mapping between the two versions of EQ-5D, but both have limitations in terms of sample size and study design for statistical mapping. FORWARD, or the National Databank for Rheumatic Diseases (Wolfe & Michaud, 2011), is a register of patients with rheumatoid disease primarily recruited by referral from US and Canadian rheumatologists.
Because FORWARD is disease-specific, its widespread use requires the assumption that the statistical mapping model is appropriate across disease areas. The second study, co-ordinated by the EuroQol Group, was carried out in six countries and included eight patient populations, 2 with a student cohort representing the healthy population. Although the dataset includes more than one disease area, it was not chosen to be representative of any specific population and thus suffers from the same problem of external validity as the FORWARD dataset. There are additional issues in terms of study design. Given these concerns about the currently available MI datasets and the importance of mapping for policy purposes, a large-scale data collection designed for the purpose of mapping that can be used for decision making in the UK has been recommended.
Analysis by Hernández Alava et al. (2020, chapter 5.1) has produced evidence suggesting a low degree of empirical consistency between the orderings of health states by the 3L and 5L instruments when both are implemented within an MI survey. There are three obvious possibilities for the source of this inconsistency: conceptual (respondents interpret 3L and 5L as asking about different health concepts); or random reporting error (respondents are clear about the concept but make idiosyncratic "errors" in reporting); or repetition bias (response behavior is changed systematically by the experience of being asked similar questions earlier in the same interview).
The last of these could arise in several ways that are extensively discussed in the survey methods literature (see Schuman & Presser, 1996 for discussion of the effects of question ordering and context on response behavior). However, that literature is mostly concerned with the design of questionnaire modules containing a set of questions about specific aspects followed or preceded by an overall summary question (see, e.g., McFarland, 1981). Instead, we are concerned here with an issue that has had very little research attention: the ordering of two versions of a multi-aspect instrument.
If MI survey responses are systematically distorted by behavioral effects induced by near-repetition, then the use of statistical mapping based on such reference samples is open to question, and there is an urgent need to investigate this using appropriately designed experiments. More constructively, good experimental evidence on the behavior of respondents in MI surveys may give us a basis for recommending survey designs that improve the reliability of MI data.
In this study, we investigate the reliability of the MI survey design by developing and analyzing data from a randomized experiment designed to investigate whether inclusion of both versions of EQ-5D at different stages in a single interview gives a reliable picture of the relationship between health measures based on the two instruments. We used the Understanding Society general population panel survey, also known as the UK Household Longitudinal Survey (UKHLS), as the vehicle for our experiment.

2.1 | The experiment
Our experiment was carried out in wave 11 of the UKHLS Innovation Panel, with fieldwork conducted from June to September 2018. 3 The UKHLS is a nationally representative panel study with a two-stage stratified random sample of the household population. It aims to interview every adult (aged 16 and over) household member annually, and the Innovation Panel is a subset of the full panel (2898 achieved adult interviews in 1856 households at wave 11) which is reserved for experimentation of various kinds (Jäckle et al., 2018).
Our experimental design randomized the sample of households into three approximately equal-sized groups, and every individual within a household was allocated the same experimental treatment. The EQ-5D instruments 4 were embedded in the general individual interview (lasting approximately 20 min on average), which covered a wide range of socioeconomic topics. One group received the 5L module early (at approximately the halfway point in the interview) and the 3L module later (approximately three-quarters of the way through the interview) 5 ; the second group received the 3L instrument early and the 5L instrument later; and the third group received only the 5L instrument, at the later point in the interview. We refer to these groups henceforth as 5L3L, 3L5L and 5Lonly. The two variants of EQ-5D were separated by numerous survey questions, with a wide range of topics and response options. Not all respondents answered exactly the same questions, as some are context specific (e.g., retired individuals were not asked about job details).
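The household-level allocation described above can be sketched as follows. This is only an illustration of cluster randomization into three roughly equal groups; the function names, the seed and the toy identifiers are assumptions, not the procedure actually used in the UKHLS fieldwork.

```python
import random

GROUPS = ("5L3L", "3L5L", "5Lonly")

def assign_households(household_ids, groups=GROUPS, seed=0):
    """Randomly split households into approximately equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(household_ids)
    rng.shuffle(shuffled)
    # Dealing shuffled households out in turn keeps group sizes within one
    # of each other.
    return {hh: groups[i % len(groups)] for i, hh in enumerate(shuffled)}

def treatment_for_individuals(individuals, household_treatment):
    """individuals: iterable of (person_id, household_id) pairs.

    Every member of a household inherits the household's treatment."""
    return {pid: household_treatment[hh] for pid, hh in individuals}

# Hypothetical toy data: 10 households with two adults each.
households = [f"hh{i}" for i in range(10)]
people = [(f"p{i}", f"hh{i // 2}") for i in range(20)]
hh_treat = assign_households(households, seed=42)
person_treat = treatment_for_individuals(people, hh_treat)
```

Randomizing at the household level means that responses within a household share a treatment, which is why later tests use only one respondent per household.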
Fieldwork used a multimode strategy, with a random two-thirds of the panel initially offered a web interview (CAWI) and the remainder by computer-assisted self-interviewing (CASI) embedded within the main face-to-face interview. 6 Assignment to interview mode groups was independent of assignment to EQ-5D treatment groups, but individuals could opt for other interview modes and overall tended to prefer conventional interviews: 55% of achieved interviews were by CASI and 45% by CAWI. 7 The achieved sample of individuals who gave full responses to the 5L (3L) instrument comprised 2635 (1732) individuals, clustered in 1689 (1278) households. We tested the balance of the sample across treatment groups using Fisher's exact test of independence, based on cross-tabulations, constructed using only the first respondent individual from each household to avoid intra-household correlation. We found no flaws in the randomization with respect to gender (p = 0.52), age group (p = 0.15) or existence of a limiting illness or disability (p = 0.37). The incidence of CAWI rather than CASI interviewing was also balanced across treatment groups (p = 0.95).

2.2 | Sample characteristics
Since the UKHLS is a general population survey, many respondents reported no current health problems: the sample proportions of those reporting the best health state (11111) are 50% and 41% for 3L and 5L, respectively. However, the sample does have coverage of poorer health states. In their responses to a separate question, 36% of the sample reported a long-standing illness or disability (LSI); among that group, the sample proportions of state (11111) fall to 17% (3L) and 14% (5L). Appendix Figures A1 and A2 show the distributions of overall summary measures of 3L and 5L responses in the whole sample and the subset of respondents reporting an LSI. For the "misery" measure (the unweighted sum of the scores in each domain), the LSI group have 21% and 31% more misery than the sample average for the 3L and 5L instruments, respectively. Using the Dolan (1997) and Devlin et al. (2018) utility scores, the LSI group have 20% and 15% less utility on average, respectively.
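As a small illustration of the "misery" measure, the sketch below sums the five domain scores of an EQ-5D profile. The second function rescales the sum to the unit interval by min-max scaling; that particular rescaling formula is an assumption made here for illustration, and the function names are hypothetical.

```python
def misery_sum(profile):
    """Unweighted sum of the domain scores, e.g. (1, 2, 1, 3, 1) -> 8."""
    return sum(profile)

def misery_unit_interval(profile, levels):
    """Min-max rescaling of the sum to [0, 1] (assumed form).

    levels is 3 for EQ-5D-3L and 5 for EQ-5D-5L; the best state (all 1s)
    maps to 0 and the worst state (all at the top level) maps to 1."""
    n_domains = len(profile)
    return (sum(profile) - n_domains) / (n_domains * (levels - 1))

best_state = (1, 1, 1, 1, 1)
worst_3l = (3, 3, 3, 3, 3)
worst_5l = (5, 5, 5, 5, 5)
```

Because the 3L and 5L sums have different ranges (5 to 15 and 5 to 25), some rescaling of this kind is needed before the two instruments can be compared on a common footing.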
The number of distinct health states reported by the 1732 respondents who completed 3L fully was 64 (26% of the 243 logically possible 3L states). For the 2635 respondents completing 5L, the number of distinct states reported was 269 (9% of the 3125 possible). These numbers compare with the 123 (51%) and 676 (22%) distinct states found for 3L and 5L in the EuroQol MI dataset and 136 (56%) and 960 (31%) for the January 2011 FORWARD dataset. The number of distinct own health states reported by experimental subjects in the 5L valuation study of Devlin et al. (2018) was 180 (6%). Thus the coverage of health states by UKHLS is less wide than that of disease-linked MI datasets, but it is comparable to or more extensive than coverage by other datasets used for valuation purposes.
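The coverage percentages quoted above follow from the number of logically possible profiles, which is the number of levels raised to the power of the five domains. A minimal sketch of that arithmetic, using only the counts given in the text (the dictionary labels are just descriptive names):

```python
def n_states(levels, domains=5):
    """Number of logically possible EQ-5D profiles."""
    return levels ** domains

def coverage_percent(distinct_observed, levels, domains=5):
    """Share of possible profiles observed, as a rounded percentage."""
    return round(100 * distinct_observed / n_states(levels, domains))

# (distinct states observed, number of levels), as reported in the text.
reported = {
    "UKHLS 3L": (64, 3),
    "UKHLS 5L": (269, 5),
    "EuroQol MI 3L": (123, 3),
    "EuroQol MI 5L": (676, 5),
    "FORWARD 3L": (136, 3),
    "FORWARD 5L": (960, 5),
    "Devlin et al. valuation 5L": (180, 5),
}
coverage = {name: coverage_percent(n, k) for name, (n, k) in reported.items()}
```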

2.3 | Analytical methods
We analyze the data in five steps. First, we examine the awareness by respondents of the repetition. As part of our design, CASI interviewers were asked to assess the reactions of respondents in the 5L3L and 3L5L groups to receiving repeated EQ-5D instruments, distinguishing between those who appeared not to notice the repetition, those who noticed but appeared unconcerned, and those who expressed dissatisfaction or other concern about the repetition. We use ordinal response models to examine whether awareness is uniform across respondent groups.
Second, we test the hypothesis of equal response distributions across the 3L5L, 5L3L and 5Lonly groups, using domain-specific $\chi^2$ tests and a joint test across all domains, applied to each pair of treatment groups. The joint test is useful as it avoids the pitfalls of multiple testing, but the domain-specific tests are valuable to identify the sources of significant overall difference. Define $p_g$ as the vector of sample proportions from group g ∈ {3L5L, 5L3L, 5Lonly} of the first k−1 response options, where k is the number of levels (3 or 5), and $n_g$ as the number of respondents in group g. The test statistic for comparison of groups g and h is based on the difference between $p_g$ and $p_h$. Under the assumption of strictly positive population probabilities and the null hypothesis of identical response probabilities, it has a limiting $\chi^2(k-1)$ distribution as $n_g, n_h \to \infty$ with $n_g/n_h \to$ constant. For some domains there were no responses at the bottom of the response scale. In these cases, we apply the test with the bottom two categories merged and the degrees of freedom therefore reduced by 1. Asymptotically, the test remains valid when implemented in this way, whether or not the population probability of the bottom category is strictly positive. We use only one respondent per household (the first one interviewed), to ensure independence.
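A two-group test of this kind can be sketched as below. The exact statistic used in the study is not reproduced here; the sketch substitutes Pearson's homogeneity statistic for a 2 × k table, which has the same limiting $\chi^2(k-1)$ distribution under the null. The function names and the example counts are hypothetical, and the code assumes every category that remains after merging has at least one response.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed forms for df = 2 (exponential) and df = 1 (erfc), extended with
    # the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def homogeneity_test(counts_g, counts_h):
    """Pearson test that two groups share one response distribution.

    counts_g, counts_h: response counts per level, best level first."""
    g, h = list(counts_g), list(counts_h)
    # An empty bottom category is merged into its neighbour, which lowers
    # the degrees of freedom by one (it holds no observations, so dropping
    # it is equivalent to merging).
    while len(g) > 2 and g[-1] + h[-1] == 0:
        g.pop()
        h.pop()
    n_g, n_h = sum(g), sum(h)
    stat = 0.0
    for o_g, o_h in zip(g, h):
        pooled = (o_g + o_h) / (n_g + n_h)
        stat += (o_g - n_g * pooled) ** 2 / (n_g * pooled)
        stat += (o_h - n_h * pooled) ** 2 / (n_h * pooled)
    df = len(g) - 1
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts for a 5-level item with nobody at the bottom level.
stat, df, p_value = homogeneity_test([120, 60, 30, 10, 0], [150, 45, 20, 5, 0])
```

With a pooled variance estimate, this statistic is algebraically a quadratic form in the difference between the two vectors of sample proportions, which is why it is used here as a stand-in.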
The third step is to examine the prevalence of low-3L/high-5L or low-5L/high-3L conflicts within the 3L5L and 5L3L groups, using binary probit models to identify types of respondent particularly prone to such conflicts.
Although instruments like EQ-5D are designed for practical policy use, research on health outcome measurement often stops short of examining the impact of measurement differences in practice. In fact, statistically significant differences between the three experimental groups may or may not matter for practical purposes, so our fourth step is to examine the impact of between-group differences on statistical mapping, which is the main practical use of MI surveys. We estimate separate statistical mapping models using data from the 3L5L and 5L3L experimental groups and test the hypothesis that the resulting mapping functions are identical. The Dolan (1997) utility tariff allocates utility scores ($U_3$) in the range −0.594 to 1, where 1 indicates the best possible level of health-related quality of life and negative values indicate a state perceived as worse than dead. The distributions of 3L utility scores for the 5L3L and 3L5L groups combined and for the subsample of people with an LSI are shown in Figure A1a, 1b of the Appendix. The distribution is highly skewed, there is a discrete probability mass at the extreme value $U_3 = 1$, and there are large gaps in the distribution, most importantly the absence of any utility scores in the interval (0.883, 1). A flexible approach to modeling is required if the distribution of 3L utility scores is to be captured adequately, and models based on mixtures of generalized Tobit-type components have been found to work well. 8 We use the implementation of this idea developed by Hernández and Wailoo (2015), which involves a set of components labeled c = 1 … C:

$$y^*_{ic} = x_i \beta_c + \varepsilon_{ic},$$

where $x_i$ is a vector of predictor covariates, $\beta_c$ is a coefficient vector and $\varepsilon_{ic} \sim N(0, \sigma_c^2)$ a Gaussian random error, specific to component c. The probability of individual i belonging to the c-th latent sub-population has multinomial logit form:

$$P(c \mid z_i) = \frac{\exp(z_i \delta_c)}{\sum_{j=1}^{C} \exp(z_i \delta_j)},$$

where $z_i$ is a vector of covariates and one of the coefficient vectors $\delta_1 \ldots \delta_C$ is normalized to zero.
The utility score for the observed 3L outcome is then assumed to be:

$$U_{3i} = \begin{cases} 1 & \text{if } y^*_{ic} > 0.883, \\ -0.594 & \text{if } y^*_{ic} \le -0.594, \\ y^*_{ic} & \text{otherwise.} \end{cases}$$

Estimation of this structure is by maximum likelihood, and a combination of the Bayesian Information Criterion, mean absolute error, root mean squared error and graphical methods is used to choose the number of components. Using a flexible model ensures that any differences in the mapping models are due to differences in the data (survey design) rather than model misspecification.
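To make the structure of such a mixture concrete, the sketch below simulates one utility score from a two-component version. It assumes that latent values above 0.883 are reported as full health and that values are bounded below at −0.594, in line with the gap and range of the Dolan tariff described above. All parameter values, names and the two-component setup are hypothetical; this is not the fitted model, and it omits the degenerate component used later in the paper.

```python
import math
import random

UPPER_GAP = 0.883     # latent values above this are reported as 1
LOWER_BOUND = -0.594  # lowest value in the Dolan (1997) tariff

def membership_probs(z, deltas):
    """Multinomial logit probabilities; one delta vector is all zeros."""
    scores = [sum(zk * dk for zk, dk in zip(z, d)) for d in deltas]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def censor(y_star):
    """Map a latent value onto the feasible range of 3L utilities."""
    if y_star > UPPER_GAP:
        return 1.0
    if y_star < LOWER_BOUND:
        return LOWER_BOUND
    return y_star

def simulate_utility(x, z, betas, sigmas, deltas, rng):
    """Draw a component, then a censored Gaussian outcome within it."""
    probs = membership_probs(z, deltas)
    c = rng.choices(range(len(probs)), weights=probs)[0]
    mean = sum(xk * bk for xk, bk in zip(x, betas[c]))
    return censor(rng.gauss(mean, sigmas[c]))

# Hypothetical inputs: intercept plus five 5L item scores in x;
# intercept, best-state indicator and misery sum in z.
x = [1.0, 2.0, 1.0, 2.0, 3.0, 1.0]
z = [1.0, 0.0, 9.0]
betas = [[1.10, -0.03, -0.02, -0.03, -0.05, -0.03],
         [0.90, -0.08, -0.05, -0.06, -0.10, -0.06]]
sigmas = [0.05, 0.15]
deltas = [[0.0, 0.0, 0.0], [0.5, -1.0, 0.05]]
draws = [simulate_utility(x, z, betas, sigmas, deltas, random.Random(s))
         for s in range(200)]
```

Every simulated value is either exactly 1 or lies between −0.594 and 0.883, which reproduces the probability mass at full health and the empty interval above 0.883.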
The fifth strand of our analysis summarizes the practical importance of differences between the mapping functions estimated from the 3L5L and 5L3L groups by illustrating the different results they produce in simple cost-effectiveness calculations based on data from two real-world clinical trials. The intention here is not to revise or make comparisons with the results of those studies, but to summarize the differences between our 3L5L and 5L3L experimental groups using a metric that is directly relevant to policy applications.

3.1 | Do interviewees notice repetition?
Among the 863 respondents who received both 3L and 5L by CASI and for whom there is complete personal data, the reactions to receiving repeated EQ-5D instruments were found to be independent of gender and the 3L/5L ordering. 9 But, as Table 1 shows, there was a large age difference in the reactions, with people aged over 60 more than twice as likely as those aged under 35 to express awareness and concern over the repetition. Table 2 gives estimates, based on separate ordered probit models for the 5L3L and 3L5L groups, of the estimated impacts on the probabilities of reactions to repetition of a wider range of personal characteristics. The large significant age effects are confirmed, but reactions to repetition are unrelated to gender or the individual's health status, as represented by self-reported LSI. There is no significant evidence of a difference between the 5L3L and 3L5L groups in respondents' awareness of the repetition. This might be seen as surprising, since the 5L3L group involves following a detailed instrument (5L) by a less detailed one (3L) which could seem less justifiable to respondents than the reverse ordering used in the 3L5L group.

TABLE 2 Estimated marginal effects on probabilities of reactions to repetition of EQ-5D module (ordered probit model). Columns: Pr (noticed, unconcerned); Pr (noticed with concern). Coefficient significance: * = 10%, ** = 5%, *** = 1%.

3.2 | Are EQ-5D responses significantly affected by timing and sequencing?

Table 3 reports the results of domain-specific $\chi^2$ tests of the hypothesis of equal response distributions between pairs of treatment groups, and the overall $\chi^2$ test spanning all five EQ-5D domains. We apply these tests separately to the subsamples of people who do not report an LSI and those who do (and who may therefore be more representative of participants in some clinical trials).
The overall $\chi^2$ test for responses to the simpler 3L instrument finds no significant evidence of a difference between the 5L3L and 3L5L groups for respondents without any LSI. However, among people who do report an LSI, we find a significant difference (at the 5% level) between the 3L response distributions. Domain-specific tests suggest that the sources of the difference are the mobility and pain domains.
Comparisons of the 5L response distributions reveal significant differences for 5L3L versus 3L5L (LSI and no-LSI) and for 3L5L vs. 5Lonly (no LSI). 10 In all of these cases, the pain dimension is the primary source of the difference, and there is also some domain-specific evidence of a difference in the pain responses for 5L3L versus 5Lonly (LSI). Only in one other case (3L5L vs. 5Lonly, anxiety/depression domain for non-LSI respondents) is there a significant difference at the 5% level.
The pain dimension of EQ-5D should clearly be the main focus of concern for consistency between measures, and this tends to confirm the conclusions of consistency analysis of the existing FORWARD and EuroQol Group datasets (Hernández-Alava & Pudney, 2017). The structure of the Dolan (1997) utility tariff used in many UK cost-effectiveness studies assigns relatively high weight to the pain dimension, 11 which means that pain may often be pivotal in determining the outcome of cost-effectiveness studies. The significant response difference we find for the pain domain is therefore a potentially important policy issue. Figures 1-5 show the direction of the differences between the group-specific response proportions in the pain domain, for those cases where the $\chi^2$ test led to a result significant at the 5% level. 12 We begin with responses to the 5L instrument.

TABLE 3 $\chi^2$ tests for equality of response distributions across treatment groups by domain of EQ-5D (p-values in parentheses). Statistical significance: * = 10%, ** = 5%, *** = 1%.
Figures 1 and 2 compare the 5L3L and 3L5L groups, for LSI and no-LSI respondents, respectively. The 5L3L group gives a significantly lower sample proportion for response 1 (no pain) than the 3L5L group, and thus paints a more negative picture of pain. A possible behavioral interpretation of this difference relates to two factors, sequencing and timing. It is well-established in the survey methods literature that response behavior can be influenced by the existence and nature of questions asked earlier, so the different sequencing of questions in the 5L3L and 3L5L groups may generate behavioral differences. For example, the 3L5L group may deliver a relatively positive assessment because 5L is preceded by 3L and the shorter response scale of 3L has the effect of sensitizing respondents particularly to the upper part of the 5-point scale for 5L.
The timing of a question may also be an important influence because respondents learn or become fatigued or disengaged as the interview progresses, so that questions generate different responses if asked late in the interview than they would if asked early. A possible interpretation alternative or complementary to one based on sequencing is that the later timing of 5L finds respondents in the 3L5L group more fatigued or disengaged and thus more likely to check the first box they see in the response scale, which happens to be "no pain." This explanation is certainly consistent with the evidence of Figures 1-3 but not, on its own, with Figures 4 and 5. Figures 3 and 4 show the other significant differences in 5L pain responses. Figure 3 (5L3L and 5Lonly respondents with LSI), shows a worse pain state on average for the former group, while Figure 4 (3L5L and 5Lonly respondents with no LSI) shows a worse pain state in the latter group. Since 5L3L and 5Lonly differ only in terms of timing and 3L5L and 5Lonly differ only in terms of the existence of a preceding 3L instrument, the differences in Figures 3 and 4 are again consistent with the interpretation that later timing and previous exposure to the 3L instrument both tend to promote choice of the first "no pain" response point in 5L. Thus the significant between-group differences in 5L responses are all consistent with a simple behavioral interpretation that runs in terms of sequencing and timing effects. 13 Figure 5 compares the (significantly different) 3L responses from the 5L3L and 3L5L groups for people who reported having an LSI. As was the case for 5L responses, it reveals a lower proportion of "no pain" responses in the 5L3L group than the 3L5L group. 
However, this implies a different pattern in terms of sequence and timing than the pattern for 5L responses. Within the 3L5L group, 3L is asked early and is not preceded by another variant of EQ-5D. Consequently, the effect on 3L responses of preceding 3L by 5L is the mirror image of the effect on 5L responses of preceding 5L by 3L.

FIGURE 1 Sample proportions of responses to the 5L pain question in the 5L3L and 3L5L groups from respondents with long-standing illness or disability (LSI; 95% confidence ellipses; labels: 1 = no pain, …, 5 = extreme pain)

FIGURE 2 Sample proportions of responses to the 5L pain question in the 5L3L and 3L5L groups from respondents with no long-standing illness or disability (LSI; 95% confidence ellipses; labels: 1 = no pain, …, 5 = extreme pain)
How should we interpret this puzzling finding? It may hinge on the look of the CASI or CAWI screen that shows the response scale to respondents (see Figure A7 in the appendix). We have found that 5L preceding 3L tends to promote use of lower points (worse pain states) in the 3L response scale, while 3L preceding 5L tends to promote use of the upper points (better pain states) in the 5L scale. The 3L screen has a shorter scale that occupies the upper part of the screen, and perhaps sensitizes the respondent to the upper part of the 5L response scale when it eventually appears. Conversely, the longer 5L scale stretches deeper down the screen and perhaps sensitizes the respondent to the lower part of the 3L scale when it appears. If this interpretation is correct, it is remarkable, since there is a substantial gap between the early and late placement of the instruments. It would also have important implications for the design of MI surveys, giving some grounds for use of randomized mixtures of ascending and descending or horizontal and vertical display of response scales. Table 4 shows the nature of these response differences in terms of average differences between treatment groups in overall summary health measures constructed from the EQ-5D responses. Although the sample is balanced across treatment groups, we estimate the treatment effects using the inverse probability weighting estimator, adjusting for differences with respect to age and gender. The two summary measures are a utility score (taken from Dolan, 1997 for 3L and Devlin et al., 2018 for 5L) and the "misery" index formed by summing numerical responses to the five EQ-5D items and rescaling to put the sum on the [0,1] interval. Table 5 summarizes the sample proportions of cases in which there is evidence of conflict between the health states described by the 3L and 5L responses. 
For any one of the five health domains, we define a low-3L, high-5L conflict as any case where the 3L response is 1 and the corresponding 5L response is 3 or more, or where the 3L response is 2 and 5L is 5. 14 Such a conflict means that the 5L instrument generates an implausibly poor health state relative to 3L. A high-3L, low-5L conflict is a case where the 3L response is 3 and the 5L instrument gives 1-3, or where the 3L response is 2 and 5L is 1. Table 5 shows that our experiment yielded very few cases where the 3L instrument indicated good health and the 5L instrument conflicting poor health. However, there were rather more cases (around 5% of the groups that received both 3L and 5L), where the 5L instrument gave an apparently conflicting good health state for at least one of the five domains. This approximately 5% conflict rate is identified with good precision by the experimental comparison and is similar for both treatment groups.

FIGURE 3 Sample proportions of responses to the 5L pain question from respondents with long-standing illness or disability (LSI) in the 5L3L and 5Lonly groups (95% confidence ellipses; labels: 1 = no pain, …, 5 = extreme pain)

FIGURE 4 Sample proportions of responses to the 5L pain question from respondents with no long-standing illness or disability (LSI) in the 3L5L and 5Lonly groups (95% confidence ellipses; labels: 1 = no pain, …, 5 = extreme pain)
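The two conflict definitions can be written out directly. The sketch below follows the thresholds stated in the text (3L = 1 with 5L ≥ 3, or 3L = 2 with 5L = 5, for a low-3L, high-5L conflict; 3L = 3 with 5L ≤ 3, or 3L = 2 with 5L = 1, for the reverse); the function names and example profiles are hypothetical.

```python
def low3_high5(r3, r5):
    """5L reports implausibly poor health relative to 3L in one domain."""
    return (r3 == 1 and r5 >= 3) or (r3 == 2 and r5 == 5)

def high3_low5(r3, r5):
    """5L reports implausibly good health relative to 3L in one domain."""
    return (r3 == 3 and r5 <= 3) or (r3 == 2 and r5 == 1)

def conflicts(profile3, profile5):
    """Return (any low-3L/high-5L, any high-3L/low-5L) across domains."""
    pairs = list(zip(profile3, profile5))
    return (any(low3_high5(a, b) for a, b in pairs),
            any(high3_low5(a, b) for a, b in pairs))

# Hypothetical respondent: 3L says "some" pain (2), 5L says "no" pain (1).
example = conflicts((1, 1, 1, 2, 1), (1, 1, 1, 1, 1))
```

A respondent counts as conflicted if at least one of the five domains shows either pattern, which is the sense in which the roughly 5% rate above is defined.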

3.3 | Conflicts between 3L and 5L responses
Although a 5% conflict rate may not appear very large, it could be large enough to cause substantial problems in certain types of statistical modeling 15 and, when converted to utility form, in cost-effectiveness calculations. Table 6 shows the results of probit models for the occurrence across individual respondents of any type of 3L-5L conflict. The first model is based on the whole sample from the 5L3L and 3L5L groups and allows for the impact of web interviewing on the conflict rate; the second model restricts the sample to respondents interviewed by CASI for whom interviewers could record whether repetition appeared to have been noticed by the respondent. We find a strong impact of health, with LSI raising the rate of conflicting responses by approximately 4 percentage points. There is statistically significant evidence of a lower rate of conflict (by 2.4 percentage points) in web interviews, which may reflect the characteristics of respondents willing to use the web rather than the CAWI mode itself. There is only weak evidence (p = 0.111) of a relationship between conflicting responses and awareness of repetition, although the point estimate is large: an average increase of 2.4 percentage points in the conflict rate for respondents who show no sign of awareness of the repetition. There are no significant differences between the 5L3L and 3L5L groups or between demographic groups.

TABLE 4 Average treatment effects for utility score and misery index summary measures (inverse probability weighting estimators with age group and gender covariates). Statistical significance: * = 10%, ** = 5%, *** = 1%.

4 | BUT DOES IT MATTER? GROUP DIFFERENCES EXPRESSED IN A COST-EFFECTIVENESS METRIC
The comparison of responses between treatment groups suggests that there are significant biases in EQ-5D measures produced by repetitive MI surveys. These distortions have the potential to induce measurement error bias into the predictive models that are used for mapping between 3L and 5L quality of life measures. But a statistically significant bias is not necessarily large enough to cause important problems for policy work. To give some insight into this, we use two recent cost-effectiveness studies to illustrate the response differences revealed by the UKHLS experiment using a framework more directly related to practical applications.

TABLE 6 Marginal effects on the probability of conflicting 3L and 5L responses (probit models). Statistical significance: * = 10%, ** = 5%, *** = 1%.

4.1 | Mapping models
The original 3L, combined with the utility scores constructed by Dolan (1997), currently remains the version preferred by NICE as a basis for cost-effectiveness work in England. For clinical trials which collect responses from trial subjects using the updated 5L instrument, mapping to the 3L instrument is currently mandated by NICE (2013). It is therefore of interest to use the impact on mapping models as a summary of the experimental group differences we have found. A statistical mapping model predicts the 3L responses that would have been observed, had the 3L version been implemented, using the 5L responses as predictors. In our illustrative application, we use the Hernández and Wailoo (2015) mixture model described in Section 2.3. After extensive specification search, 16 the final vector of predictors contained an intercept and the five items of 5L representing mobility, self-care, usual activities, pain and anxiety/depression, each treated as a numerical variable on the scale 1 … 5. The predictor vector for latent group membership contained an intercept, a binary variable identifying individuals who reported the best possible health state 11111, and the 5L misery index formed from summing the domain indicators, giving a single variable with range 5 … 25.
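The two predictor vectors described above can be built from a 5L profile as in the following sketch. The function name and return layout are illustrative assumptions; only the contents of the vectors come from the text.

```python
def mapping_predictors(profile5):
    """Build the outcome and membership predictors from a 5L profile.

    profile5: (mobility, self-care, usual activities, pain,
               anxiety/depression), each scored 1..5."""
    # Outcome equation: intercept plus the five items as numeric scores.
    x = [1.0] + [float(v) for v in profile5]
    # Membership equation: intercept, best-state indicator (11111),
    # and the misery sum with range 5..25.
    best = 1.0 if all(v == 1 for v in profile5) else 0.0
    z = [1.0, best, float(sum(profile5))]
    return x, z

x_best, z_best = mapping_predictors((1, 1, 1, 1, 1))
x_worst, z_worst = mapping_predictors((5, 5, 5, 5, 5))
```

Treating the items as numeric scores keeps the outcome equation at six coefficients per component, rather than the twenty-one that a full set of level dummies would require.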
Following a standard specification search carried out independently on the 5L3L and 3L5L samples, we arrived at a similar three-component specification for each. Component c = 1 was a degenerate discrete probability mass at $U_3 = 1$, with two other regular Tobit-type components. The parameter estimates are shown in Table 7, together with the predicted mean utility scores conditional on group membership and the mean membership probabilities.

TABLE 7 Parameter estimates of mixture models for 3L utility. Statistical significance: * = 10%, ** = 5%, *** = 1%.
Although the models have the same three-component structure, the estimates are quite different. A likelihood ratio test comparing the two separate models to the same structure fitted to the combined 5L3L + 3L5L sample gives a highly significant rejection of the hypothesis of equal parameters ($\chi^2(20) = 84.4$, $p = 1.4 \times 10^{-9}$). The 5L3L model has a probability mass of 0.37 at perfect health state U = 1 and two Tobit components with similar mean predictions (0.799 and 0.847) but with error variance differing by a factor of $(0.174/0.041)^2 = 18.0$. The 3L5L model has a slightly larger discrete probability mass at U = 1 and two other components with quite distinct mean predictions (0.806 and 0.366) and error variances that differ by a smaller margin of $(0.109/0.052)^2 = 4.4$. In summary, it is clear that the significant differences we found in the 5L3L and 3L5L data on EQ-5D have significant and structurally important implications for mapping models estimated from those data.
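As a reading aid, the way a three-component mixture of this kind yields a single predicted utility can be written using the law of total expectation. The symbols $\pi_c$ (membership probability of component $c$, given the group-membership predictors $z$) and $\mu_c$ (mean of component $c$, given the 5L items $x$) are introduced here purely for illustration and are not necessarily the notation of the underlying model. Since component 1 is a point mass at U = 1:

$$E[U \mid x, z] = \pi_1(z) \cdot 1 + \pi_2(z)\,\mu_2(x) + \pi_3(z)\,\mu_3(x), \qquad \pi_1(z) + \pi_2(z) + \pi_3(z) = 1.$$

Differences between the 5L3L and 3L5L fits in either the membership probabilities or the component means therefore feed directly into the predicted utility.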

Cost-effectiveness calculations
We put the differences in the mapping functions estimated from the 3L5L and 5L3L treatment groups into policy-relevant form using two illustrative cost-effectiveness applications. These cannot claim to be representative of the broad range of such studies considered by NICE, but they serve to illustrate the way that measurement differences may affect the key calculations made in cost-effectiveness studies. The two applications meet our requirements in that they are based on individual-level trial data incorporating the newer 5L version of EQ-5D, and their authors kindly agreed to re-run the cost-effectiveness calculations for us. It is important to note that the results of these recalculations are simplified and not comparable with the authors' own findings: they relate to a small part of the evaluation process and are based on different mapping approaches. It should also be noted that both are trials of interventions in health conditions that normally affect quality, rather than length, of life.
The TITRATE randomized controlled trial (RCT) was concerned with the management of established moderate rheumatoid arthritis in patients receiving stable doses of conventional drug therapies. The trial compared the outcomes of intensive case management (monthly assessments and support by trained nurses) in 169 cases with outcomes in a control group of 167 cases receiving standard care. The aim was to test the hypothesis that intensive management increases the rate of remission in such patients. The study was implemented in 39 English rheumatology centers. Health utilities were derived from participant responses to 5L at baseline, 6 and 12 months. Details of the trial and the economic evaluation are reported in Scott et al. (2020).
The HubBLe trial was a two-arm RCT conducted in 17 acute UK hospitals from September 2012 to August 2015. The trial recruited a total of 372 patients with grade II or early grade III hemorrhoids (piles that prolapse but either spontaneously reduce or require minimal manual replacement). Patients were randomly assigned to either Rubber Band Ligation (RBL, n = 187) or the more costly Haemorrhoidal Artery Ligation (HAL, n = 185) and followed for up to 12 months. The 5L measurements were made at baseline, 1 day, 7 days, 21 days, 6 weeks, and 12 months. Full details of the trial are reported by Brown et al. (2016) and an economic evaluation conducted alongside the trial is reported in Alshreef et al. (2017).
Let treatment 1 be the basic treatment (often the clinical status quo), and treatment 2 an alternative treatment hypothesized to be clinically superior. In outline, a cost-effectiveness study involves the following steps:
(i) For each trial participant j, use 5L to measure the sequence of quality-of-life outcomes over the trial period.
(ii) For each 5L measurement, use the mapping model to predict the expected value of the corresponding unobserved 3L utility outcome.
(iii) Convert the resulting sequence of predicted 3L quality-of-life measures to the metric of quality-adjusted life years (QALY), $Q_j$.
(iv) Measure the total cost of treatment, $C_j$, for each participant as a present value over the trial period.
(v) Calculate the differences in mean cost and QALY outcomes between participants receiving treatments 1 and 2 as $\Delta C = \bar{C}_2 - \bar{C}_1$ and $\Delta Q = \bar{Q}_2 - \bar{Q}_1$.
(vi) Calculate the Incremental Cost-Effectiveness Ratio (ICER), giving the cost per additional QALY delivered by treatment 2 beyond the benefit achieved with treatment 1:
$$\mathrm{ICER} = \frac{\Delta C}{\Delta Q}.$$
Treatment 2 is then recommended as an alternative to treatment 1 if the ICER is below the level that NICE regards as an affordable cost (normally in the region of £20,000 to £30,000 per additional QALY). If the ICER is above this threshold, then treatment 2 is regarded as not cost-effective, even if it offers a significantly better outcome in QALY terms. Table 8 summarizes the results of re-doing the cost-effectiveness calculations for TITRATE and HubBLe, using the two alternative mapping models estimated separately from UKHLS 5L3L and 3L5L experimental data.
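The steps outlined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used by the trial teams: the trapezoidal area-under-the-curve rule for converting utilities to QALYs is a common convention assumed here, the function names are hypothetical, and all numbers are invented for the example. The sketch also assumes treatment 2 is both more costly and more effective, so that the ICER is positive.

```python
from statistics import mean


def qaly_auc(times_years, utilities):
    """QALYs as the area under the predicted-utility curve (trapezoidal rule).

    The trapezoidal rule is an assumed convention, not taken from the source.
    """
    points = list(zip(times_years, utilities))
    return sum((t1 - t0) * (u0 + u1) / 2.0
               for (t0, u0), (t1, u1) in zip(points, points[1:]))


def icer(costs_1, qalys_1, costs_2, qalys_2):
    """Return (mean cost difference, mean QALY difference, ICER)."""
    delta_c = mean(costs_2) - mean(costs_1)
    delta_q = mean(qalys_2) - mean(qalys_1)
    if delta_q == 0:
        raise ZeroDivisionError("QALY difference is zero; ICER is undefined")
    return delta_c, delta_q, delta_c / delta_q


# Hypothetical data: predicted 3L utilities at baseline, 6 and 12 months.
times = [0.0, 0.5, 1.0]
arm1_utilities = [[0.60, 0.62, 0.64], [0.70, 0.70, 0.70]]
arm2_utilities = [[0.60, 0.70, 0.80], [0.70, 0.76, 0.82]]
arm1_costs = [1000, 1200]   # hypothetical costs in pounds
arm2_costs = [2400, 2600]

arm1_qalys = [qaly_auc(times, u) for u in arm1_utilities]
arm2_qalys = [qaly_auc(times, u) for u in arm2_utilities]

delta_c, delta_q, ratio = icer(arm1_costs, arm1_qalys, arm2_costs, arm2_qalys)
threshold = 30000  # upper end of the range cited in the text
print(f"Cost difference {delta_c:.0f}, QALY difference {delta_q:.3f}, ICER {ratio:.0f}")
print("Below threshold" if ratio < threshold else "Above threshold")
```

Swapping one mapping model for another changes only the predicted utilities fed into `qaly_auc`; the costs are untouched, which is why the mapping affects the ICER solely through its denominator.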
In the TITRATE trial, use of the 5L3L mapping rather than the 3L5L mapping gives a 13% higher estimated mean QALY for both the standard and intensive treatments. However, because the impact is fairly uniform across the two treatments, the QALY difference between the two treatments is only 8% higher for the 5L3L mapping than the 3L5L mapping. This in turn means that the estimated ICER is lower by about 7% when the 5L3L mapping is used.
In the HubBLe trial, the 5L3L mapping again leads to higher QALY outcomes than the 3L5L mapping, although only by 4% for both the RBL and HAL treatments. The HAL-RBL QALY difference is 5% higher for the 5L3L mapping, giving an estimated ICER lower by 5%: approximately £73,000 rather than £77,000, but still well above the NICE funding threshold.
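The link between the percentage change in the QALY difference and the percentage change in the ICER can be made explicit with a short calculation. It assumes that the cost difference between treatments is the same under both mappings, since the mapping enters only through the QALY calculation; the symbol $g$ for the proportional increase in the QALY difference is introduced here for illustration.

$$\frac{\mathrm{ICER}_{5L3L}}{\mathrm{ICER}_{3L5L}} = \frac{\Delta Q_{3L5L}}{\Delta Q_{5L3L}} = \frac{1}{1+g}$$

With $g = 0.08$ this ratio is $1/1.08 \approx 0.926$, an ICER about 7.4% lower, in line with the figure of about 7% for TITRATE. With $g = 0.05$ the ratio is $1/1.05 \approx 0.952$, an ICER about 4.8% lower. The HubBLe ICERs of £73,009 and £76,858 for the 5L3L and 3L5L mappings give a ratio of about 0.950, consistent with this once rounding of the reported percentages is allowed for.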

TABLE 8 HubBLe and TITRATE cost-effectiveness studies
Both TITRATE and HubBLe share a feature that is likely to be found in a wide range of applications. Although error in the mapping procedure may generate large biases in the mean QALY estimates for the two treatments, $\bar{Q}_1$ and $\bar{Q}_2$, those biases tend to cancel out to some degree, giving a smaller range of error in the QALY difference $\Delta Q = \bar{Q}_2 - \bar{Q}_1$. However, against this, the QALY difference $\Delta Q$ is the denominator of the ICER and, in evaluations where a new technology offers only marginal improvements, that denominator may be quite small. In such cases, even small biases in $\Delta Q$ may have a large impact on ICERs and thus on cost-effectiveness decisions.
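The small-denominator point can be illustrated numerically. The sketch below holds the cost difference fixed and adds the same absolute bias to QALY differences of varying size; the function name and all numbers are hypothetical and chosen only to show the pattern.

```python
def relative_icer_change(delta_q, bias):
    """Relative change in the ICER when the QALY difference delta_q is
    measured as delta_q + bias and the cost difference is unchanged.

    ICER_biased / ICER_true = delta_q / (delta_q + bias).
    """
    if delta_q <= 0 or delta_q + bias <= 0:
        raise ValueError("sketch assumes positive QALY differences")
    return delta_q / (delta_q + bias) - 1.0


bias = 0.005  # hypothetical absolute bias in the QALY difference
for delta_q in (0.20, 0.05, 0.02):
    change = relative_icer_change(delta_q, bias)
    print(f"QALY difference {delta_q:.2f}: ICER changes by {change:.1%}")
# QALY difference 0.20: ICER changes by -2.4%
# QALY difference 0.05: ICER changes by -9.1%
# QALY difference 0.02: ICER changes by -20.0%
```

The same absolute bias that barely moves the ICER when the QALY gain is large shifts it by a fifth when the gain is marginal.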
The HubBLe example also illustrates a further important point: the type of mapping model used and the dataset it is based on may have a much bigger impact on the ICER than the response errors induced by repetitive interviewing in an MI survey. The published cost-effectiveness result for HubBLe uses a mapping model devised by van Hout et al. (2012), which is simpler than our Tobit-mixture model. That mapping led to an estimated ICER of £90,688 in the complete case analysis (Alshreef et al., 2017). The gap between that figure and our estimates is much larger than the difference between the ICERs for the 5L3L and 3L5L group-specific mappings (£73,009 and £76,858).

CONCLUSIONS AND IMPLICATIONS FOR FUTURE RESEARCH
There are many survey instruments used to measure health-related quality of life outcomes as a basis for the evaluation of medical procedures and pharmaceutical products. Statistical mapping often has to be used to translate outcomes measured by one instrument into an evaluation metric developed in relation to a different instrument. This process requires a multi-instrument survey that allows the statistician to observe responses to both instruments from the same set of survey participants. It also entails a reliability assumption: that participants give essentially the same responses in a multi-instrument interview as they would in a conventional interview involving only one of the instruments. Despite the importance of statistical mapping for the policy analysis carried out by official bodies like NICE in England, there has so far been little research aiming to test the hypothesis that multi-instrument surveys are reliable in this sense.
In this paper, we have exploited a randomized experiment in the UKHLS Innovation Panel incorporating two variants of the widely used EQ-5D instrument (3L and 5L) differing primarily in the number of levels in the response scales. The experiment compared three treatment groups (5L followed later in the interview by 3L; 3L followed by 5L; and 5L alone) to investigate the MI reliability hypothesis and explore its consequences for economic evaluation. There are seven main findings.
First, despite a relatively large separation in time between the two instruments in the interview, there was evidence that a majority of respondents were aware of the (near-)repetition, with older people more likely to notice and express concern. This may be because older people were more engaged with the survey, rather than because of any inherent age difference in attention span or recall.
Second, there were statistically significant differences in both the 3L and 5L response distributions between 5L3L and 3L5L, so the ordering of the two instruments has a significant impact. There were also significant differences between the comparisons of the 5L3L and 3L5L groups with the 5L-only group. The strongest evidence of difference was found in the EQ-5D domain relating to pain, which is the most highly weighted domain in the most widely used utility scoring systems for the UK.
Third, for the 5L instrument, these significant differences at the level of individual components of EQ-5D also carry over to overall summary measures of health-related quality of life, specifically the "misery index" defined as the sum over the five health domains of EQ-5D, and the utility score proposed by Devlin et al. (2018). Evidence of an impact of question ordering on summary measures constructed from the 3L responses is less clear-cut.
Fourth, the nature of the ordering effects for questions on pain is quite complex. We have found that, if 5L is preceded by 3L, then it generates a higher proportion of favorable "no pain" responses relative to asking 5L alone or before 3L. Conversely, when 3L is preceded by 5L, use by respondents of the "no pain" option is reduced relative to the responses from people who are asked 3L first. A possible interpretation is that asking 3L (with its short response scale) first visually primes respondents to the first few points on a subsequently encountered 5L scale; while asking 5L (with its longer scale) primes respondents to use later points on a 3L scale encountered subsequently.
Fifth, strongly conflicting responses (mostly where 3L indicated very poor health and 5L indicated very good health) occurred for just over 5% of respondents, a conflict rate that could be large enough to cause serious bias in some types of statistical analysis. The occurrence of conflicting responses is significantly higher among those with a long-standing health problem, and is significantly lower among people completing the interview online.
Sixth, we used the 3L and 5L responses to estimate flexible statistical models for mapping from 5L measurements to the 3L utility basis approved by NICE for its cost-effectiveness work. This was done separately for randomized groups with 5L preceding 3L and 3L preceding 5L, and we found that the fitted models exhibited substantial structural differences, suggesting that question ordering in MI surveys can have major implications for mapping procedures.
Seventh, we used the alternative 5L3L and 3L5L mapping models to revisit two real-life cost-effectiveness studies, finding that the key ICER differed by 5% in one case and 7% in the other. Although there was no difference in the conclusions in either case (both ICER estimates were well above the £30,000 threshold often cited as the upper limit used by NICE), differences of this magnitude could be important for more marginal cost-effectiveness applications. However, comparison with results obtained by the original study authors suggests that question ordering in the MI survey used to estimate the mapping model has much less effect than the choice of econometric approach used to develop the mapping model.
There are some important implications from our findings for the design of future multi-instrument surveys to be used for mapping between health outcome measures. Most important is to be aware of the possibility that sequencing and timing of the 3L and 5L instruments within an interview may affect responses, and this means that there is a strong case for designs that randomize the ordering of the two instruments and possibly also randomize the use of ascending and descending response scales. Such designs would not "solve" the response error problem, but they would allow the analyst to check for the existence of response distortions and carry out robustness checks.
Many MI surveys used in health economics are now web-based because of the lower cost of web interviewing. There is some concern that web surveys introduce bias through differences in the pattern of non-response or changes in respondents' behavior. It is reassuring that we have found no evidence of significant differences between the CASI and CAWI subsamples, except in the probability of strongly conflicting 3L and 5L responses, where web interviews appear to reduce the frequency of conflict (and possibly thereby improve response quality).