Abstract
Summary. Comparing assessments on a subjective scale across countries or socioeconomic groups is often hampered by differences in response scales across groups. Anchoring vignettes help to correct for such differences, either in parametric models (the compound hierarchical ordered probit (CHOPIT) model and extensions) or nonparametrically, by comparing rankings of vignette ratings and self-assessments across groups. We construct specification tests of parametric models, comparing nonparametric rankings with the rankings obtained by using the parametric estimates. Applied to six domains of health, the test always rejects the standard CHOPIT model, but an extended CHOPIT model performs better. This implies a need for more flexible (parametric or semiparametric) models than the standard CHOPIT model.
1. Introduction
Socioeconomic surveys often ask for ratings on some subjective ordinal scale. A typical example is a self-assessed health question on a five-point scale (from excellent to poor, for example). Answers to questions with a subjective scale may depend not only on the objective reality but also on how respondents interpret the subjective answer categories, i.e. the respondents’ reporting behaviour. Usually, the analysis using these questions aims at comparing the objective reality across socioeconomic groups or countries, and differences in reporting behaviour should be corrected for. To identify these differences, King et al. (2004) have proposed the use of anchoring vignettes. These are short descriptions of hypothetical people or situations. Respondents are asked to evaluate one or more vignettes on the same subjective scale as is used to evaluate their own situation. Because the objective reality in a given vignette is the same for all respondents, systematic differences in vignette evaluations across respondents identify differences in reporting behaviour.
King et al. (2004) introduced a parametric model as well as a nonparametric method to use anchoring vignettes for comparing the distributions of the underlying objective reality of the phenomenon of interest in two or more countries or socioeconomic groups. The parametric model is referred to as the compound hierarchical ordered probit (CHOPIT) model. Research using anchoring vignettes has grown rapidly in recent years. The CHOPIT model and parametric extensions are used in studies on health (Bago d'Uva et al., 2008a,b; Vonkova and Hullegie, 2011), healthcare responsiveness (Rice et al., 2010), work disability (Kapteyn et al., 2007), job satisfaction (Kristensen and Johansson, 2008), satisfaction with social contacts (Bonsang and Van Soest, 2012) and life satisfaction (Angelini et al., 2012). The CHOPIT model consists of ordered probit equations (for vignette evaluations and the assessment of the own situation) with thresholds that are common to all equations but, to account for differences in reporting behaviour, can vary with respondent characteristics.
The nonparametric approach has been used much less often; exceptions are King et al. (2004) and King and Wand (2007). This method essentially compares the distributions in different socioeconomic groups of the rank of the respondent's self-evaluation among the same respondent's vignette evaluations. For example, suppose that one given vignette is evaluated by all respondents in groups A and B; suppose that almost everyone in group A evaluates their own health as better than that of the hypothetical vignette person (the benchmark), whereas in group B the majority evaluates the vignette person's health as better than their own. Then the nonparametric method immediately leads to the conclusion that group A is healthier than group B. This conclusion is still valid if the two groups use very different scales, since it is based on the comparison with the vignette evaluation that (by assumption) uses the same scale as the self-assessment. This method does not require any model or covariates (except to distinguish the two groups).
The nonparametric method relies on two assumptions: reporting behaviour of the respondents is the same in the self-assessments and the vignettes (‘response consistency’) and the objective reality of a vignette is perceived in the same way by all respondents (‘vignette equivalence’). These can be called ‘identifying assumptions’ in the sense that the interpretation of the nonparametric ranking comparison relies on them. These assumptions have been tested in recent studies, with mixed results (see Section 4 for some references). In this paper, they are maintained (identifying) assumptions. The parametric model requires additional assumptions: for example, that the objective reality can be modelled as a linear function of observed characteristics and an unobserved component, that the thresholds have a specific functional form, and that the error terms are jointly normal.
In this paper, we compare the rankings that are implied by the parametric model with the nonparametric rankings that come directly from the raw data, using the χ^{2} diagnostic tests that were introduced in Andrews (1988). These can be seen as (mis)specification tests of the parametric model against nonparametric alternatives that lead to different rankings of the self-reports and vignette evaluations. Although many alternative specification tests for the parametric model can be considered, the advantage of our tests is that they have power in a direction that matters: they reject the parametric model if the misspecification implies that using the parametric model leads to biased conclusions concerning ranking comparisons across socioeconomic groups.
We run the tests for six health domains, using data on the population of ages 50 years and older in eight European countries from the 2004 wave of the Survey of Health, Ageing and Retirement in Europe.
We find that the standard CHOPIT model is always rejected, whereas a simple one-parameter extension that allows for unobserved heterogeneity (used by Kapteyn et al. (2007), for example) is rejected for some health domains but not for others. This suggests that conclusions about comparisons across countries or socioeconomic groups based on the standard CHOPIT model may be biased. It also implies that existing tests for vignette equivalence or response consistency that rely on the CHOPIT model may not be valid. The nonparametric method is generally not a viable alternative, since it cannot be used with many covariates and cannot produce counterfactual distributions of self-reported health with benchmark reporting scales. We therefore conclude that there is a need for future work on more flexible parametric or semiparametric models that generalize the CHOPIT model.
The remainder of this paper is organized as follows. Section 2 explains the parametric models. In Section 3, we introduce our diagnostic tests. Section 4 relates the tests to the nonparametric approach. Section 5 presents the data. Section 6 describes the results of Monte Carlo simulations guiding how to implement the tests given the size and nature of our data. Our main results are discussed in Section 7. Section 8 discusses implications of our findings for research using anchoring vignettes.
3. Misspecification tests
There are many ways to test the specification of a fully parametric model in general and of the standard or extended CHOPIT model in particular. For example, Lagrange multiplier tests can be performed against specific parametric extensions, such as heteroscedastic or non-normal errors (Chesher and Irish, 1987). Such tests will be powerful against specific alternatives but not in other directions. Since one of our main goals is to compare health or well-being across countries or socioeconomic groups after purging self-assessments of response scale differences, we consider misspecification tests with power against alternatives that lead to different conclusions concerning such comparisons.
A general category of misspecification tests are the goodness-of-fit tests of Andrews (1988). They partition the product space of outcomes and regressors Y×X into C cells (usually by partitioning Y into M_{Y} cells and X into M_{X} cells and taking all products). Then the sample distribution over the cells is compared with the distribution that is generated by the estimated parametric model: for the given parameter estimates and the given regressor values of each observation, the probability distribution of the dependent variable(s) is fully determined by the distribution of error terms and unobserved heterogeneity, and the probabilities of each cell in the partition of Y can be computed. Averaging over all observations with regressor values in each given cell of the partition of X then gives the cell probabilities that are generated by the parametric model.
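As a small sketch of this averaging step (the function and its inputs are our own illustrative constructs, not part of the CHOPIT estimation code): given per-observation predicted probabilities over the cells of Y and each observation's X cell, the model-generated cell probabilities are

```python
import numpy as np

def model_cell_probabilities(p_y, x_cell, n_x_cells):
    """Average the model-implied Y-cell probabilities over the observations
    in each X cell, weighted by the share of observations in that X cell,
    giving the model-generated distribution over the Y-by-X product cells.

    p_y     : (n, M_Y) predicted Y-cell probabilities per observation
    x_cell  : (n,) index of the X cell each observation belongs to
    """
    out = np.zeros((n_x_cells, p_y.shape[1]))
    for j in range(n_x_cells):
        mask = x_cell == j
        # P(Y cell, X cell j) = share of observations in X cell j times
        # the average predicted Y-cell probability among those observations
        out[j] = p_y[mask].mean(axis=0) * mask.mean()
    return out

# Tiny illustration: 4 observations, 2 Y cells, 2 X cells
p_y = np.array([[0.5, 0.5]] * 4)
x_cell = np.array([0, 0, 1, 1])
cell_probs = model_cell_probabilities(p_y, x_cell, 2)
```

Because the predicted probabilities sum to 1 for each observation, the resulting Y×X cell probabilities also sum to 1.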
Under the null hypothesis that the parametric model is correctly specified, the sample distribution and the distribution that is generated by the model should be similar. If the parameters of the parametric model were known, this could be formalized with a Pearson χ^{2}-test. Andrews (1988) showed that the test statistic can be adjusted to correct for the fact that parameters are estimated by using the same data. This test is in some sense similar to Bayesian posterior predictive checking, which also compares the data with simulated data based on the estimated model, and corrects for parameter uncertainty by averaging over the posterior (Rubin, 1984).
The appropriate test statistic is a quadratic form which asymptotically has a χ^{2}-distribution under the null hypothesis that the parametric model is correctly specified. If, as in our case, the parametric model is estimated by maximum likelihood, Andrews (1988) showed that the test statistic can be obtained from an auxiliary ordinary least squares regression of an n-dimensional vector 1=(1,1,…,1)^{′}, where n is the number of observations, on two groups of regressors: first, for each of the C cells, an n-dimensional vector with, for each observation, the deviation between realizations (1 if the observation is in the given cell; 0 otherwise) and the cell probability according to the model (given the values of X for that observation). This gives an n×C matrix A. The second group is, for all (say L) parameters, the vector of partial derivatives with respect to that parameter of the log-likelihood contributions for all n observations (the ‘scores’), giving an n×L matrix B. These are added to correct for the fact that the parameters are estimated by using the same sample. Under the null of no misspecification, the test statistic T is n times the R^{2} of this regression and is asymptotically χ^{2} distributed with degrees of freedom df equal to the rank of the matrix A. In the usual case of product cells, df=M_{X}(M_{Y}−1), since the deviations sum to 0 over the cells of Y within each cell of X. With H=[A B], the complete n×(C+L) regressor matrix, we have T=1^{′}H(H^{′}H)^{+}H^{′}1, where Z^{+} is the Moore–Penrose inverse of matrix Z. See Andrews (1988), page 154.
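As a numerical sketch of this auxiliary regression (the matrices A and B below are randomly generated stand-ins, not output of an estimated CHOPIT model), the statistic T and its degrees of freedom can be computed as follows:

```python
import numpy as np
from scipy.stats import chi2

def andrews_statistic(A, B):
    """Andrews (1988) statistic T = n * R^2 from the (uncentred) OLS
    regression of a vector of ones on H = [A B].

    A : (n, C) deviations between cell indicators and model cell probabilities
    B : (n, L) score contributions of the estimated parameters
    Returns (T, df), with df equal to the rank of A.
    """
    H = np.hstack([A, B])
    ones = np.ones(H.shape[0])
    # Explained sum of squares 1'H(H'H)^+H'1, using the Moore-Penrose inverse
    T = ones @ H @ np.linalg.pinv(H.T @ H) @ H.T @ ones
    return T, np.linalg.matrix_rank(A)

# Synthetic illustration (uniform cell probabilities, random 'scores'):
rng = np.random.default_rng(0)
n, C, L = 500, 6, 4
cells = rng.integers(0, C, size=n)          # observed cell of each observation
A = np.eye(C)[cells] - np.full(C, 1 / C)    # indicator minus model probability
B = rng.standard_normal((n, L))             # stand-in for the score matrix
T, df = andrews_statistic(A, B)
p_value = chi2.sf(T, df)
```

Because the deviations in A sum to 0 over the cells of Y for each observation, the rank of A (and hence df) is C−1=5 in this illustration.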
To give some intuition for this test, note that a perfect fit, in the sense that, for each cell, the average simulated cell probability equals the observed fraction of observations in that cell, means that 1^{′}A=0. Moreover, the first-order conditions of maximum likelihood always imply that 1^{′}B=0. A perfect fit therefore implies T=0 and not rejecting the null. The explained sum of squares can be seen as a measure of how well observed cell fractions are reproduced by the model. However, since the test statistic is based on estimated parameters, a statistic based on A alone does not have a χ^{2}-distribution. As shown by Andrews (1988), adding the scores in B raises the explained sum of squares in the case of an imperfect fit and leads to a test statistic with a χ^{2}-distribution.
Different partitions of Y and X give different tests, with power in different directions. We shall use the partition of Y that is the basis for the nonparametric approach described below in Section 4. This partition ‘matters’ in the sense that this kind of ranking comparison is one of the main goals of using anchoring vignettes. Since the test is asymptotic, the number of observations must be large to guarantee that the actual size of the test is approximately equal to the asymptotic size of 5%. In practice, this means that we have to merge cells to guarantee that the number of observations in each cell is reasonably large. We performed simulations to compute the actual size of the test for various partitions, and our choice of cells is based on these simulation outcomes; see Section 6.
4. Nonparametric approach
The nonparametric approach is explained by King et al. (2004) and King and Wand (2007). It compares, across countries or socioeconomic groups, the distribution of the underlying objective reality purged of differences in reporting behaviour. This is done by comparing where the self-assessments are placed on the scale fixed by the vignette evaluations in each country or group. The numerical example in Table 1 with only one vignette illustrates this.
Table 1. Joint distribution (%) of self-assessments and vignette evaluations in countries A and B
Self-assessment  Vignette evaluation: 1 (none)  2 (mild)  3 (moderate)  4 (severe)  5 (extreme)  All
Country A
1 (no problem)  4  4  4  4  4  20
2 (mild problem)  4  4  4  4  4  20
3 (moderate problem)  4  4  4  4  4  20
4 (serious problem)  4  4  4  4  4  20
5 (extreme problem)  4  4  4  4  4  20
All  20  20  20  20  20  100
Country B
1 (no problem)  16  4  4  4  0  28
2 (mild problem)  8  4  4  4  0  20
3 (moderate problem)  8  4  4  4  0  20
4 (serious problem)  8  4  4  4  0  20
5 (extreme problem)  0  4  4  4  0  12
All  40  20  20  20  0  100
These cross-tabulations give the joint distributions of self-assessments and vignette evaluations of health problems in a given domain in countries A and B. Looking at the (marginal) distribution of the self-assessments only (the final column) would lead to the conclusion that respondents in country B face fewer health problems than respondents in country A, assuming that they use the same response scales. The difference in the marginal distribution of the vignette evaluations (the final rows), however, shows that this assumption is incorrect: respondents in country A evaluate a given health problem as more problematic, on average, and this may be an alternative explanation for the cross-country difference in self-assessments.
The nonparametric approach simply compares the relative distributions, i.e. how the self-assessments rank compared with the vignette evaluations in the two countries. The relative rankings RR are as in Table 2.
Table 2. Relative ranking RR of self-assessments and vignette evaluations by country
RR  Country A (%)  Country B (%)
1: self-assessment < vignette evaluation  40  24
2: self-assessment = vignette evaluation  20  28
3: self-assessment > vignette evaluation  40  48
The distribution of RR in country A is stochastically dominated by that in country B, showing that, accounting for differences in response behaviour, the health problems in B are more serious than in A. This is the reverse of the conclusion based on self-assessments only.
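The relative rankings in Table 2 can be reproduced directly from the joint distributions in Table 1; a minimal sketch (the function name is ours):

```python
import numpy as np

def rr_distribution(joint):
    """Distribution of RR (self < / = / > vignette) from a joint percentage
    table with self-assessments in rows and vignette evaluations in columns
    (both ordered from 1, no problem, to 5, extreme problem)."""
    joint = np.asarray(joint, dtype=float)
    lower = np.sum(np.triu(joint, k=1))    # self-assessment < vignette evaluation
    equal = np.trace(joint)                # self-assessment = vignette evaluation
    higher = np.sum(np.tril(joint, k=-1))  # self-assessment > vignette evaluation
    return float(lower), float(equal), float(higher)

country_a = np.full((5, 5), 4.0)       # Table 1, country A: 4% in every cell
country_b = [[16, 4, 4, 4, 0],
             [ 8, 4, 4, 4, 0],
             [ 8, 4, 4, 4, 0],
             [ 8, 4, 4, 4, 0],
             [ 0, 4, 4, 4, 0]]         # Table 1, country B

print(rr_distribution(country_a))  # (40.0, 20.0, 40.0), as in Table 2
print(rr_distribution(country_b))  # (24.0, 28.0, 48.0)
```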
This example has only one vignette. King et al. (2004) considered the case with K>1 vignettes, assuming that the evaluations of the vignettes are ranked in the same way by each respondent and that no respondent gives the same rating to any two vignettes. In that case a self-assessment can fit in any of 2K+1 positions in the given ranking of the vignette evaluations. The nonparametric approach then compares the distributions over the 2K+1 positions across countries or groups. See King et al. (2004), pages 195–196, for an empirical illustration.
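For strictly ordered, untied vignette evaluations, the 2K+1 possible positions of a self-assessment can be computed as follows (a sketch with our own naming; ties are deliberately not handled here):

```python
def self_position(y_self, vignettes):
    """Position of a self-assessment among K strictly increasing vignette
    evaluations: 1 = below all vignettes, 2K+1 = above all vignettes,
    even positions = tied with one of the vignettes."""
    assert all(a < b for a, b in zip(vignettes, vignettes[1:])), "no vignette ties"
    pos = 1
    for v in vignettes:
        if y_self < v:
            return pos          # strictly below this vignette
        if y_self == v:
            return pos + 1      # tied with this vignette
        pos += 2                # strictly above: skip the tie position
    return pos                  # above all K vignettes: position 2K+1

# K = 3 vignettes rated 2 (mild), 3 (moderate) and 4 (severe):
print(self_position(1, (2, 3, 4)))  # 1: fewer problems than all vignettes
print(self_position(3, (2, 3, 4)))  # 4: tied with the middle vignette
print(self_position(5, (2, 3, 4)))  # 7: more problems than all vignettes
```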
King and Wand (2007) discussed the more realistic case with ties, i.e. situations where a respondent assigns the same rating to several vignettes, or where a respondent rates the vignettes in a way that does not respect the ranking of the vignette evaluations that is used by the majority (which is often the natural ranking, given the content of the vignettes). For K=3 vignettes, Table 3 presents a complete listing of all possible rankings of vignettes and self-assessments, generalizing Table 1 in King and Wand (2007). The natural ordering of the vignette ratings is Y_{1}<Y_{2}<Y_{3}. The seven situations in the upper left-hand panel respect this ordering; the remainder looks at ties. Some of these are non-problematic since all that matters is the position of the self-assessment Y_{s}. For example, the situation Y_{s}<Y_{1}=Y_{2}<Y_{3} puts Y_{s} in the same position as Y_{s}<Y_{1}<Y_{2}<Y_{3}. For the nonparametric comparison, the two will be merged, as indicated by assigning 1 to both of them (column C). But, in other situations, the position of Y_{s} is ambiguous. For example, take the case Y_{3}<Y_{s}<Y_{1}<Y_{2}. If Y_{3} is correct (and Y_{1} and Y_{2} are misreported), we are in situation 7 (Y_{s} is worse than all vignettes) but, if Y_{1} and Y_{2} are correct (and Y_{3} is misreported), we are in situation 1 (Y_{s} better than all vignettes). We therefore cannot say anything about the true position of Y_{s} and classify this case as 1–7.
The nonparametric method categorizes observations into specific cells. The 19 labels in Table 3 define a partition of the set Y of possible realizations of the observed dependent variables (self-assessments and vignette evaluations) into 19 cells. If the population consists of countries A and B and X is a country dummy (the only ‘regressor’), then the two countries form a partition of the set of all values of X. The nonparametric comparison then partitions Y×X into 19×2=38 cells. Ideally, the number of observations in the 12 cells other than those labelled 1,2,…,7 should be so small that these cells can be discarded.
To interpret the nonparametric results, we need the assumptions of response consistency and vignette equivalence that have already been referred to above. These are the identifying assumptions in this framework and we consider them as maintained hypotheses; tests of response consistency and vignette equivalence are discussed elsewhere and are not the topic of this paper. See, for example, van Soest et al. (2011) or Datta Gupta et al. (2010) for tests of response consistency using additional information in the form of a measure on an objective scale; see Peracchi and Rossetti (2013) for a joint test of the overidentifying restrictions that are implied by response consistency and vignette equivalence; and see Bago d'Uva et al. (2011) or Corrado and Weeks (2010), who tested response consistency conditional on vignette equivalence by testing the joint significance of covariates added to the equation for vignette responses.
Finally, we want to emphasize that the nonparametric approach has limited applicability and cannot replace parametric models. Unlike the CHOPIT model or its parametric extensions, the nonparametric approach cannot deal with many covariates; nor can it be used to produce counterfactual distributions of self-reported health with benchmark reporting scales (as in, for example, Kapteyn et al. (2007)). The nonparametric approach is useful only for making comparisons across a few socioeconomic groups or (groups of) countries, and only if such a comparison is not hampered by ties that make it very difficult or impossible to interpret the nonparametric results.
5. Data
We use data from the Survey of Health, Ageing and Retirement in Europe collected in 2004. The survey is a broad socioeconomic survey among the population of ages 50 years and older and their spouses in 11 European countries; see Börsch-Supan and Jürges (2005) for details on the design and set-up. All respondents first had a personal interview and were then asked to complete a short pencil-and-paper questionnaire. In eight countries (Belgium, France, Germany, Greece, Italy, the Netherlands, Spain and Sweden), random subsamples were given an additional questionnaire with self-assessments and vignettes on health and work disability. Here we focus on the health questions, which were also used by, for example, Lardjane and Dourgnon (2007) and Bago d'Uva et al. (2008a). Self-assessments and three vignettes were collected for six health domains: breathing, concentration (and memory skills), depression, mobility, bodily pains and sleeping. The wordings of these questions are given in Appendix A. All questions use the same five-point scale: none, mild, moderate, severe or extreme. The three vignettes in each domain are ordered, with one vignette (labelled k=1) describing a mild health problem, the second describing a worse problem (k=2) and the third describing the most severe health problem (k=3). This order is used as the natural order in the nonparametric approach. The vignette subsample has 4544 respondents in total. Owing to missing observations, we have about 4370 respondents for each domain. Precise sample sizes and descriptive statistics for self-assessments and vignettes are presented in Table 4. For each domain, most respondents rate their own health problems as none or mild. Severe or extreme health problems were reported by about 6.5% of the respondents, on average across the domains (from 3.57% for breathing to 9.23% for sleep). The majority reported no problem with mobility or breathing but, for the other domains, the ‘none’ answers are a minority. In particular, pain problems are quite common, with less than a third reporting none.
Table 4. Distributions (%) of self-assessments and vignette evaluations†
Evaluation  Breathing  Concentration
  Self  v_{1}  v_{2}  v_{3}  Self  v_{1}  v_{2}  v_{3}
None  64.58  10.81  2.29  2.45  43.98  22.15  5.25  2.01
Mild  22.23  24.11  5.15  2.22  35.17  48.63  27.08  8.90
Moderate  9.62  38.00  19.76  8.59  16.22  22.79  44.37  29.79
Severe  3.04  24.08  52.20  44.25  4.24  6.09  20.73  47.33
Extreme  0.53  3.00  20.60  42.49  0.39  0.34  2.58  11.98
  Depression  Mobility
None  49.51  6.45  2.31  2.20  58.39  9.64  2.31  1.55
Mild  28.70  44.17  13.44  2.59  22.17  34.73  11.83  5.91
Moderate  14.97  36.19  45.75  10.80  13.04  42.84  38.80  27.49
Severe  5.29  11.56  33.53  42.57  5.23  11.97  40.37  48.80
Extreme  1.53  1.63  4.97  41.84  1.16  0.82  6.69  16.24
  Pain  Sleep
None  32.28  15.59  2.31  1.12  42.68  2.65  1.92  1.87
Mild  35.81  56.96  18.06  5.31  28.06  21.50  9.73  7.31
Moderate  23.10  22.09  50.76  26.10  20.04  47.98  29.02  27.00
Severe  7.14  4.78  25.71  48.63  7.36  24.03  42.49  41.70
Extreme  1.67  0.57  3.16  18.84  1.87  3.84  16.84  22.12
The vignette evaluations reflect the level of the health problems of the hypothetical people in the vignettes. As expected, the person in the third vignette in each domain was typically evaluated as least healthy, followed by the person in the second vignette.
The share of respondents with vignette ratings is about 82% (89.0% for pain, 87.8% for depression, 84.2% for concentration, 82.6% for breathing, 77.0% for mobility and 68.9% for sleep). About 16% of the respondents have interval ties (8.3% for breathing, 13.5% for depression, 14.6% for mobility, 17.5% for pain, 20.4% for sleep and 21.2% for concentration).
For the parametric model in Section 2, we use the background variables that are reported in Table 5. The average age of the respondents is 63 years. We distinguish three education levels: 35% of the sample have low education (international standard classification of education levels 0 and 1), 45% intermediate education (levels 2 and 3) and 20% high education (levels 4, 5 or 6). Most respondents are women (55.6%) and most do not live alone (74.4%).
Table 5. Background variables (mean (age) or percentage (other variables))†
Country  Value  Variable  Value
Belgium  12.48  male  44.43 
France  19.48  education low  35.54 
Germany  11.18  education middle  44.60 
Greece  15.85  education high  19.86 
Italy  9.79  age—mean  63.06 
Netherlands  11.84  age—standard deviation  10.01 
Spain  10.21  alone  25.58 
Sweden  9.18  not alone  74.42 
6. Simulations: how to choose the cells
Our tests rely on asymptotic theory that keeps the number of cells fixed while the number of observations goes to ∞ (see Andrews (1988)). As a consequence, the finite sample properties of the test may be poor if some cells have few observations. The same problem arises with the classical Pearson χ^{2}-test, where a common rule of thumb is that no more than 20% of the expected cell counts should be less than 5 and all expected counts should be at least 1 (Yates et al. (1999), page 734). No such rule of thumb is available for the Andrews test. We therefore performed Monte Carlo simulations to determine whether, for several given choices of the cells, the actual size of the tests in our finite sample of about 4400 observations approximates the asymptotic size of 5%.
First, we estimated the CHOPIT model for one health domain, pain, using the actual data on the 4368 complete observations. The estimates are presented in Table 1 of the online appendix. Using these estimates and the actual values of the covariates X, we generated 300 new data sets, all with the same covariates as the real data, but with different values of the dependent variables (three vignette evaluations and one self-assessment for each observation), constructed from the values of the covariates and independent draws of the error terms in the CHOPIT model. These data sets are all generated by using the CHOPIT model, so they satisfy the null hypothesis of no misspecification. For each given choice of cells, we then performed the Andrews test on all 300 data sets, using a nominal size of 5%. The fraction of times that the null hypothesis is rejected is approximately equal to the actual size of the test for the given sample size and choice of cells. The difference between the actual and nominal (5%) size is due to the deviation between the finite sample and the asymptotic distribution of the test statistic.
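The procedure can be illustrated in stylized form. The toy version below replaces the CHOPIT data generating process by a multinomial distribution with known cell probabilities and uses a plain Pearson test (so there is no correction for estimated parameters); the cell probabilities are illustrative, not our estimates. It shows how the actual size is estimated as a rejection frequency over simulated data sets:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)
n_obs, n_reps, alpha = 4400, 300, 0.05
cell_probs = np.array([0.34, 0.24, 0.16, 0.12, 0.09, 0.05])  # illustrative only

rejections = 0
for _ in range(n_reps):
    # Data generated under the null: the test's model is the true model
    counts = rng.multinomial(n_obs, cell_probs)
    stat, p = chisquare(counts, f_exp=n_obs * cell_probs)
    rejections += p < alpha

actual_size = rejections / n_reps  # should be close to the nominal 5%
```

With known parameters and large expected cell counts, the rejection frequency stays close to the nominal size; shrinking the cells (or estimating parameters without the Andrews correction) is what drives it away from 5%.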
The results are presented in Table 6. The final column (D10, 19 cells) uses the 19 cells of Y that form the basis of the nonparametric approach in Section 4, combined with several partitions of X. The actual size of the test is much larger than the nominal size of 5%: the tests heavily overreject. The reason is that cell sizes are often too small: even without partitioning X, 15 of the 19 cells have expected cell size less than 5, so the rule of thumb for the simple Pearson χ^{2} goodness-of-fit test is not satisfied at all (see Table 2 of the online appendix). Cells corresponding to ties where the natural order of the evaluations of the three vignettes is not respected are particularly small (which supports the quality of the data). The problem of small cells grows even worse if we further split up the cells by partitioning X.
Table 6. Simulation results: percentage of rejections of the null hypothesis at significance level 5%†
Variable X  Rejections (%) for the following partitions:
  D1, 3 cells  D2, 4 cells  D3, 5 cells  D4, 5 cells  D5, 6 cells  D6, 7 cells  D7, 8 cells  D8, 9 cells  D9, 10 cells  D10, 19 cells
Partitions based on Y only
(none)  8.33  7.33  7.67  7.33  7.33  10.00  10.67  8.00  8.67  25.33
Partitions based on Y×X
sex  7.67  6.00  6.67  6.33  5.33  6.33  5.67  4.67  6.67  70.33
country  7.67  6.67  6.67  6.67  6.67  8.33  9.67  8.67  9.33  64.67
alone  8.67  6.33  6.67  7.00  6.33  9.00  9.00  9.33  10.33  77.67
education  8.33  7.67  9.00  8.67  8.00  10.00  8.67  10.00  11.33  96.67
age  6.33  7.00  6.67  7.00  6.67  9.33  9.33  10.33  11.00  93.00
To avoid the problem of small cells, we can partition Y into fewer cells. Nine natural ways of doing this are presented in Table 7. The resulting actual sizes of the tests using these partitions of Y, combined with various partitions of X, are given in the additional columns of Table 6. For example, the column ‘D1, 3 cells’ partitions Y into three categories: rankings where the self-assessment unambiguously indicates fewer pain problems than all three vignettes, rankings where the self-assessment is rated the same as the vignette with the least pain problems, and all other rankings. The first two are by far the most frequent rankings in the data (34% and 24% of the observations). The column ‘D9, 10 cells’ takes the seven rankings that respect the natural vignette ordering as separate cells and merges the remaining 12 cells of Y into three. This is already enough to guarantee that the rule of thumb that at most 20% of all cells have expected cell size less than 5 is satisfied. The other columns are intermediate cases where Y is partitioned into 4–9 cells. When Y is partitioned into at most six cells, the expected cell size is always larger than 10; in the other cases, the expected cell size is often between 5 and 10, particularly if X is partitioned into three cells by using education.
Table 7. Description of how the 19 original cells are merged†
Division  Number of larger cells  Merging of the 19 original cells
D1  3  {1},{2},{3,…,19} 
D2  4  {1},{2},{3,4,8,…,14},{5,6,7,15,…,19} 
D3  5  {1},{2},{3,4},{5,6,7},{8,…,19} 
D4  5  {1,2},{3,4},{5,6,7},{8,…,15},{16,…,19} 
D5  6  {1},{2},{3,4},{5,6,7},{8,…,15},{16,…,19} 
D6  7  {1},{2},{3,4},{5,6,7},{8,…,11},{12,…,15},{16,…,19} 
D7  8  {1},{2},{3},{4},{5,6,7},{8,…,11},{12,…,15},{16,…,19} 
D8  9  {1},{2},{3},{4},{5,6},{7},{8,…,11},{12,…,15},{16,…,19} 
D9  10  {1},{2},{3},{4},{5},{6},{7},{8,…,11},{12,…,15},{16,…,19} 
D10  19  {1},…,{19} 
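The merging schemes in Table 7 are straightforward to encode; for example, D1 and D5 (a sketch with our own naming):

```python
# Map each of the 19 original cell labels (1..19) to a merged cell index,
# following the groupings in Table 7.
def make_merge_map(groups):
    return {label: g for g, group in enumerate(groups) for label in group}

D1 = [{1}, {2}, set(range(3, 20))]
D5 = [{1}, {2}, {3, 4}, {5, 6, 7}, set(range(8, 16)), set(range(16, 20))]

merge_d5 = make_merge_map(D5)
# A respondent in original cell 9 falls into merged cell index 4 under D5:
print(merge_d5[9])  # 4
```

Relabelling each observation's original cell through such a map yields the merged partition of Y used in the tests.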
The results show that merging cells helps to bring the actual size of the tests closer to the nominal size of 5%. The actual size varies between 4.7% (Y partitioned into nine cells; X partitioned by using gender) and 11.3% (Y partitioned into 10 cells; X partitioned by using education). The test still tends to reject too often: the actual size is almost always larger than 5% and increases somewhat when Y has more than six cells. It seems safe to conclude that we can perform the tests by using the given partitions of Y into 3–6 cells, though we should take into account that the actual size of a test with nominal size 5% varies from around 5% to about 10%.
7. Empirical results
This section presents the test results for each domain and for various partitions of Y and X. For each health domain, we first estimated the parametric model of Section 2. As an example, the parameter estimates for concentration are presented in Table 8; estimates for the other domains are available on request. The first column shows how, according to the parametric model, problems with concentration are associated with individual characteristics and country dummy variables, keeping response scales constant. Most results here are plausible and confirm findings in the literature (e.g. Bago d'Uva et al. (2008a)).
Table 8. Estimates of the parametric model for concentration

  β_{s}  γ_{1}  γ_{2}  γ_{3}  γ_{4}
  Coefficient  t-value  Coefficient  t-value  Coefficient  t-value  Coefficient  t-value  Coefficient  t-value
constant  0  —  −0.1848  −3.9786  0.2037  5.5851  0.1117  2.8092  0.1820  3.1522 
Belgium  0.0362  0.5715  −0.0961  −2.0688  0.1161  3.0912  −0.1385  −3.4545  −0.0849  −1.3969 
Germany  0.1044  1.5750  0.2324  5.0074  −0.1477  −3.6080  −0.2057  −4.8098  −0.0913  −1.4099 
Greece  0.0489  0.8311  0.4653  11.6894  −0.2131  −5.6407  −0.3526  −9.0218  −0.1239  −2.1732 
Italy  0.1160  1.7402  0.2563  5.6073  −0.1153  −2.7832  −0.2790  −6.6135  −0.0896  −1.3732 
Netherlands  −0.2155  −3.2028  −0.0843  −1.7336  0.0843  2.0973  −0.2964  −6.5579  −0.3917  −6.4416 
Spain  −0.1173  −2.0938  0.0171  0.3475  −0.2393  −5.2428  −0.1450  −3.8082  0.2399  4.2768 
Sweden  −0.8136  −11.3444  −0.4506  −6.6301  −0.4259  −6.5886  −0.3165  −6.0508  0.1623  3.2131 
male  −0.0819  −2.1414  0.0560  2.2164  −0.0346  −1.5160  0.0076  0.3226  −0.0459  −1.5449 
age 55−  −0.0145  −0.3038  0.0457  1.4469  −0.0341  −1.1783  −0.0448  −1.4627  0.0024  0.0615 
age 66−75  0.1942  3.9206  −0.0862  −2.6178  0.0162  0.5655  0.0764  2.6049  0.0434  1.1049 
age 76+  0.4663  7.4514  −0.0352  −0.8465  −0.0236  −0.5972  0.0371  0.9748  0.0849  1.7471 
education low  0.2557  6.2358  0.0807  2.6240  −0.1078  −3.8602  0.0121  0.4370  −0.0562  −1.4797 
education high  −0.1710  −3.1905  0.0046  0.1325  −0.0381  −1.2653  −0.0771  −2.2560  −0.0066  −0.1679 
alone  0.1622  3.7273  0.0302  1.0282  −0.0396  −1.4765  −0.0065  −0.2473  0.0648  1.8565 
  Coefficient  Standard error
vignette 1  0.5436  0.0250
vignette 2  1.3237  0.0237
vignette 3  2.0452  0.0294
σ_{s}  1  —
σ_{u}  0.4030  0.0115
σ_{v}  0.6968  0.0151
The other columns concern the threshold parameters. Many variables are significant, implying that not accounting for differences in reporting behaviour would lead to biased estimates of the parameters of main interest in the first column. The vignette dummy variables in the bottom panel have the expected ranking, since the first vignette describes the mildest problem, and so on. Finally, the standard deviation of the unobserved heterogeneity term is quite precisely estimated, with 95% confidence interval [0.380, 0.426]. This suggests that extending the CHOPIT model with unobserved heterogeneity is useful, even though the role of this unobserved heterogeneity term is smaller than the roles of the noise terms ε_{si} and ε_{vi}, v=1,2,3.
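As a check on the quoted interval, the 95% confidence interval for σ_u follows from the point estimate and standard error in Table 8 under the usual normal approximation:

```python
# 95% normal-approximation confidence interval for sigma_u (Table 8).
est, se = 0.4030, 0.0115
z = 1.96                            # standard normal 97.5% quantile
lo, hi = est - z * se, est + z * se
print(f"[{lo:.3f}, {hi:.3f}]")      # prints [0.380, 0.426]
```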
For each misspecification test, characterized by a different partition of Y×X, the cell probabilities according to the parametric model were computed (numerically), using the given values of the regressors for all observations in the sample and the estimates of the parametric model. Each test compares such a distribution with the corresponding distribution in the raw data. We performed many tests for different partitions and health domains and two model specifications, and the test statistics are often not independent of each other, since they rely on the same data (with the same or correlated dependent variables). The results should therefore be interpreted with care because of the issue of multiple-hypothesis testing: each test should be interpreted in isolation, and different tests cannot be combined into a joint conclusion about some common hypothesis.
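In stylized form, the comparison underlying each test looks as follows. All inputs below (sample size, number of merged cells, probabilities) are simulated placeholders rather than the paper's data or code; the point is only the shape of the computation: average the model-implied cell probabilities over observations and set the result against the empirical cell frequencies.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 4                        # placeholder sample size and merged cells

# stand-in for P(cell j | x_i, estimated parameters), one row per observation
probs = rng.dirichlet(np.ones(k), size=n)
predicted = probs.mean(axis=0)       # model-implied cell distribution

# stand-in for the observed cell memberships in the raw data
cells = np.array([rng.choice(k, p=p) for p in probs])
observed = np.bincount(cells, minlength=k) / n   # empirical cell distribution

discrepancy = observed - predicted   # the vector the misspecification test weighs
```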
The test results for the standard CHOPIT model are easy to summarize (and do not require a table): the null hypothesis of a correctly specified model is always rejected at the 5% or even the 1% level, for all health domains and for all partitions of Y×X. The p-values for the CHOPIT model extended with unobserved heterogeneity in the thresholds show more variation and are presented in Table 9.
Table 9. Goodness of fit for all health domains†

  p-values for breathing  p-values for concentration
  D1  D2  D3  D4  D5  D1  D2  D3  D4  D5
Y only  r  r  r  r  r  0.106  0.212  0.187  0.029  0.030 
Y× sex  r  r  r  r  r  0.172  0.336  0.210  0.085  0.068 
Y× country  r  r  r  r  r  r  r  r  0.011  r 
Y× alone  r  r  r  r  r  0.027  0.073  0.122  0.119  0.025 
Y× education  r  r  r  r  r  r  r  r  r  r 
Y× age  r  r  r  r  r  0.099  0.268  0.074  0.007  0.014 
  p-values for depression  p-values for mobility
Y only  0.050  0.008  0.003  r  r  r  r  r  r  r 
Y× sex  r  r  r  r  r  r  0.002  0.001  r  r 
Y× country  r  r  r  r  r  r  r  r  0.011  r 
Y× alone  0.139  0.024  0.013  r  r  r  r  r  r  r 
Y× education  r  r  r  r  r  0.001  0.004  r  r  r 
Y× age  0.012  0.006  0.004  0.002  r  0.003  0.007  0.001  r  r 
  p-values for pain  p-values for sleep
Y only  0.168  0.304  0.009  0.003  0.003  r  r  r  r  r 
Y× sex  0.064  0.176  r  r  r  r  r  r  r  r 
Y× country  r  r  r  r  r  r  r  r  r  r 
Y× alone  0.030  0.051  0.001  r  r  r  r  r  r  r 
Y× education  r  r  r  r  r  r  r  r  r  r 
Y× age  0.364  0.524  0.002  r  r  r  r  r  r  r 
First consider the tests partitioning Y only. For concentration, the null hypothesis is not rejected at the 1% level for any partition of Y, and not at the 5% level for the three partitions with the smallest numbers of cells. Taking into account that the simulations have shown that the tests tend to overreject, even the p-values of 0.029 and 0.030 seem supportive of this parametric model. For breathing, mobility and sleep, the parametric specification is always rejected, no matter which partition of Y is used. For the other two domains, the results are mixed: the null is not rejected at the 5% level for a partition into only three cells (depression) or into three or four cells (pain), but it is rejected at the 1% level for the partitions with more cells.
For the partition of Y into four cells, the observed and predicted probabilities on which the tests rely are presented in Table 10. For pain, concentration and depression, the maximum difference is about 1 percentage point; for mobility and sleep it increases to 1.5 percentage points, and for breathing to 2.4 percentage points. Comparing the two distributions suggests that the differences are not that large, even for breathing, where the null is firmly rejected. Because of the large sample sizes, however, modest differences are apparently sufficient to reject the null hypothesis.
Table 10. Observed and predicted distributions (cells constructed by using partition D2 of Y only)†

D2 cells  Distributions (%) for breathing  Distributions (%) for concentration  Distributions (%) for depression
  Observed  Predicted  Observed  Predicted  Observed  Predicted
{1}  71.09  72.12  38.12  38.84  57.73  58.60 
{2}  13.74  14.60  23.56  22.53  17.01  16.64 
{3,4,8,…,14}  12.13  9.73  26.32  26.58  18.59  17.39 
{5,6,7,15,…,19}  3.04  3.55  12.00  12.05  6.68  7.38 
 Distributions (%) for mobility  Distributions (%) for pain  Distributions (%) for sleep 
{1}  63.26  63.03  34.43  34.93  61.37  62.93 
{2}  13.25  14.79  23.81  22.79  10.40  10.40 
{3,4,8,…,14}  15.00  13.59  25.09  25.50  17.61  16.00 
{5,6,7,15,…,19}  8.50  8.59  16.67  16.80  10.62  10.67 
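The large-sample point can be made concrete with the breathing column of Table 10 and a naive Pearson statistic (the paper's test additionally adjusts for parameter estimation, and the sample size below is hypothetical, not the paper's n):

```python
import numpy as np

obs = np.array([71.09, 13.74, 12.13, 3.04]) / 100   # observed shares, Table 10
pred = np.array([72.12, 14.60, 9.73, 3.55]) / 100   # predicted shares, Table 10
n = 10_000                                          # hypothetical sample size

pearson = n * np.sum((obs - pred) ** 2 / pred)      # naive Pearson statistic
# The statistic is roughly 73, far above the chi-square(3) 5% critical value
# of 7.815, although no cell share differs by more than 2.4 percentage points.
```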
The tests using partitions of Y only essentially explore whether the parametric model reproduces the ranking distribution of vignette evaluations and self-assessments, which is a feature of the marginal distribution of the dependent variables. This does not yet correspond to the nonparametric approach, which compares two such rankings distinguished on the basis of X (e.g. two countries or groups of countries, men and women, etc.). This is why we also consider partitions of Y×X. Each partition of X gives a test with power in a specific direction, corresponding to the cells that are used by the nonparametric approach for comparing specific groups.
The remaining rows in Table 9 present the p-values for such tests. To guarantee sufficiently large sample sizes, we consider only partitions of X into two or three cells, leading to a partition of Y×X into between 2×3=6 and 3×6=18 cells. First, we consider the partition Y×country, where the countries are divided into two groups: southern Europe (Greece, Spain and Italy) and the rest (Belgium, France, Germany, the Netherlands and Sweden). This north–south division corresponds to the systematic differences that have been found in many studies based on the Survey of Health, Ageing and Retirement in Europe; see, for example, many chapters in Börsch-Supan et al. (2005). In addition, we perform the test for the partitions Y×sex; Y×age, with age categorized as younger than 56, 56–65 and older than 65 years; Y×education, with education categorized as low, middle and high; and finally Y×alone, distinguishing respondents living alone from those who are not.
As expected on the basis of the results for the partition of Y only, the null hypothesis is rejected at the 5% or even the 1% significance level for all partitions of Y×X for breathing, sleep and mobility: if the parametric model cannot even reproduce the marginal distribution over the Y cells, it cannot give a good fit to the bivariate Y×X cells either. For the other domains, pain, concentration and depression, the results are mixed. p-values exceeding 5% are found for the partitions Y×sex and Y×age for pain and for concentration. However, for all domains, the null hypothesis is rejected for a partition of X by country or level of education. The tests therefore do not support the use of the parametric model for comparing across (southern versus northern) countries or levels of education.
Table 11 presents the predicted and actual distributions over the four cells in partition D2 for southern and northern countries, illustrating the magnitude of the differences. In most cases, the differences do not affect the qualitative conclusions on cross-country differences. For breathing, for example, 83.8% of southern respondents report fewer problems than any of the vignettes, compared with 63.8% in the north, suggesting that, after correcting for response scale differences, people aged 50 years or older have greater problems with breathing in the north than in the south. Using the parametric model's predictions, the percentages are 83.0% and 65.9%, and the conclusion remains the same. Other domains lead to the same qualitative conclusion: whether we use the raw data or the parametric predictions sometimes changes which cell frequency is larger, but does not affect the conclusion on where health in the given domain is better. This conclusion also holds for other partitions of X (see Tables 3–6 of the online appendix).
Table 11. Example of observed and predicted distributions (cells constructed by using partition D2 of Y×country)†

D2 cells  Distributions (%) for the south  Distributions (%) for the north
  Observed  Predicted  Observed  Predicted
 Breathing 
    
{1}  83.79  82.96  63.80  65.92 
{2}  7.30  10.40  17.44  16.99 
{3,4,8,…,14}  6.72  5.35  15.23  12.24 
{5,6,7,15,…,19}  2.19  1.29  3.53  4.85 
 Concentration 
    
{1}  35.80  33.17  39.44  42.12 
{2}  20.95  24.01  25.06  21.67 
{3,4,8,…,14}  28.34  28.44  25.17  25.50 
{5,6,7,15,…,19}  14.91  14.38  10.32  10.70 
 Depression 
    
{1}  57.02  56.80  58.15  59.62 
{2}  13.91  16.56  18.78  16.69 
{3,4,8,…,14}  20.42  18.15  17.52  16.95 
{5,6,7,15,…,19}  8.65  8.49  5.55  6.74 
 Mobility 
    
{1}  65.90  64.25  61.73  62.33 
{2}  11.96  14.96  13.98  14.69 
{3,4,8,…,14}  13.54  12.89  15.86  13.98 
{5,6,7,15,…,19}  8.60  7.89  8.43  8.99 
 Pain 
    
{1}  33.04  30.07  35.24  37.74 
{2}  20.55  23.36  25.70  22.45 
{3,4,8,…,14}  26.06  27.34  24.53  24.44 
{5,6,7,15,…,19}  20.35  19.24  14.52  15.37 
 Sleep 
    
{1}  64.16  64.28  59.76  62.15 
{2}  7.29  10.64  12.19  10.26 
{3,4,8,…,14}  17.82  14.95  17.49  16.61 
{5,6,7,15,…,19}  10.73  10.12  10.56  10.99 
To see why the tests often reject the parametric model, we performed several sensitivity checks. First, Table 11 indicates that predicted and actual cell frequencies are not hugely different, but the differences are apparently sufficiently large to reject. A possible reason is that the Andrews test accounts for the fact that the parameters are estimated to fit the (same) data, by including the likelihood scores in the auxiliary regression that is used to compute the test statistic (the matrix B in Section 3). To see whether this matters, we also computed the test statistic without the likelihood scores. Table 7 of the online appendix shows the resulting p-values, which are often much higher than the correct p-values in Table 9. For sleep, for example, the fact that parameters are estimated apparently makes it likely that, under the null hypothesis, predicted and observed frequencies are very similar. Adjusting for this implies that the null hypothesis is already rejected for quite modest differences between predicted and actual cell sizes. For breathing, in contrast, all p-values remain virtually 0. Here the discrepancies are such that the null would also be firmly rejected if the parameter values were given instead of estimated.
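The role of the likelihood scores can be sketched in highly stylized form. The code below is ours and purely illustrative (random placeholder moments, not the paper's Andrews-test implementation): the statistic is a quadratic form in the averaged moment contributions, and it changes once the score directions are taken into account.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, q = 400, 3, 2
d = 0.1 * rng.normal(size=(n, k))    # placeholder cell-indicator deviations
s = rng.normal(size=(n, q))          # placeholder likelihood-score columns

def quad_form(m):
    """n * mbar' V^{-1} mbar, with V the sample covariance of the moments."""
    mbar = m.mean(axis=0)
    V = np.cov(m, rowvar=False)
    return n * mbar @ np.linalg.solve(V, mbar)

stat_naive = quad_form(d)                        # ignores parameter estimation
proj = s @ np.linalg.lstsq(s, d, rcond=None)[0]  # part of d explained by scores
stat_adjusted = quad_form(d - proj)              # accounts for score directions
```

Ignoring the estimation step changes the null distribution of the statistic, which is the mechanism behind the difference between the uncorrected and correct p-values discussed above.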
Finally, to study whether the results of the tests are driven by wrongly ordered vignettes (leading to ties in the rankings that are typically not used when interpreting the nonparametric results; see Section 4), we re-estimated the parametric model by using only the observations in which the vignettes are ordered consistently, and recomputed the p-values for all partitions in Table 9. The p-values for this subsample are similar to those for the whole sample (see Table 8 of the online appendix). It therefore seems unlikely that wrongly ordered vignettes are the source of the misspecification of the parametric model.
8. Conclusion
In the literature, there are two ways to use anchoring vignettes: parametric models (the CHOPIT model and its extensions) and a nonparametric approach based on ranking vignette evaluations and self-assessments in subsamples (such as countries or age groups). We consider tests for misspecification of the parametric models, based on comparing data that are generated by the parametric model with the observed data. An attractive feature of our tests is that they compare exactly those features of the simulated and observed data that drive the nonparametric approach. In that sense, the tests have power in directions that matter: rejecting the null hypothesis implies that the rankings of self-assessments and vignette evaluations implied by the parametric model are inconsistent with the rankings in the raw data that are used for the nonparametric approach. We apply the tests to six health domains for the population aged 50 years and over in eight European countries, with mixed results: the standard CHOPIT model is always rejected, but an extension of it incorporating unobserved heterogeneity in the reporting scales is not rejected for several health domains and socioeconomic characteristics.
What does this imply for studies using anchoring vignettes? It does not imply that the parametric models should be replaced by the nonparametric method: as explained in Section 4, the nonparametric approach has limited applicability and gives results that are difficult to interpret when there are many ties. The latter seems a major reason for preferring parametric models: if vignettes are not ordered consistently by all respondents, the ties make it impossible to draw conclusions from the nonparametric comparisons. This problem does not arise in the parametric models, where idiosyncratic errors can explain any violation of the natural ordering of the vignette evaluations. Moreover, rejecting the specification of the standard CHOPIT model does not necessarily mean that this model does not help to correct for differences in reporting behaviour. In fact, the modest differences between predicted and actual rankings suggest that it does. Misspecification of the parametric models as such may not come as a big surprise: after all, such models are only simplified approximations to the data-generating process. The more important question is whether the misspecification matters for substantive conclusions.
For researchers working on anchoring vignette models, our findings motivate future work on parametric and semiparametric models that are more flexible than the CHOPIT model but share its advantages: models that can be used to construct counterfactual distributions in combination with many covariates, and that can deal with idiosyncratic errors, and the ties that result from them, in the same way as the CHOPIT model. As our results show, adding unobserved heterogeneity in the thresholds is one simple example of such an extension: with only one additional parameter, it reduces misspecification problems substantially. But many other extensions can be considered: error terms could be heteroscedastic, and systematic parts could be made more flexible by using interactions and polynomial expansions. Unobserved heterogeneity can be incorporated in a more flexible way, as in Heckman and Singer (1984), or error terms can follow more flexible, semi-nonparametric distributions, as in Gallant and Nychka (1987). These more flexible specifications can be tested by using the misspecification test in this paper. For applied researchers who do not want to develop new models, our results imply the recommendation to use the extended CHOPIT model considered here, and other (semi)parametric extensions once these have been developed and validated.
Model misspecification may be less important than the validity of the two identifying assumptions, response consistency and vignette equivalence. But our results also have implications for the research on testing these assumptions, which until now typically uses the CHOPIT model as a maintained assumption. They lead to the recommendation to use more flexible models than the CHOPIT model as a basis for these tests. With the current tests, it could be that the null hypothesis of (for example) response consistency is rejected not because of a lack of response consistency but because the parametric model in which the test is applied is misspecified. Using more flexible models as a basis will make the tests more robust.