Keywords:

  • Compound hierarchical ordered probit model;
  • Reporting bias;
  • Self-assessed health;
  • Specification testing

Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Parametric model
  5. 3. Misspecification tests
  6. 4. Non-parametric approach
  7. 5. Data
  8. 6. Simulations: how to choose the cells
  9. 7. Empirical results
  10. 8. Conclusion
  11. Acknowledgements
  12. References
  13. Appendix
  14. Supporting Information

Summary.  Comparing assessments on a subjective scale across countries or socio-economic groups is often hampered by differences in response scales across groups. Anchoring vignettes help to correct for such differences, either in parametric models (the compound hierarchical ordered probit (CHOPIT) model and extensions) or non-parametrically, comparing rankings of vignette ratings and self-assessments across groups. We construct specification tests of parametric models, comparing non-parametric rankings with rankings by using the parametric estimates. Applied to six domains of health, the test always rejects the standard CHOPIT model, but an extended CHOPIT model performs better. This implies a need for more flexible (parametric or semiparametric) models than the standard CHOPIT model.


1. Introduction

Socio-economic surveys often ask for ratings on a subjective ordinal scale. A typical example is a self-assessed health question on a five-point scale (from excellent to poor, for example). Answers to questions with a subjective scale may depend not only on the objective reality but also on how respondents interpret the response scale, i.e. on the respondents’ reporting behaviour. Usually, the analysis of these questions aims at comparing the objective reality across socio-economic groups or countries, so differences in reporting behaviour should be corrected for. To identify these differences, King et al. (2004) proposed the use of anchoring vignettes: short descriptions of hypothetical people or situations. Respondents are asked to evaluate one or more vignettes on the same subjective scale as is used to evaluate their own situation. Because the objective reality in a given vignette is the same for all respondents, systematic differences in vignette evaluations across respondents identify differences in reporting behaviour.

King et al. (2004) introduced a parametric model as well as a non-parametric method to use anchoring vignettes for comparing the distributions of the underlying objective reality of the phenomenon of interest in two or more countries or socio-economic groups. The parametric model is referred to as the compound hierarchical ordered probit (CHOPIT) model. Research using anchoring vignettes has grown rapidly in recent years. The CHOPIT model and parametric extensions are used in studies on health (Bago d'Uva et al., 2008a,b; Vonkova and Hullegie, 2011), healthcare responsiveness (Rice et al., 2010), work disability (Kapteyn et al., 2007), job satisfaction (Kristensen and Johansson, 2008), satisfaction with social contacts (Bonsang and Van Soest, 2012) and life satisfaction (Angelini et al., 2012). The CHOPIT model consists of ordered probit equations (for vignette evaluations and the assessment of the own situation) with thresholds that are common to all equations but, to account for differences in reporting behaviour, can vary with respondent characteristics.

The non-parametric approach has been used much less often; exceptions are King et al. (2004) and King and Wand (2007). This method essentially compares the distributions in different socio-economic groups of the rank of the respondent's self-evaluation among the same respondent's vignette evaluations. For example, suppose that one given vignette is evaluated by all respondents in groups A and B; suppose that almost everyone in group A evaluates their own health as better than that of the hypothetical vignette person (the benchmark), whereas in group B the majority evaluates the vignette person's health as better than their own. Then the non-parametric method immediately leads to the conclusion that group A is healthier than group B. This conclusion is still valid if the two groups use very different scales, since it is based on the comparison with the vignette evaluation that (by assumption) uses the same scale as the self-assessment. This method does not require any model or covariates (except to distinguish the two groups).

The non-parametric method relies on two assumptions: reporting behaviour of the respondents is the same in the self-assessments and the vignettes (‘response consistency’), and the objective reality of a vignette is perceived in the same way by all respondents (‘vignette equivalence’). These can be called ‘identifying assumptions’ in the sense that the interpretation of the non-parametric ranking comparison relies on them. They have been tested in recent studies, with mixed results (see Section 4 for some references); in this paper they are treated as maintained (identifying) assumptions. The parametric model requires additional assumptions: for example, that the objective reality can be modelled as a linear function of observed characteristics and an unobserved component, a specific functional form of the thresholds, and joint normality of the error terms.

In this paper, we compare the rankings that are implied by the parametric model with the non-parametric rankings that come directly from the raw data, using the χ2 diagnostic tests that were introduced in Andrews (1988). These can be seen as (mis)specification tests of the parametric model against non-parametric alternatives that lead to different rankings of the self-reports and vignette evaluations. Although many alternative specification tests for the parametric model can be considered, the advantage of our tests is that they have power in a direction that matters: they reject the parametric model if the misspecification implies that using the parametric model leads to biased conclusions concerning ranking comparisons across socio-economic groups.

We run the tests for six health domains for data on the population of ages 50 years and older in eight European countries, from the 2004 wave of the Survey of Health, Ageing and Retirement in Europe.

We find that the standard CHOPIT model is always rejected, but a simple one-parameter extension that allows for unobserved heterogeneity (which was used by Kapteyn et al. (2007), for example) is rejected for some health domains but not for others. This suggests that conclusions about comparisons across countries or socio-economic groups based on the standard CHOPIT model may be biased. It also implies that existing tests for vignette equivalence or response consistency that rely on the CHOPIT model may not be valid. The non-parametric method is generally not a viable alternative, since it cannot be used with many covariates and cannot produce counterfactual distributions of self-reported health with benchmark reporting scales. We therefore conclude that there is a need for future work on more flexible parametric or semiparametric models that generalize the CHOPIT model.

The remainder of this paper is organized as follows. Section 2 explains the parametric models. In Section 3, we introduce our diagnostic tests. Section 4 relates the tests to the non-parametric approach. Section 5 presents the data. Section 6 describes the results of Monte Carlo simulations that guide how to implement the tests given the size and nature of our data. Our main results are discussed in Section 7. Section 8 discusses the implications of our findings for research using anchoring vignettes.

2. Parametric model

We first describe the CHOPIT model (King et al., 2004)—commonly used in studies with anchoring vignettes—and a one-parameter extension of this model. For the sake of exposition we assume that self-assessments and vignettes concern one given health domain, as in our empirical application. (We model different domains completely separately; see Peracchi and Rossetti (2012) for modelling them jointly.) The response scale is a five-point scale from no problem (‘none’) in the given health domain to an extreme problem (‘extreme’; see Appendix A). The CHOPIT model consists of a self-assessment equation for the respondents’ evaluation of their own health (in the given domain) and a vignette equation for the evaluation(s) of the health of the vignette person(s). The self-assessment of respondent i is modelled as follows:

  Y*si = Xi βs + εsi,                                    (1)
  Ysi = j  if  τi^(j−1) < Y*si ≤ τi^j,  j = 1,…,J,       (2)
  τi^0 = −∞,  τi^J = +∞,                                 (3)
  τi^1 = Xi γ^1 + ui,                                    (4)
  τi^j = τi^(j−1) + exp(Xi γ^j),  j = 2,…,J−1.           (5)

Here Y*si is the latent health of respondent i in the given domain, modelled as the sum of a linear combination of explanatory variables Xi and an unobserved component εsi, reflecting unobserved heterogeneity and reporting error. The observed value of self-assessed health Ysi is equal to j (j ∈ {1,…,J}; in our application, J=5) if latent health is between thresholds τi^(j−1) and τi^j. The thresholds can vary with respondent characteristics Xi and, moreover, with unobserved characteristics ui. A large value of ui means that all thresholds of respondent i are relatively high, implying that the respondent does not easily evaluate health problems in the given domain as severe or extreme. This unobserved heterogeneity term is not included in the standard CHOPIT model of King et al. (2004) but was introduced as an extension by Kapteyn et al. (2007).

The evaluations of K vignettes v=1,…,K (K=3 in our application) by respondent i are modelled as follows:

  Y*vi = θv + εvi,                                       (6)
  Yvi = j  if  τi^(j−1) < Y*vi ≤ τi^j,  j = 1,…,J,       (7)

where Y*vi is the latent health of the hypothetical person described in vignette v in the given domain, modelled as the sum of a vignette-specific constant θv and an unobserved component εvi. θv does not vary across respondents since it is assumed that each vignette is interpreted in the same way by all respondents (vignette equivalence). Yvi is the reported evaluation of vignette v by respondent i on the same scale as is used for the self-assessments: Yvi is equal to j=1,…,J if the latent health Y*vi is between thresholds τi^(j−1) and τi^j. The assumption of response consistency implies that the thresholds are the same as for the self-assessments.

The error terms εsi and εvi and the unobserved heterogeneity term ui are assumed to be independent of each other and of the covariates Xi, with normal distributions that have mean 0 and variances σs^2, σv^2 and σu^2 respectively. By means of normalization, we impose βs,1=0 and σs=1. The standard way to estimate the standard or extended CHOPIT model is by maximum likelihood. Note that the standard CHOPIT model is the special case with σu^2=0.
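The data-generating process of equations (1)–(7) can be sketched in a few lines of code. This is a minimal illustration, not the authors' estimation code; the function name and the threshold parametrization (first threshold linear in Xi plus ui, subsequent thresholds shifted by exponential increments) follow the description above, but the interface is our own assumption.

```python
import numpy as np

def simulate_chopit(X, beta, gamma, theta, sigma_v=1.0, sigma_u=0.0, seed=0):
    """Simulate self-assessments and vignette ratings from the (extended)
    CHOPIT model. gamma is a (J-1) x p array of threshold coefficients;
    sigma_u = 0 gives the standard CHOPIT model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Jm1 = gamma.shape[0]                      # J - 1 interior thresholds
    u = rng.normal(0.0, sigma_u, n) if sigma_u > 0 else np.zeros(n)

    # Thresholds: tau^1 = X g^1 + u, tau^j = tau^{j-1} + exp(X g^j)
    tau = np.empty((n, Jm1))
    tau[:, 0] = X @ gamma[0] + u
    for j in range(1, Jm1):
        tau[:, j] = tau[:, j - 1] + np.exp(X @ gamma[j])

    def categorize(latent):
        # Y = j if tau^{j-1} < Y* <= tau^j (with tau^0 = -inf, tau^J = +inf)
        return 1 + (latent[:, None] > tau).sum(axis=1)

    y_self = categorize(X @ beta + rng.normal(0.0, 1.0, n))    # sigma_s = 1
    y_vign = [categorize(np.full(n, th) + rng.normal(0.0, sigma_v, n))
              for th in theta]
    return y_self, np.column_stack(y_vign)
```

Because the thresholds are common to the self-assessment and vignette equations, the simulated data satisfy response consistency by construction; setting sigma_u to zero reproduces the standard CHOPIT model.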

3. Misspecification tests

There are many ways to test the specification of a fully parametric model in general and of the standard or extended CHOPIT model in particular. For example, Lagrange multiplier tests can be performed against specific parametric extensions, such as heteroscedastic or non-normal errors (Chesher and Irish, 1987). Such tests will be powerful against specific alternatives but not in other directions. Since one of our main goals is to compare health or wellbeing across countries or socio-economic groups after purging self-assessments for response scale differences, we consider misspecification tests with power against alternatives that lead to different conclusions concerning such comparisons.

A general category of misspecification tests are the goodness-of-fit tests of Andrews (1988). They partition the product space of outcomes and regressors Y×X into C cells (usually by partitioning Y into MY cells and X into MX cells and taking all MY×MX products). Then the sample distribution over the cells is compared with the distribution that is generated by the estimated parametric model: for the given parameter estimates and for the given regressor values of each observation, the probability distribution of the dependent variable(s) is fully determined by the distribution of error terms and unobserved heterogeneity and the probabilities of each cell in the partition of Y can be computed. Averaging over all observations with regressor values in each given cell of the partition of X then gives the cell probabilities that are generated by the parametric model.

Under the null hypothesis that the parametric model is correctly specified, the sample distribution and the distribution that is generated by the model should be similar. If the parameters of the parametric model were known, this could be formalized with a Pearson χ2-test. Andrews (1988) showed that the test statistic can be adjusted to correct for the fact that parameters are estimated by using the same data. This test is in some sense similar to Bayesian posterior predictive checking, which also compares the data with simulated data based on the estimated model, and corrects for parameter uncertainty by averaging over the posterior (Rubin, 1984).

The appropriate test statistic is a quadratic form which asymptotically has a χ2-distribution under the null hypothesis that the parametric model is correctly specified. If, as in our case, the parametric model is estimated by maximum likelihood, Andrews (1988) showed that the test statistic can be obtained from an auxiliary ordinary least squares regression of an n-dimensional vector 1=(1,1,…,1), where n is the number of observations, on two groups of regressors: first, for each of the C cells, an n-dimensional vector with, for each observation, the deviation between realizations (1 if the observation is in the given cell; 0 otherwise) and the cell probability according to the model (given the values of X for that observation). This gives an n×C matrix A. The second group is, for all (say L) parameters, the vector of partial derivatives with respect to that parameter of the log-likelihood contributions for all n observations (the ‘scores’), giving an n×L matrix B. These are added to correct for the fact that the parameters are estimated by using the same sample. Under the null of no misspecification, the test statistic T is n times the R2 of this regression and is asymptotically χ2 distributed with degrees of freedom df equal to the rank of the matrix A. In the usual case of C=MY×MX product cells, df=(MY−1)MX. With H=[A|B], the complete n×(C+L) regressor matrix, we have T=1H(H′H)+H′1′, where Z+ is the Moore–Penrose inverse of matrix Z. See Andrews (1988), page 154.

To give some intuition for this test, note that a perfect fit in the sense that, for each cell, the average simulated cell probability equals the observed fraction of observations in that cell, means that 1A=0. Moreover, the first-order conditions of maximum likelihood always imply that 1B=0. A perfect fit therefore implies T=0 and not rejecting the null. The explained sum of squares 1A(A′A)+A′1′ can be seen as a measure of how well observed cell fractions are reproduced by the model. However, since the test statistic is based on estimated parameters, it does not have a χ2-distribution. As shown by Andrews (1988), adding the scores in B raises the explained sum of squares in the case of an imperfect fit and leads to a test statistic with a χ2-distribution.
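The auxiliary regression described above can be sketched directly, given the matrix A of cell indicators minus model-implied cell probabilities and the matrix B of scores. This is a minimal illustration of the mechanics (the function name is ours), assuming A and B have already been computed from an estimated model:

```python
import numpy as np
from scipy import stats

def andrews_test(cell_ind, cell_prob, scores):
    """Andrews (1988) chi-squared goodness-of-fit test.

    cell_ind  : (n, C) 0/1 matrix; 1 if observation i falls in cell c
    cell_prob : (n, C) model-implied cell probabilities given X_i
    scores    : (n, L) per-observation score vectors at the ML estimates
    Returns (T, df, p); T = n * R^2 of regressing a vector of ones on
    H = [A | B], and df = rank(A).
    """
    A = cell_ind - cell_prob
    H = np.hstack([A, scores])
    one = np.ones(A.shape[0])
    # Explained sum of squares via the Moore-Penrose inverse; total SS = n,
    # so T = 1 H (H'H)^+ H' 1' equals n times the R^2 of the regression.
    T = one @ H @ np.linalg.pinv(H.T @ H) @ H.T @ one
    df = np.linalg.matrix_rank(A)
    return T, df, stats.chi2.sf(T, df)
```

Note that the rows of A sum to zero whenever the cell probabilities sum to one for each observation, which is why df is the rank of A rather than the number of cells C.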

Different partitions of Y and X give different tests, with power in different directions. We shall use the partition of Y that is the basis for the non-parametric approach described below in Section 4. In a sense, this partition ‘matters’ since this kind of comparison is one of the main goals of using anchoring vignettes. Since the test is asymptotic, the number of observations must be large to guarantee that the size of the test is approximately equal to the asymptotic size of 5%. In practice, this means that we will have to merge cells to guarantee that the number of observations in each cell is reasonably large. We performed simulations to compute the actual size of the test for various partitions and our choice of cells is based on these simulation outcomes; see Section 6.

4. Non-parametric approach

The non-parametric approach is explained by King et al. (2004) and King and Wand (2007). It compares the distribution of the underlying objective reality purged of differences in reporting behaviour across countries or socio-economic groups. This is done by comparing where the self-assessments are placed on the scale fixed by the vignette evaluations in each country or group. The numerical example in Table 1 with only one vignette illustrates this.

Table 1. Distribution of self-assessments and vignette evaluations in countries A and B

                        Distributions (%) by vignette evaluation
Self-assessment         1 (none)  2 (mild)  3 (moderate)  4 (severe)  5 (extreme)   All

Country A
1 (no problem)              4         4          4             4           4         20
2 (mild problem)            4         4          4             4           4         20
3 (moderate problem)        4         4          4             4           4         20
4 (serious problem)         4         4          4             4           4         20
5 (extreme problem)         4         4          4             4           4         20
All                        20        20         20            20          20        100

Country B
1 (no problem)             16         4          4             4           0         28
2 (mild problem)            8         4          4             4           0         20
3 (moderate problem)        8         4          4             4           0         20
4 (serious problem)         8         4          4             4           0         20
5 (extreme problem)         0         4          4             4           0         12
All                        40        20         20            20           0        100

These cross-tabulations give the joint distributions of self-assessments and vignette evaluations of health problems in a given domain in countries A and B. Looking at the (marginal) distribution of the self-assessments only (the final column) would lead to the conclusion that respondents in country B face fewer health problems than respondents in country A—assuming that they use the same response scales. The difference in the marginal distribution of the vignette evaluations (the final rows), however, shows that this assumption is incorrect: respondents in country A evaluate a given health problem as more problematic, on average, and this may be an alternative explanation for the cross-country difference in self-assessments.

The non-parametric approach simply compares the relative distributions, i.e. how the self-assessments rank compared with the vignette evaluations in the two countries. The relative rankings RR are as in Table 2.

Table 2. Relative ranking RR of self-assessments and vignette evaluations by country

RR                                         Country A (%)   Country B (%)
1: self-assessment < vignette evaluation         40              24
2: self-assessment = vignette evaluation         20              28
3: self-assessment > vignette evaluation         40              48

The distribution of RR in country A is stochastically dominated by that in B, showing that, accounting for differences in response behaviour, the health problems in B are more serious than in A. This is the reverse of the conclusion based on self-assessments only.
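The relative rankings in Table 2 can be recovered directly from the joint distributions in Table 1: RR=1 corresponds to the cells above the diagonal of the cross-tabulation, RR=2 to the diagonal and RR=3 to the cells below it. A minimal sketch (the function name is ours):

```python
import numpy as np

def relative_ranking(joint):
    """joint[i, j] = % of respondents with self-assessment i+1 and
    vignette evaluation j+1. Returns the shares for RR = 1 (self <
    vignette), RR = 2 (equal) and RR = 3 (self > vignette)."""
    joint = np.asarray(joint, dtype=float)
    rr1 = float(np.triu(joint, 1).sum())    # self-assessment < vignette evaluation
    rr2 = float(np.trace(joint))            # self-assessment = vignette evaluation
    rr3 = float(np.tril(joint, -1).sum())   # self-assessment > vignette evaluation
    return rr1, rr2, rr3

# Joint distributions from Table 1 (rows: self-assessment; columns: vignette)
country_a = np.full((5, 5), 4.0)
country_b = np.array([[16, 4, 4, 4, 0],
                      [ 8, 4, 4, 4, 0],
                      [ 8, 4, 4, 4, 0],
                      [ 8, 4, 4, 4, 0],
                      [ 0, 4, 4, 4, 0]], dtype=float)

print(relative_ranking(country_a))  # (40.0, 20.0, 40.0)
print(relative_ranking(country_b))  # (24.0, 28.0, 48.0)
```

The output reproduces the RR shares reported in Table 2.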

This example has only one vignette. King et al. (2004) considered the case with K>1 vignettes, assuming that the evaluations of the vignettes are ranked in the same way by each respondent and that no respondent gives the same rating to any two vignettes. In that case a self-assessment can fit in any of 2K+1 positions in the given ranking of the vignette evaluations. The non-parametric approach then compares the distributions over the 2K+1 positions across countries or groups. See King et al. (2004), pages 195–196, for an empirical illustration.
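Under these assumptions (strictly ordered vignette evaluations, no ties with the self-assessment excluded), the position of a self-assessment among K vignette evaluations can be computed as follows; this is a sketch of the bookkeeping, not code from the paper, and the function name is ours:

```python
def position(y_self, y_vign):
    """Position of the self-assessment among K strictly ordered vignette
    evaluations: a value in 1, ..., 2K + 1. Odd positions mean the
    self-assessment falls strictly between (or outside) the vignette
    ratings; even positions mean it ties with one vignette rating."""
    pos = 1
    for yv in sorted(y_vign):
        if y_self > yv:
            pos += 2            # strictly above this vignette rating
        elif y_self == yv:
            return pos + 1      # tied with this vignette rating
    return pos
```

For K=3, positions 1 and 7 correspond to a self-assessment better than, respectively worse than, all three vignette persons, matching situations 1 and 7 in the discussion that follows.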

King and Wand (2007) discussed the more realistic case with ties, i.e. situations where a respondent assigns the same rating to several vignettes, or where a respondent rates the vignettes in a way that does not respect the ranking of the vignette evaluations that is used by the majority (which is often the natural ranking, given the content of the vignettes). For K=3 vignettes, Table 3 presents a complete listing of all possible rankings of vignettes and self-assessments, generalizing Table 1 in King and Wand (2007). The natural ordering of the vignette ratings is Y1 ≤ Y2 ≤ Y3. The seven situations in the upper left-hand panel respect this ordering without ties; the remainder looks at ties. Some of these are non-problematic since all that matters is the position of the self-assessment Ys. For example, the situation Ys < Y1 = Y2 < Y3 puts Ys in the same position as Ys < Y1 < Y2 < Y3. For the non-parametric comparison, the two will be merged, as indicated by assigning 1 to both of them (column C). But, in other situations, the position of Ys is ambiguous. For example, take the case Y3 < Ys < Y1 = Y2. If Y3 is reported correctly (and Y1 and Y2 are misreported), we are in situation 7 (Ys is worse than all vignettes) but, if Y1 and Y2 are reported correctly (and Y3 is misreported), we are in situation 1 (Ys better than all vignettes). We therefore cannot say anything about the true position of Ys and classify this case as 1-7.

Table 3. Rankings of self-assessment Ys and vignette evaluations Y1, Y2 and Y3†

[Complete listing of the possible rankings of Ys, Y1, Y2 and Y3, each with its cell assignment C and its C-label]

  †There are 19 different cells according to the non-parametric approach. We define their labels (column ‘C-label’) as follows: for C=1,…,7, labels remain the same. In the case of interval ties C=1-4, 1-5, 1-6, 1-7, 2-4, 2-5, 2-6, 2-7, 3-6, 3-7, 4-6, 4-7, labels are defined as 8, 9, …, 19 respectively.

The non-parametric method categorizes observations into specific cells. The 19 labels in Table 3 define a partition of the set Y of possible realizations of the observed dependent variables (self-assessments and vignette evaluations) into 19 cells. If the population consists of countries A and B and X is a country dummy (the only ‘regressor’), then the two countries form a partition of the set of all values of X. The non-parametric comparison then partitions Y×X into 19×2=38 cells. Ideally, the number of observations in the 12 cells other than those labelled 1,2,…,7 should be so small that these cells can be discarded.

To interpret the non-parametric results, we need the assumptions of response consistency and vignette equivalence that have already been referred to above. These are the identifying assumptions in this framework and we consider them as maintained hypotheses; tests of response consistency and vignette equivalence are discussed elsewhere and are not the topic of this paper. See, for example, van Soest et al. (2011) or Datta Gupta et al. (2010) for tests of response consistency using additional information in the form of a measure on an objective scale; see Peracchi and Rossetti (2013) for a joint test of the overidentifying restrictions that are implied by response consistency and vignette equivalence; and see Bago d'Uva et al. (2011) or Corrado and Weeks (2010), who tested response consistency conditional on vignette equivalence by testing the joint significance of covariates added to the equation for vignette responses.

Finally, we want to emphasize that the non-parametric approach has limited applicability and cannot replace parametric models. Unlike the CHOPIT model or its parametric extensions, the non-parametric approach cannot deal with many covariates; nor can it be used to produce counterfactual distributions of self-reported health with benchmark reporting scales (as in, for example, Kapteyn et al. (2007)). The non-parametric approach is useful only for making comparisons across a few socio-economic groups or (groups of) countries, and only if such a comparison is not hampered by ties that make it very difficult or impossible to interpret the non-parametric results.

5. Data

We use data from the Survey of Health, Ageing and Retirement in Europe collected in 2004. The survey is a broad socio-economic survey among the population of ages 50 years and older and their spouses in 11 European countries; see Börsch-Supan and Jürges (2005) for details on the design and set-up. All respondents first had a personal interview and were then asked to complete a short pencil-and-paper questionnaire. In eight countries (Belgium, France, Germany, Greece, Italy, Netherlands, Spain and Sweden), random subsamples were given an additional questionnaire with self-assessments and vignettes on health and work disability. Here we focus on the health questions, which were also used by, for example, Lardjane and Dourgnon (2007) and Bago d'Uva et al. (2008a). Self-assessments and three vignettes were collected for six health domains—breathing, concentration (and memory skills), depression, mobility, bodily pains and sleeping. The wordings of these questions are given in Appendix A. All questions use the same five-point scale: none, mild, moderate, severe or extreme. The three vignettes in each domain are ordered, with one vignette (labelled k=1) describing a mild health problem, the second describing a worse problem (k=2) and the third describing the most severe health problem (k=3). This order is used as the natural order in the non-parametric approach. The vignette subsample has 4544 respondents in total. Owing to missing observations, we have about 4370 respondents for each domain. Precise sample sizes and descriptive statistics for self-assessments and vignettes are presented in Table 4. For each domain, most respondents rate their own health problems as none or mild. Severe or extreme health problems were reported by about 6.5% of the respondents, on average across the domains (from 3.57% for breathing to 9.23% for sleep). The majority reported no problem with mobility or breathing but, for the other domains, the none answers are a minority. 
In particular, pain problems are quite common, with less than a third reporting none.

Table 4. Distributions of self-assessments and vignette evaluations†

             Distributions (%) for breathing      Distributions (%) for concentration
Evaluation    Self     v1     v2     v3            Self     v1     v2     v3
None         64.58  10.81   2.29   2.45           43.98  22.15   5.25   2.01
Mild         22.23  24.11   5.15   2.22           35.17  48.63  27.08   8.90
Moderate      9.62  38.00  19.76   8.59           16.22  22.79  44.37  29.79
Severe        3.04  24.08  52.20  44.25            4.24   6.09  20.73  47.33
Extreme       0.53   3.00  20.60  42.49            0.39   0.34   2.58  11.98

             Distributions (%) for depression     Distributions (%) for mobility
None         49.51   6.45   2.31   2.20           58.39   9.64   2.31   1.55
Mild         28.70  44.17  13.44   2.59           22.17  34.73  11.83   5.91
Moderate     14.97  36.19  45.75  10.80           13.04  42.84  38.80  27.49
Severe        5.29  11.56  33.53  42.57            5.23  11.97  40.37  48.80
Extreme       1.53   1.63   4.97  41.84            1.16   0.82   6.69  16.24

             Distributions (%) for pain           Distributions (%) for sleep
None         32.28  15.59   2.31   1.12           42.68   2.65   1.92   1.87
Mild         35.81  56.96  18.06   5.31           28.06  21.50   9.73   7.31
Moderate     23.10  22.09  50.76  26.10           20.04  47.98  29.02  27.00
Severe        7.14   4.78  25.71  48.63            7.36  24.03  42.49  41.70
Extreme       1.67   0.57   3.16  18.84            1.87   3.84  16.84  22.12

  †Self is self-assessment; v1, v2 and v3 are vignettes 1, 2 and 3 respectively. The size of the vignette subsample of the sample from the Survey of Health, Ageing and Retirement in Europe is 4544. We work with around 4370 respondents for each domain (4368 for breathing, 4384 for concentration, 4369 for depression, 4379 for mobility, 4368 for pain and 4377 for sleep).

The vignette evaluations reflect the level of the health problems of the hypothetical people in the vignettes. As expected, the person in the third vignette in each domain was typically evaluated as least healthy, followed by the person in the second vignette.

The share of respondents with vignette ratings in the natural order (Y1 ≤ Y2 ≤ Y3) is about 82% (89.0% for pain, 87.8% for depression, 84.2% for concentration, 82.6% for breathing, 77.0% for mobility and 68.9% for sleep). About 16% of the respondents have interval ties (8.3% for breathing, 13.5% for depression, 14.6% for mobility, 17.5% for pain, 20.4% for sleep and 21.2% for concentration).

For the parametric model in Section 2, we use the background variables that are reported in Table 5. The average age of the respondents is 63 years. We distinguish three education levels: 35% of the sample obtained low education (international standard classification of education levels 0 and 1), 45% intermediate education (international standard classification of education levels 2 and 3) and 20% high education (international standard classification of education levels 4, 5 or 6). Most respondents are women (55.6%) and do not live alone (74.4%).

Table 5. Background variables (mean (age) or percentage (other variables))†

Country        Value     Variable                   Value
Belgium        12.48     male                       44.43
France         19.48     education low              35.54
Germany        11.18     education middle           44.60
Greece         15.85     education high             19.86
Italy           9.79     age—mean                   63.06
Netherlands    11.84     age—standard deviation     10.01
Spain          10.21     alone                      25.58
Sweden          9.18     not alone                  74.42

  †The descriptive statistics are given for the 4544 respondents in the vignette subsample of the Survey of Health, Ageing and Retirement in Europe 2004 study. education middle corresponds to international standard classification of education levels 2 and 3; education high to levels 4, 5 and 6. not alone is based on the observed marital status of the respondent; it corresponds to the categories married and living together with spouse, or living together in a registered partnership. alone corresponds to all other categories.

6. Simulations: how to choose the cells

Our tests rely on asymptotic theory keeping the number of cells fixed, with the number of observations going to ∞ (see Andrews (1988)). As a consequence, the finite sample properties of the test may be poor if some cells have few observations. The same problem arises with the classical Pearson χ2-test, where a common rule of thumb is that no more than 20% of the expected cell counts should be less than 5 and all expected counts should be at least 1 (Yates et al. (1999), page 734). Such a rule of thumb is not available for the Andrews test. We performed some Monte Carlo simulations to determine whether, for several given choices of the cells, the actual size of the tests in our finite sample of about 4400 observations approximates the asymptotic size of 5%.
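
The rule of thumb for the Pearson test can be stated as a short check. The sketch below is illustrative only; the function name and example counts are ours, not the paper's:

```python
def pearson_rule_of_thumb(expected_counts):
    """Check the common rule of thumb for the Pearson chi-squared test:
    at most 20% of expected cell counts below 5, and none below 1."""
    n = len(expected_counts)
    below_5 = sum(1 for e in expected_counts if e < 5)
    below_1 = sum(1 for e in expected_counts if e < 1)
    return below_5 <= 0.2 * n and below_1 == 0

# 19 cells of which 15 have expected counts under 5 (as in our data) fail the rule.
print(pearson_rule_of_thumb([200, 150, 80, 40] + [3] * 15))  # False
print(pearson_rule_of_thumb([50, 40, 30, 25, 20, 15]))       # True
```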

First, we estimated the CHOPIT model for one health domain—pain—using the actual data on the 4368 complete observations. The estimates are presented in Table 1 of the on-line appendix. Using these estimates and the actual values of the covariates X, we generated 300 new data sets, all with the same covariates as the real data, but with different values of the dependent variables (three vignette evaluations and one self-assessment for each observation), constructed from the values of the covariates and independent draws of the error terms in the CHOPIT model. These data sets are all generated by using the CHOPIT model, so they satisfy the null hypothesis of no misspecification. For each given choice of cells, we then perform the Andrews test on all 300 data sets, using a nominal size of 5%. The fraction of times that the null hypothesis is rejected is approximately equal to the actual size of the test for the given sample size and choice of cells. The difference between actual and nominal (5%) size is due to the deviation between the finite sample and the asymptotic distribution of the test statistic.
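
The mechanics of this size simulation can be sketched in miniature. The code below substitutes a plain Pearson goodness-of-fit test and a fixed multinomial null for the CHOPIT model and the Andrews statistic, so it only illustrates how the actual size is estimated; the cell probabilities `p0` are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = np.array([0.5, 0.3, 0.15, 0.05])   # stand-in for model-implied cell probabilities
n_obs, n_reps = 4368, 300                # sample size and number of simulated data sets
crit = 7.8147                            # chi-squared 5% critical value, 3 degrees of freedom

rejections = 0
for _ in range(n_reps):
    counts = rng.multinomial(n_obs, p0)              # one data set generated under H0
    expected = n_obs * p0
    stat = ((counts - expected) ** 2 / expected).sum()  # Pearson statistic
    rejections += stat > crit

size_hat = rejections / n_reps
print(f"actual size ≈ {size_hat:.3f}")   # close to the nominal 5% when cells are large
```

The fraction of rejections across the simulated data sets estimates the actual size; comparing it with the nominal 5% reveals finite sample distortions.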

The results are presented in Table 6. The final column (D10, 19 cells) uses the 19 cells of Y that form the basis of the non-parametric approach in Section 4, combined with several partitions of X. The actual size of the test is much larger than the nominal size (5%): the tests heavily overreject. The reason is that cell sizes are often too small: even without partitioning X, 15 of the 19 cells have expected cell size less than 5, so the rule of thumb for the simple Pearson χ2 goodness-of-fit test is not satisfied at all (see Table 2 of the on-line appendix). Cells corresponding to ties where the natural order of the evaluations of the three vignettes is not respected are particularly small (which supports the quality of the data). The problem of small cells grows even worse if we further split up the cells by partitioning X.

Table 6. Simulation results: percentage of rejections of the null hypothesis at significance level 5%†

Variable X    Rejections (%) for the following partitions:
              D1      D2      D3      D4      D5      D6      D7      D8      D9       D10
              3 cells 4 cells 5 cells 5 cells 6 cells 7 cells 8 cells 9 cells 10 cells 19 cells

Partitions based on Y only
—             8.33    7.33    7.67    7.33    7.33    10.00   10.67   8.00    8.67     25.33

Partitions based on Y×X
sex           7.67    6.00    6.67    6.33    5.33    6.33    5.67    4.67    6.67     70.33
country       7.67    6.67    6.67    6.67    6.67    8.33    9.67    8.67    9.33     64.67
alone         8.67    6.33    6.67    7.00    6.33    9.00    9.00    9.33    10.33    77.67
education     8.33    7.67    9.00    8.67    8.00    10.00   8.67    10.00   11.33    96.67
age           6.33    7.00    6.67    7.00    6.67    9.33    9.33    10.33   11.00    93.00

†How the simulations are carried out and the p-values are obtained is explained in Section 6. The partitions D1–D10 are defined in Table 7. For partitions Y×X, countries are divided into two groups—southern Europe (Greece, Spain and Italy) and the remaining countries (Belgium, France, Germany, the Netherlands and Sweden), age is categorized into younger than 56, 56–65 and older than 65 years and education is categorized into low, middle and high.

To avoid the problem of small cells, we can partition Y into fewer cells. Nine natural ways of doing this are presented in Table 7. The resulting actual sizes of the tests using these partitions of Y, combined with various partitions of X, are given in the remaining columns of Table 6. For example, the column ‘D1, 3 cells’ partitions Y into three categories: rankings where the self-assessment unambiguously indicates fewer pain problems than all three vignettes, rankings where the self-assessment is rated the same as the vignette with the fewest pain problems, and all other rankings. The first two are by far the most frequent rankings in the data (34% and 24% of the observations). The column ‘D9, 10 cells’ takes the seven rankings that respect the natural vignette ordering as separate cells and merges the remaining 12 cells of Y into three. This is already enough to guarantee that the rule of thumb that at most 20% of all cells have expected cell size less than 5 is satisfied. The other columns are intermediate cases where Y is partitioned into 4–9 cells. When Y is partitioned into at most six cells, the expected cell size is always larger than 10; in the other cases, the expected cell size is often between 5 and 10, particularly if X is partitioned into three cells by using education.

Table 7. Description of how the 19 original cells are merged†

Division   Number of larger cells   Merging of the 19 original cells
D1          3   {1},{2},{3,…,19}
D2          4   {1},{2},{3,4,8,…,14},{5,6,7,15,…,19}
D3          5   {1},{2},{3,4},{5,6,7},{8,…,19}
D4          5   {1,2},{3,4},{5,6,7},{8,…,15},{16,…,19}
D5          6   {1},{2},{3,4},{5,6,7},{8,…,15},{16,…,19}
D6          7   {1},{2},{3,4},{5,6,7},{8,…,11},{12,…,15},{16,…,19}
D7          8   {1},{2},{3},{4},{5,6,7},{8,…,11},{12,…,15},{16,…,19}
D8          9   {1},{2},{3},{4},{5,6},{7},{8,…,11},{12,…,15},{16,…,19}
D9         10   {1},{2},{3},{4},{5},{6},{7},{8,…,11},{12,…,15},{16,…,19}
D10        19   {1},…,{19}

†For the description, the labels of the 19 original cells are used (see Table 3).
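
Using the cell labels of Table 3, the first five partitions can be written down directly and applied to observed cell counts. This is a sketch; the data structures and function are ours:

```python
# Partitions D1–D5 from Table 7, as lists of groups of the 19 original cell labels.
PARTITIONS = {
    "D1": [[1], [2], list(range(3, 20))],
    "D2": [[1], [2], [3, 4] + list(range(8, 15)), [5, 6, 7] + list(range(15, 20))],
    "D3": [[1], [2], [3, 4], [5, 6, 7], list(range(8, 20))],
    "D4": [[1, 2], [3, 4], [5, 6, 7], list(range(8, 16)), list(range(16, 20))],
    "D5": [[1], [2], [3, 4], [5, 6, 7], list(range(8, 16)), list(range(16, 20))],
}

def merge_counts(cell_counts, partition):
    """Sum counts over the original cells contained in each merged cell.
    cell_counts maps an original cell label (1..19) to its count."""
    return [sum(cell_counts[c] for c in group) for group in partition]

# Sanity check: every partition covers each of the 19 original cells exactly once.
for name, groups in PARTITIONS.items():
    assert sorted(c for g in groups for c in g) == list(range(1, 20)), name
```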

The results show that merging cells helps to bring the actual size of the tests closer to the nominal size of 5%. The actual size varies between 4.7% (Y partitioned into nine cells; X partitioned by using gender) and 11.3% (Y partitioned into 10 cells; X partitioned by using education). The test still tends to reject too often: the actual size is almost always larger than 5% and increases somewhat when the number of cells in Y exceeds 6. It seems safe to conclude that we can perform the tests by using the given partitions of Y into 3–6 cells, though we should take into account that the actual size of a test with nominal size 5% varies from around 5% to about 10%.

7. Empirical results


This section presents the test results for each domain and for various partitions of Y and X. For each health domain, we first estimated the parametric model of Section 2. As an example, the parameter estimates for concentration are presented in Table 8; estimates for the other domains are available on request. The first column shows how, according to the parametric model, problems with concentration are associated with individual characteristics and country dummy variables, keeping response scales constant. Most results here are plausible and confirm findings in the literature (e.g. Bago d'Uva et al. (2008a)).

Table 8. Estimates of the parametric model for concentration (t-values in parentheses)

Variable          βs                   γ1                   γ2                   γ3                   γ4
constant          0                    −0.1848 (−3.9786)    0.2037 (5.5851)      0.1117 (2.8092)      0.1820 (3.1522)
Belgium           0.0362 (0.5715)      −0.0961 (−2.0688)    0.1161 (3.0912)      −0.1385 (−3.4545)    −0.0849 (−1.3969)
Germany           0.1044 (1.5750)      0.2324 (5.0074)      −0.1477 (−3.6080)    −0.2057 (−4.8098)    −0.0913 (−1.4099)
Greece            0.0489 (0.8311)      0.4653 (11.6894)     −0.2131 (−5.6407)    −0.3526 (−9.0218)    −0.1239 (−2.1732)
Italy             0.1160 (1.7402)      0.2563 (5.6073)      −0.1153 (−2.7832)    −0.2790 (−6.6135)    −0.0896 (−1.3732)
Netherlands       −0.2155 (−3.2028)    −0.0843 (−1.7336)    0.0843 (2.0973)      −0.2964 (−6.5579)    −0.3917 (−6.4416)
Spain             −0.1173 (−2.0938)    0.0171 (0.3475)      −0.2393 (−5.2428)    −0.1450 (−3.8082)    0.2399 (4.2768)
Sweden            −0.8136 (−11.3444)   −0.4506 (−6.6301)    −0.4259 (−6.5886)    −0.3165 (−6.0508)    0.1623 (3.2131)
male              −0.0819 (−2.1414)    0.0560 (2.2164)      −0.0346 (−1.5160)    0.0076 (0.3226)      −0.0459 (−1.5449)
age 55−           −0.0145 (−0.3038)    0.0457 (1.4469)      −0.0341 (−1.1783)    −0.0448 (−1.4627)    0.0024 (0.0615)
age 66−75         0.1942 (3.9206)      −0.0862 (−2.6178)    0.0162 (0.5655)      0.0764 (2.6049)      0.0434 (1.1049)
age 76+           0.4663 (7.4514)      −0.0352 (−0.8465)    −0.0236 (−0.5972)    0.0371 (0.9748)      0.0849 (1.7471)
education low     0.2557 (6.2358)      0.0807 (2.6240)      −0.1078 (−3.8602)    0.0121 (0.4370)      −0.0562 (−1.4797)
education high    −0.1710 (−3.1905)    0.0046 (0.1325)      −0.0381 (−1.2653)    −0.0771 (−2.2560)    −0.0066 (−0.1679)
alone             0.1622 (3.7273)      0.0302 (1.0282)      −0.0396 (−1.4765)    −0.0065 (−0.2473)    0.0648 (1.8565)

                  Coefficient    Standard error
inline image      0.5436         0.0250
inline image      1.3237         0.0237
inline image      2.0452         0.0294
σs                1
σu                0.4030         0.0115
σv                0.6968         0.0151

The other columns concern the threshold parameters. Many variables are significant, implying that not accounting for differences in reporting behaviour would lead to biased estimates of the parameters of main interest in the first column. The vignette dummy variables in the bottom panel have the expected ranking, since the first vignette describes the mildest problem, etc. Finally, the standard deviation of the unobserved heterogeneity term is quite precisely estimated, with 95% confidence interval [0.380, 0.426]. This suggests that extending the CHOPIT model with unobserved heterogeneity is useful, even though the role of this unobserved heterogeneity term is smaller than the roles of the noise terms ɛsi and ɛvi,v=1,2,3.

For each misspecification test characterized by a different partition of Y×X, the cell probabilities according to the parametric model were computed (numerically), using the given values of the regressors for all observations in the sample and the estimates of the parametric model. Each test compares such a distribution with the corresponding distribution in the raw data. We performed many tests for different partitions and health domains and two model specifications, and the test statistics are often not independent of each other since they rely on the same data (with the same or correlated dependent variables). The results should be interpreted with care because of the issue of multiple-hypothesis testing—the tests should be interpreted in isolation and different tests cannot be combined into a joint conclusion about some common hypothesis.
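
The comparison underlying each test can be illustrated with a Pearson-type discrepancy between observed and model-implied cell frequencies. This is a simplified sketch: the actual Andrews statistic additionally corrects for parameter estimation via the likelihood scores (the matrix B in Section 3), which this version omits. The numbers are the breathing shares from Table 10:

```python
import numpy as np

def pearson_discrepancy(observed_counts, predicted_probs):
    """Pearson-type statistic comparing observed cell counts with the cell
    probabilities implied by a fitted model. Omits the Andrews (1988)
    correction for estimated parameters."""
    observed = np.asarray(observed_counts, dtype=float)
    n = observed.sum()
    expected = n * np.asarray(predicted_probs, dtype=float)
    return ((observed - expected) ** 2 / expected).sum()

# Observed versus predicted shares for breathing, partition D2 (Table 10),
# scaled to the 4368 complete observations.
obs = np.array([0.7109, 0.1374, 0.1213, 0.0304]) * 4368
pred = [0.7212, 0.1460, 0.0973, 0.0355]
stat = pearson_discrepancy(obs, pred)
print(round(stat, 1))  # far above the 5% critical value of 7.81 for 3 degrees of freedom
```

Even visually modest gaps between the two distributions translate into a large statistic at this sample size, which is why the tests can reject firmly.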

The test results for the standard CHOPIT model are easy to summarize (and do not require a table): the null hypothesis of a correctly specified model is always rejected at the 5% or even the 1% level, for all health domains and for all partitions of Y×X. The p-values for the CHOPIT model extended with unobserved heterogeneity in the thresholds show more variation and are presented in Table 9.

Table 9. Goodness of fit for all health domains†

                 p-values for breathing             p-values for concentration
                 D1     D2     D3     D4     D5     D1     D2     D3     D4     D5
Y only           r      r      r      r      r      0.106  0.212  0.187  0.029  0.030
Y×sex            r      r      r      r      r      0.172  0.336  0.210  0.085  0.068
Y×country        r      r      r      r      r      r      r      r      0.011  r
Y×alone          r      r      r      r      r      0.027  0.073  0.122  0.119  0.025
Y×education      r      r      r      r      r      r      r      r      r      r
Y×age            r      r      r      r      r      0.099  0.268  0.074  0.007  0.014

                 p-values for depression            p-values for mobility
Y only           0.050  0.008  0.003  r      r      r      r      r      r      r
Y×sex            r      r      r      r      r      r      0.002  0.001  r      r
Y×country        r      r      r      r      r      r      r      r      0.011  r
Y×alone          0.139  0.024  0.013  r      r      r      r      r      r      r
Y×education      r      r      r      r      r      0.001  0.004  r      r      r
Y×age            0.012  0.006  0.004  0.002  r      0.003  0.007  0.001  r      r

                 p-values for pain                  p-values for sleep
Y only           0.168  0.304  0.009  0.003  0.003  r      r      r      r      r
Y×sex            0.064  0.176  r      r      r      r      r      r      r      r
Y×country        r      r      r      r      r      r      r      r      r      r
Y×alone          0.030  0.051  0.001  r      r      r      r      r      r      r
Y×education      r      r      r      r      r      r      r      r      r      r
Y×age            0.364  0.524  0.002  r      r      r      r      r      r      r

†‘r’ indicates that the null hypothesis that the parametric model is correctly specified is rejected at a significance level below 0.001. All results are for the extended CHOPIT model with unobserved heterogeneity. For the definition of D1,…,D5 see Table 7.

First consider the tests partitioning Y only. For concentration, the null hypothesis is not rejected at the 1% level for any partition of Y, and not at the 5% level for the three partitions with the smallest numbers of cells. Taking into account that the simulations have shown that the tests tend to overreject, even the p-values of 0.029 and 0.030 seem to be supportive of this parametric model. For breathing, mobility and sleep, the parametric specification is always rejected, no matter which partition of Y is used. For the other two domains, the results are mixed: the null is not rejected at the 5% level for a partition into only three cells (depression) or into three or four cells (pain), but it is rejected at the 1% level for the partitions with more cells.

For the partition of Y into four cells, the observed and predicted probabilities on which the tests rely are presented in Table 10. For pain, concentration and depression the maximum difference is about 1 percentage point; for mobility and sleep it increases to about 1.5 and for breathing to 2.4 percentage points. Comparing the two distributions suggests that the differences are modest even for breathing, where the null is firmly rejected. Because of the large sample size, however, modest differences are apparently sufficient to reject the null hypothesis.

Table 10. Observed and predicted distributions (%) (cells constructed by using partition D2 of Y only)†

D2 cells            Breathing          Concentration      Depression
                    Obs.     Pred.     Obs.     Pred.     Obs.     Pred.
{1}                 71.09    72.12     38.12    38.84     57.73    58.60
{2}                 13.74    14.60     23.56    22.53     17.01    16.64
{3,4,8,…,14}        12.13     9.73     26.32    26.58     18.59    17.39
{5,6,7,15,…,19}      3.04     3.55     12.00    12.05      6.68     7.38

                    Mobility           Pain               Sleep
{1}                 63.26    63.03     34.43    34.93     61.37    62.93
{2}                 13.25    14.79     23.81    22.79     10.40    10.40
{3,4,8,…,14}        15.00    13.59     25.09    25.50     17.61    16.00
{5,6,7,15,…,19}      8.50     8.59     16.67    16.80     10.62    10.67

†Predicted distributions are presented for the extended CHOPIT model with unobserved heterogeneity.
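
The maximum differences quoted in the text can be recomputed directly from the Table 10 entries; a short check, with the table transcribed as Python literals:

```python
# Observed and predicted cell frequencies (%) for partition D2, from Table 10.
table10 = {
    "breathing":     ([71.09, 13.74, 12.13, 3.04],  [72.12, 14.60, 9.73, 3.55]),
    "concentration": ([38.12, 23.56, 26.32, 12.00], [38.84, 22.53, 26.58, 12.05]),
    "depression":    ([57.73, 17.01, 18.59, 6.68],  [58.60, 16.64, 17.39, 7.38]),
    "mobility":      ([63.26, 13.25, 15.00, 8.50],  [63.03, 14.79, 13.59, 8.59]),
    "pain":          ([34.43, 23.81, 25.09, 16.67], [34.93, 22.79, 25.50, 16.80]),
    "sleep":         ([61.37, 10.40, 17.61, 10.62], [62.93, 10.40, 16.00, 10.67]),
}

# Largest absolute observed-predicted gap (percentage points) per domain.
gaps = {d: max(abs(o - p) for o, p in zip(obs, pred))
        for d, (obs, pred) in table10.items()}
for domain, gap in gaps.items():
    print(f"{domain}: {gap:.2f} pp")   # breathing shows the largest gap, 2.40
```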

The tests using partitions of Y only essentially explore whether the parametric model reproduces the ranking distribution of vignette evaluations and self-assessments, which is a feature of the marginal distribution of the dependent variables. This does not yet correspond to the non-parametric approach—which compares two such rankings, distinguished on the basis of X (e.g. two countries or groups of countries, men and women, etc.). This is why we also want to consider partitions of Y×X. Each different partition of X gives a test with power in a specific direction, corresponding to the cells that are used by the non-parametric approach for comparing specific groups.

The remaining rows in Table 9 present the p-values for such tests. To guarantee sufficiently large sample sizes, we consider only partitions of X into two or three cells, leading to a partition of Y×X into between 2×3=6 and 3×6=18 cells. First, we consider a partition Y×country, where countries are divided into two groups—southern Europe (Greece, Spain and Italy) and the rest (Belgium, France, Germany, the Netherlands and Sweden). This north–south division corresponds to the systematic differences that have been found in many studies based on the Survey of Health, Ageing and Retirement in Europe; see, for example, many chapters in Börsch-Supan et al. (2005). In addition, we perform the test for partitions Y×sex and Y×age, with age categorized into younger than 56, 56–65 and older than 65 years, Y×education, with education categorized into low, middle and high, and finally Y×alone, distinguishing respondents living alone or not.

As expected (on the basis of the results for the partition of Y only) the null hypothesis is rejected at the 5% or even the 1% significance level for all partitions of Y×X for breathing, sleep and mobility: if the parametric model is already not able to reproduce the marginal distribution over the Y cells, it cannot give a good fit to the bivariate Y×X cells either. For the other domains, pain, concentration and depression, the results are mixed. p-values exceeding 5% are found for partitions Y×sex and Y×age for pain and for concentration. However, for all domains, the null hypothesis is rejected for a partition of X by country or level of education. The tests therefore do not support the use of the parametric model for comparing across (southern versus northern) countries or levels of education.

Table 11 presents the predicted and actual distributions over the four cells in partition D2 for southern and northern countries, illustrating the magnitude of the differences. In most cases, the differences do not affect the qualitative conclusions on cross-country differences. For breathing, for example, 83.8% of southern respondents report fewer problems than any of the vignettes, compared with 63.8% in the north, suggesting that, after correcting for response scale differences, people aged 50 years or older have greater problems with breathing in the north than in the south. Using the parametric model's predictions, the percentages are 83.0% and 65.9%, and the conclusion remains the same. The other domains lead to the same qualitative conclusion: whether we use the raw data or the parametric predictions sometimes changes which cell frequency is larger, but it does not affect the conclusion on where health in the given domain is better. This conclusion also holds for other partitions of X (see Tables 3–6 of the on-line appendix).

Table 11. Example of observed and predicted distributions (%) (cells constructed by using partition D2 of Y×country)†

D2 cells            South              North
                    Obs.     Pred.     Obs.     Pred.

Breathing
{1}                 83.79    82.96     63.80    65.92
{2}                  7.30    10.40     17.44    16.99
{3,4,8,…,14}         6.72     5.35     15.23    12.24
{5,6,7,15,…,19}      2.19     1.29      3.53     4.85

Concentration
{1}                 35.80    33.17     39.44    42.12
{2}                 20.95    24.01     25.06    21.67
{3,4,8,…,14}        28.34    28.44     25.17    25.50
{5,6,7,15,…,19}     14.91    14.38     10.32    10.70

Depression
{1}                 57.02    56.80     58.15    59.62
{2}                 13.91    16.56     18.78    16.69
{3,4,8,…,14}        20.42    18.15     17.52    16.95
{5,6,7,15,…,19}      8.65     8.49      5.55     6.74

Mobility
{1}                 65.90    64.25     61.73    62.33
{2}                 11.96    14.96     13.98    14.69
{3,4,8,…,14}        13.54    12.89     15.86    13.98
{5,6,7,15,…,19}      8.60     7.89      8.43     8.99

Pain
{1}                 33.04    30.07     35.24    37.74
{2}                 20.55    23.36     25.70    22.45
{3,4,8,…,14}        26.06    27.34     24.53    24.44
{5,6,7,15,…,19}     20.35    19.24     14.52    15.37

Sleep
{1}                 64.16    64.28     59.76    62.15
{2}                  7.29    10.64     12.19    10.26
{3,4,8,…,14}        17.82    14.95     17.49    16.61
{5,6,7,15,…,19}     10.73    10.12     10.56    10.99

†Predicted distributions are presented for the extended CHOPIT model with unobserved heterogeneity.

To see why the tests often reject the parametric model, we performed several sensitivity checks. First, Table 11 indicates that predicted and actual cell frequencies are not hugely different, but differences are apparently sufficiently large to reject. A possible reason is that the Andrews test accounts for the fact that parameters are estimated to fit the (same) data, by including the likelihood scores in the auxiliary regression that is used to compute the test statistic (the matrix B in Section 3). To see whether this matters, we also computed the test statistic without the likelihood scores. Table 7 of the on-line appendix shows the resulting p-values, which are often much higher than the correct p-values in Table 9. For sleep, for example, the fact that parameters are estimated apparently makes it likely that, under the null hypothesis, predicted and observed frequencies are very similar. Adjusting for this implies that the null hypothesis is already rejected for quite modest differences between predicted and actual cell sizes. For breathing, in contrast, all p-values remain virtually 0. Here the discrepancies are such that the null would also be firmly rejected if parameter values were given instead of estimated.

Finally, to study whether the results of the tests are driven by wrongly ordered vignettes (leading to ties in the rankings that are typically not used when interpreting the non-parametric results; see Section 4), we re-estimated the parametric model by using the observations with inline image only and recomputed p-values for all partitions in Table 9. The p-values for this subsample are similar to those for the whole sample (see Table 8 of the on-line appendix). It therefore seems unlikely that wrongly ordered vignettes are the source of the misspecification of the parametric model.

8. Conclusion


In the literature, there are two ways to use anchoring vignettes: parametric models (the CHOPIT model and its extensions) and a non-parametric approach based on ranking vignette evaluations and self-assessments in subsamples (such as countries or age groups). We consider tests for misspecification of the parametric models based on comparing data that are generated by the parametric model with the observed data. An attractive feature of our tests is that they compare exactly those features of the simulated and observed data that drive the non-parametric approach. In that sense, the tests have power in directions that matter: rejecting the null hypothesis implies that the rankings of self-assessments and vignette evaluations implied by the parametric model are inconsistent with the rankings in the raw data that are used for the non-parametric approach. We apply the tests to six health domains of the population aged 50 years and over in eight European countries, with mixed results: the standard CHOPIT model is always rejected, but an extension of it incorporating unobserved heterogeneity in the reporting scales is not rejected for several health domains and socio-economic characteristics.

What does this imply for studies using anchoring vignettes? It does not imply that the parametric models should be replaced by the non-parametric method: as explained in Section 4 the non-parametric approach has limited applicability and gives results that are difficult to interpret when there are many ties. The latter seems a major reason for considering parametric models: if vignettes are not ordered consistently by all respondents, the ties make it impossible to draw conclusions from the non-parametric comparisons. This problem does not arise in the parametric models, where idiosyncratic errors can explain any violation of the natural ordering of the vignette evaluations. Moreover, rejecting the specification of the standard CHOPIT model does not necessarily mean that this model does not help to correct for differences in reporting behaviour. In fact, the modest differences between predicted and actual rankings suggest that it does. Misspecification of the parametric models as such may not come as a big surprise, since, after all, such models are only simplified approximations to the data-generating process. The more important question is whether misspecification matters for substantive conclusions.

For researchers working on developing anchoring vignette models, our findings motivate future work on parametric and semiparametric models that are more flexible than the CHOPIT model but share its advantages: models that can be used to construct counterfactual distributions in combination with many covariates, and that can deal with idiosyncratic errors and the resulting ties in the same way as the CHOPIT model. As our results show, adding unobserved heterogeneity in the thresholds is one simple example of such an extension: with only one additional parameter, it helps to reduce misspecification problems substantially. But many other extensions can be considered: error terms could be heteroscedastic, and systematic parts could be made more flexible by using interactions and polynomial expansions. Unobserved heterogeneity can be incorporated in a more flexible way as in Heckman and Singer (1984), or error terms can follow more flexible, semi-non-parametric distributions as in Gallant and Nychka (1987). These more flexible specifications can be tested by using the misspecification test in this paper. For applied researchers who do not want to develop new models, our results imply the recommendation to use the extended CHOPIT model considered here and other (semi)parametric extensions, once these have been developed and validated.

Model misspecification may be less important than the validity of the two identifying assumptions, response consistency and vignette equivalence. But our results also have implications for research on testing these assumptions, which until now typically uses the CHOPIT model as a maintained assumption. They lead to the recommendation to use more flexible models than the CHOPIT model as a basis for these tests. With the current tests, the null hypothesis of (for example) response consistency could be rejected not because of a lack of response consistency but because the parametric model within which the test is applied is misspecified. Using more flexible models as a basis will make the tests more robust.

Acknowledgements


The authors are grateful to the Joint Editor, an Associate Editor and two reviewers for helpful comments. This research was partly funded by the US National Institute on Ageing and partly by Netspar. This paper uses data from the Survey of Health, Ageing and Retirement in Europe, release 2.0.1 (wave 1). The Survey of Health, Ageing and Retirement in Europe data collection was primarily funded by the European Commission through its fifth and sixth framework programmes (projects QLK6-CT-2001-00360, RII-CT-2006-062193 and CIT5-CT-2005-028857). Additional funding by the US National Institute on Aging as well as by various national sources is gratefully acknowledged (see http://www.share.project.org for a full list of funding institutions). Finalizing this paper was enabled by the project ‘The relationships between skills, schooling and labor market outcomes: a longitudinal study’ (P402/12/G130) funded by the Czech Science Foundation.

References

  • Andrews, D. W. K. (1988) Chi-square diagnostic tests for econometric models. J. Econmetr., 37, 135–156.
  • Angelini, V., Cavapozzi, D., Corazzini, L. and Paccagnella, O. (2012) Age, health and life satisfaction among older Europeans. Socl Indic. Res., 105, 293–308.
  • Bago d'Uva, T., Lindeboom, M., O'Donnell, O. and Van Doorslaer, E. (2011) Slipping anchor?: testing the vignettes approach to identification and correction of reporting heterogeneity. J. Hum. Resour., 46, 872–903.
  • Bago d'Uva, T., O'Donnell, O. and Van Doorslaer, E. (2008a) Differential health reporting by education level and its impact on the measurement of health inequalities among older Europeans. Int. J. Epidem., 37, 1375–1383.
  • Bago d'Uva, T., Van Doorslaer, E., Lindeboom, M. and O'Donnell, O. (2008b) Does reporting heterogeneity bias the measurement of health disparities? Hlth Econ., 17, 351–375.
  • Bonsang, E. and Van Soest, A. (2012) Satisfaction with social contacts of older Europeans. Socl Indic. Res., 105, 273–292.
  • Börsch-Supan, A., Brugiavini, A., Jürges, H., Mackenbach, J., Siegrist, J. and Weber, G. (2005) Health, Ageing and Retirement in Europe—First Results from the Survey of Health, Ageing and Retirement in Europe. Mannheim: Mannheim Research Institute for the Economics of Aging.
  • Börsch-Supan, A. and Jürges, H. (2005) The Survey of Health, Ageing, and Retirement in Europe—Methodology. Mannheim: Mannheim Research Institute for the Economics of Aging.
  • Chesher, A. and Irish, M. (1987) Residual analysis in the grouped and censored normal linear model. J. Econmetr., 34, 33–61.
  • Corrado, L. and Weeks, M. (2010) Identification strategies in survey response using vignettes. Working Paper in Economics 1031. University of Cambridge, Cambridge.
  • Datta Gupta, N., Kristensen, N. and Pozzoli, D. (2010) External validation of the use of vignettes in cross-country health studies. Econ. Modllng, 27, 854–867.
  • Gallant, R. and Nychka, D. (1987) Semi-non-parametric maximum likelihood estimation. Econometrica, 55, 363–390.
  • Heckman, J. and Singer, B. (1984) A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica, 52, 271–320.
  • Kapteyn, A., Smith, J. and Van Soest, A. (2007) Vignettes and self-reports of work disability in the U.S. and the Netherlands. Am. Econ. Rev., 97, 461–473.
  • King, G., Murray, C. J. L., Salomon, J. and Tandon, A. (2004) Enhancing the validity and cross-cultural comparability of measurement in survey research. Am. Polit. Sci. Rev., 98, 191–207.
  • King, G. and Wand, J. (2007) Comparing incomparable survey responses: evaluating and selecting anchoring vignettes. Polit. Anal., 15, 46–66.
  • Kristensen, N. and Johansson, E. (2008) New evidence on cross-country differences in job satisfaction using anchoring vignettes. Lab. Econ., 15, 96–117.
  • Lardjane, S. and Dourgnon, P. (2007) Les comparaisons internationales d'état de santé subjectif sont-elles pertinentes?: une évaluation par la méthode des vignettes-étalons. Econ. Statist., nos 403–404, 165–177.
  • Peracchi, F. and Rossetti, C. (2012) Heterogeneity in health responses and anchoring vignettes. Empir. Econ., 42, 513–538.
  • Peracchi, F. and Rossetti, C. (2013) The heterogeneous thresholds ordered response model: identification and inference. J. R. Statist. Soc. A, 176, in the press.
  • Rice, N., Robone, S. and Smith, P. (2010) International comparison of public sector performance: the use of anchoring vignettes to adjust self-reported data. Evaluation, 16, 81–101.
  • Rubin, D. (1984) Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist., 12, 1151–1172.
  • van Soest, A., Delaney, L., Harmon, C., Kapteyn, A. and Smith, J. P. (2011) Validating the use of anchoring vignettes for the correction of response scale differences in subjective questions. J. R. Statist. Soc. A, 174, 575–595.
  • Voňková, H. and Hullegie, P. (2011) Is the anchoring vignette method sensitive to the domain and the choice of the vignette? J. R. Statist. Soc. A, 174, 597–620.
  • Yates, D., Moore, D. and McCabe, G. (1999) The Practice of Statistics, 1st edn. New York: Freeman.

Appendix


Appendix A: Self-assessment questions and vignettes

Here we present examples of self-assessment questions and vignettes for two domains: concentration and pain. For other domains, see the on-line appendix. Vignettes are in increasing order of the seriousness of the health problems. All self-assessments and vignettes were rated on the five-point scale none, mild, moderate, severe and extreme.

A.1. Self-assessment questions
  • (a)
    Concentration: ‘Overall in the last 30 days how much difficulty did you have with concentrating or remembering things?’.
  • (b)
    Pain: ‘Overall in the last 30 days, how much of bodily aches or pains did you have?’.
A.2. Vignettes
A.2.1. Concentration
  • (a)
    v1: ‘Lisa can concentrate while watching TV, reading a magazine or playing a game of cards or chess. Once a week she forgets where her keys or glasses are, but finds them within five minutes. Overall in the last 30 days, how much difficulty did Lisa have with concentrating or remembering things?’
  • (b)
    v2: ‘Sue is keen to learn new recipes but finds that she often makes mistakes and has to reread several times before she is able to do them properly. Overall in the last 30 days, how much difficulty did Sue have with concentrating and remembering things?’
  • (c)
    v3: ‘Eve cannot concentrate for more than 15 minutes and has difficulty paying attention to what is being said to her. Whenever she starts a task, she never manages to finish it and often forgets what she was doing. She is able to learn the names of people she meets. Overall in the last 30 days, how much difficulty did Eve have with concentrating or remembering things?’
A.2.2. Pain
  • (a)
    v1: ‘Paul has a headache once a month that is relieved after taking a pill. During the headache he can carry on with his day-to-day affairs. Overall in the last 30 days, how much of bodily aches or pains did Paul have?’
  • (b)
    v2: ‘Henri has pain that radiates down his right arm and wrist during his day at work. This is slightly relieved in the evenings when he is no longer working on his computer. Overall in the last 30 days, how much of bodily aches or pains did Henri have?’
  • (c)
    v3: ‘Charles has pain in his knees, elbows, wrists and fingers, and the pain is present almost all the time. Although medication helps, he feels uncomfortable when moving around, holding and lifting things. Overall in the last 30 days, how much of bodily aches or pains did Charles have?’

Supporting Information

Testing the specification of parametric models using anchoring vignettes on-line appendix

Filename: rssa12000_sm_AppendixS1-S8.pdf (PDF, 157K) — supporting information item

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.