Keywords:

  • Causal variables;
  • Clinimetric scales;
  • Composite scales;
  • Construct validity;
  • Measurement scales;
  • Multi-item scales;
  • Quality-of-life instruments

Abstract


There is extensive literature on the development and validation of multi-item measurement scales. Much of this is based on principles derived from psychometric theory and assumes that the individual items form parallel tests, so that simple weighted or unweighted summation is an appropriate method of aggregation. More recent work either continues to promulgate these methods or places emphasis on modern techniques centred on item response theory. In fact, however, clinical measuring instruments often have different underlying principles, so adopting such approaches is inappropriate. We illustrate, using health-related quality of life, that clinimetric and psychometric ideas need to be combined to yield a suitable measuring instrument. We note the fundamental distinction between indicator and causal variables and propose that this distinction suffices to explain fully the need for both clinimetric and psychometric techniques, and identifies their respective roles in scale development, validation and scoring.


1. Introduction


Many multi-item scales are used in medicine and psychology. Examples may be found in all branches of medicine. Their use is even more frequent in psychology, reflecting the fact that many psychological attributes cannot be directly observed or measured. The principles of scale design and development are well documented, and many books describe methods of item selection, content validation, criterion validation, construct validation, reliability assessment, scaling and analysis (e.g. Lord and Novick (1968), Nunnally and Bernstein (1994) and Streiner and Norman (1995)). Most of the techniques take their origins in psychometrics and the social sciences, where researchers such as Likert (1952), Guttman (1944), Cronbach (1951) and Spearman (1904) developed scales for assessing intelligence, personality, educational attainment, mood states, opinion, ability and so on. These names are commemorated in the eponymous scales, coefficients and tests that continue to feature so prominently in statistics and psychometrics. The common feature of these techniques is that there are assumed to be one or more underlying psychological concepts, frequently described as constructs, that are not directly measurable. The postulated constructs are often termed latent variables. The above measurement techniques constitute what is now commonly termed the traditional psychometric approach to scaling. In this approach, the scores on the items in a measurement scale are regarded as continuous variates that reflect (with error) the subject's level on that concept for which a measure is desired. This leads to methods that are equivalent to linear models. Standard methodological books on the subject still tend to be associated with psychometrics, intelligence testing, personality testing and education, although the same methods are widely used whenever a multi-item instrument is developed.

More recently, modern psychometrics has adopted a probabilistic approach to the development of scales, and this has led to an increasing emphasis on item response theory (IRT) as the framework for the development of measurement scales (Hambleton et al., 1991; Nunnally and Bernstein, 1994). In contrast with the traditional approach, IRT adopts the perspective that higher levels of the latent variable are reflected by an increased probability that the subject will respond positively to each item. Since responses to most items are either dichotomous (e.g. `right' or `wrong' in educational tests) or ordered categorical (e.g. severity level of a mood state), this leads to models in which each of the multiple items in a measurement scale follow logistic (binary or ordinal) models.

This is all very well, but in both clinical medicine and health-related quality-of-life (QOL) research many scales include items, such as symptoms of disease and side-effects of treatment, that form a less homogeneous set than is found in psychometric tests. An example is tumour staging in cancer, which is based on cancer site-specific combinations of T (tumour) stage, N (nodal) stage and M (metastases) (Hermanek and Sobin, 1992). An earlier example is the Apgar score, measuring the state of health of newborn infants, which is based on a combination of disparate items, including body colour, heart rate, respiration, reflex response and muscle tone (Apgar, 1953). Feinstein called such scales clinimetric (Feinstein, 1987a) and subsequently wrote a highly influential book with the same title (Feinstein, 1987b). He defined clinimetrics to refer to

`arbitrary ratings, scales, indexes, instruments or other expressions that have been created as “measurements” for those clinical phenomena that cannot be measured in the customary dimensions of laboratory data'.

Feinstein argued that the aim of psychometricians is to create scales in which the multiple component items are all measuring more or less the same single attribute, but that this is in contrast with the aim of clinicians, which is to choose and emphasize suitably the most important attributes to be included in the index, using multiple items which are not expected to be homogeneous because they indicate different aspects of a complex clinical phenomenon. Thus there are two types of scale, psychometric and clinimetric, with different aims, and hence their development should follow different paths.

In fact, there is also a third type. The above two types might both be characterized as internal. Although they may aim to quantify a condition or to predict future outcomes, there is no gold standard measurement representing the true value of the outcome. Internal scales are either based on theoretical models for the underlying process or represent indices that serve as a definition of the outcome measure. In contrast, an external scale will be constructed on the basis that it is a good predictor of some measurable gold standard or future state. External measurement scales need not be based on a theoretical model. Instead, an empirical model is chosen purely on the basis of its predictive power. An example is the prediction of cure or survival from models based on clinical prognostic factors. For pragmatic reasons weighted sums are often used, derived from regression analysis. Of course, the difficulty with this approach is that an observable gold standard is required, to provide a dependent variable that is to be predicted. Without such a gold standard we are forced to fall back on internal approaches, based solely on the relationships between the items that have been measured (the psychometric approach) or on the required relationships between the observed items and the attribute for which an index is being defined (the clinimetric approach). We shall not be concerned with external scales here.

The two internal approaches, psychometric and clinimetric, result in scales containing different items. There continues to be much debate about which should be the preferred approach, and about how to choose between the approaches. In this paper we argue that some items should be regarded as `causal' variables rather than `indicator' variables, and that this distinction accounts for the differences between psychometric and clinimetric scales, and the need for both. Awareness of causal variables enables investigators to select the appropriate approach for the development of scales and explains the seemingly strange differences that have been observed by those who have made empirical comparisons of the two approaches.

We first illustrate typical problems by using a few published examples, and we shall examine these throughout the paper.

2. Examples


2.1. Example 1: construction of questionnaires—a retrospective study

The initial stages of the development of a questionnaire consist of first generating a comprehensive pool of potential items, and then selecting items from the pool for the final questionnaire. Juniper et al. (1997) developed an asthma QOL questionnaire and contrasted psychometric methods, in particular factor analysis, against a clinimetric approach using patients' opinions of importance (`impact').

Psychometric analysis resulted in a 36-item instrument, compared with 32 items by the impact method. Only 20 items were common to both. The impact method was chosen for the final instrument, on the grounds that it corresponded better to clinical sensibility. The psychometric method would have excluded three items of greatest importance to patients, and other items relating to important functional impairment would have been omitted; some arguably less important psychological items would have been included.

`We believe that all items of functional impairment that are important to patients, irrespective of their association with each other, should be included in a disease-specific quality of life instrument and therefore we use the impact method'.

2.2. Example 2: construction of questionnaires—a prospective study

Marx et al. (1999) carried out a study to construct a disabilities of the arm, shoulder and hand scale, using clinimetric and psychometric techniques including factor analysis, Cronbach's α and the exploration of correlation coefficients. They found that the clinimetric approach selected predominantly symptom items, whereas the psychometric method resulted in almost exclusively physical disability items. They commented `this research confirms the results of Juniper et al. … in a parallel, prospective, and blinded fashion' and suggested that `perhaps a combination of both methods may provide clinicians with the most appropriate scales'.

2.3. Example 3: instrument validation

The Rotterdam symptom checklist was until the mid-1990s a widely used QOL instrument for cancer clinical trials. Fayers and Hand (1997) reported seven trials and studies that used factor analysis to validate the scale structure of the Rotterdam symptom checklist. We found that there was agreement between researchers that the first factor represents general psychological distress and is a broad average of psychological problems, whereas the other factors are a combination of physical symptoms and side-effects. There was a considerable divergence of opinion about details of these physical subscales, and dispute about both the total number of factors and which items should be combined in the factors, with claims that there are two, four, five, seven or even nine factors.

Using data from a UK Medical Research Council clinical trial for colorectal cancer, we applied factor analysis and obtained a four-factor solution. In common with other studies, the first factor was general psychological distress. However, the second factor, a general `symptom' factor, contained a seemingly strange combination of items—including `lack of appetite', `decreased sexual interest' and `dry mouth'—and was completely different from the second factors reported by others. In our case, the other two factors were nausea and vomiting, and pain or aches. Thus we confirmed that factor analysis of the Rotterdam symptom checklist results in seemingly strange factor structure that varies from study to study.

2.4. Example 4: construction of scales

Bjordal et al. (1999, 2000) reported the validation of a QOL module for patients with head and neck cancers, European Organization for Research and Treatment of Cancer (EORTC) QLQ-H&N35. Psychometric `scaling analysis' revealed a few scaling errors:

`The main problem was the item about painful throat. From a clinical perspective, it was understandable that this item had a higher correlation with the Swallowing Scale. However, … it was decided for clinical reasons to leave this item in the Pain Scale.'

Also, some scales had low Cronbach's α:

`Some scales in the current module consist of items assessing different but related clinical aspects, such as the items in the Speech and Social Contact Scale. The aggregation of these items is based upon clinical grounds more than on psychometric theory. The reason for such subscale construction is the need for clinically sensible summary scores—not necessarily to make a scale with better internal reliability.'

They concluded:

`Most of the scales represent clinical scales, in which the items “hang together” in a clinically sensible manner, but they are not necessarily highly correlated. Internal consistency with high Cronbach's α is not so relevant in such scales.'

3. Indicator variables and causal variables


The majority of the items to be found in personality tests, intelligence tests, educational tests and other psychometric assessments reflect a level of ability or a state of mind. This has two implications. Firstly, implicit in this approach is the notion that the `thing' being measured exists—we are not merely defining it in terms of the variables that we choose to measure (and combine in some way), but each of these variables is assumed to have some relationship to an underlying concept which we are trying to measure. (Of course, the measure of the underlying concept can only be defined in terms of the variables that we have measured, but that is a different issue and is merely a reflection of the inadequacy of our collection of measured variables.) Secondly, the items do not alter or influence the underlying concept: they are merely aspects of it, or indicators of its magnitude. Such items have been given various names, including `effect indicators' and `manifest response variables'. We shall simply call them indicator variables.

In contrast, many scales from other fields, such as QOL scales, are not constrained to include merely indicator variables. They can include such variables, but they can (and typically do) also include variables which are part of the definition of what the concept being measured means. These measurements are very similar to the operational measurements of Hand (1996). This also has two implications. Firstly, it means that sometimes we are defining the thing being measured in terms of the variables that we select to measure it. In contrast with the psychometric approach, we are not postulating that something exists but are merely constructing an index which is convenient for some purpose. The variables, therefore, need not be indicator variables for the concept in question. It follows from this, and this is the second implication, that we can, in some sense, frequently regard these variables as `causal' since, if they are present (score highly, say) then the concept in question is present. Thus, in the case of QOL, a scale might include a measurement of pain as a component—not a result of low QOL, but a likely cause of it. Similarly, symptoms of a disease (while, perhaps, being indicator variables for that disease) or side-effects of a treatment could have an adverse effect on QOL. Clearly the symptoms and the side-effects are not indicator variables for QOL: they are not aspects of it, and they need not be present for low QOL to manifest itself. But if they are present then it is likely that the patient will have a low QOL score. Given all of this, it seems appropriate to characterize these as causal variables. In what follows we shall use a disease symptom as an example of a causal variable for low QOL.

The terms `causal indicator' and `effect indicator' are widely used in the field of structural equation modelling (Bollen, 1989). For us, however, these terms are not ideal—which is why we have dropped the word `effect'. In general, effects are the consequences of causes, but in our situation our `indicator' variables need not be consequences of the latent variable being measured but reflect aspects of it. They are fundamental to it, not necessarily consequences of it.

The field of structural equation modelling uses path diagrams to display the relationships between the variables, and to show the distinction between causal variables and indicator variables. However, the implications of this distinction have been slow to filter through to papers concerning the development of scales. Furthermore, papers that apply structural equation modelling to QOL instruments have only recently mentioned causal relationships (Romney et al., 1992; Molzahn et al., 1996; Romney and Evans, 1996) and have not recognized the additional problems that are posed by sufficient causes, as described below. Feinstein's (1987b) seminal book on clinimetrics neither explores the concept of causal relationships nor lists such words as `cause' or `causal' in its index.

Although much of this paper describes `models' of one kind or another, we must not forget that our ultimate aim is to produce an instrument that will yield a score. That is, we want to be able to combine the values of measured variables, preferably in a relatively simple way, to yield an index of (in our case) QOL.

3.1. Examples

In example 1, the items of functional impairment were described as being important to patients and having `impact' on them; these items were implicitly being regarded as causal, although that term was not used in Juniper et al. (1997).

In example 3, the second factor for the Medical Research Council data contained items for lack of appetite, decreased sexual interest and dry mouth—all of which are common symptoms of interferon treatment, and all likely to affect a patient's QOL. Thus they are likely to be causal variables, as are items of the third factor (nausea and vomiting) and fourth factor (pain or aches). The first factor, general psychological distress, is more plausibly a combination of indicator variables, reflecting the patient's current level of QOL.

In example 4, the pain scale was being defined by components such as `painful throat', and the authors recognized that this was a summary index, noting `the need for clinically sensible summary scores'.

4. Necessary and sufficient component causes


The distinction between indicator and causal variables has various implications for the way that we construct our measuring instruments. Sometimes the presence of just one of the causal variables may by itself serve to produce the outcome value: it can be a sufficient component cause.

For example, most patients who experience severe pain may as a consequence report a poorer QOL, and thus pain may be a sufficient cause for poor QOL. However, patients without any pain but who have other severe symptoms may also have a similarly poor QOL. Pain is thus an example of a sufficient but not necessary cause of reduced QOL. In general, a patient may suffer from one or more of several symptoms or side-effects, each of which may be regarded as causal for low QOL. This means that a QOL instrument should be such that a score on any one of the causal variables leads to a low QOL score.

In epidemiology the notion of causal variables is highly developed. In particular, Rothman (1976) is a classic paper on causes, introducing the concept of necessary and sufficient causes. These concepts are well understood in the epidemiological context, but there are two limitations in adopting the results for our purposes. Firstly, most statistical theory that has been developed by epidemiologists for causal models has focused on probabilistic binary outcomes, such as the risk of catching an infection or other disease. Secondly, the priority in epidemiological research is establishing causal mechanisms (aetiology) from a multitude of candidate risk factors. This is very different from choosing potential causes to include in a single index measure. Moreover, tests of a multicausal hypothesis predominate. After identifying causes, both necessary and/or sufficient, the interest then becomes that of estimating the contribution of each cause to the overall relative risk of the observed outcome (Rothman, 1986). This also is very different from the aim in the development of scales, where the objective is to determine a level for a single outcome measure such as QOL.

For the development of clinical scales, we are primarily concerned with sufficient component causes. Necessary causes are rarely relevant. For QOL, any one of a number of serious symptoms may be sufficient to cause poor overall QOL. More commonly, several symptoms may form components of the net QOL reduction. However, it has been noted

`should the component cause model give an accurate description of causation, many of the most popular statistical techniques would have to be revised. They build upon assumptions of additivity or multiplicability which only occasionally will be expected to fit the model. Furthermore, they focus upon estimating very simple and reduced summary statements which are quite irrelevant…'

(Olsen, 1993). Thus the concept of component causes leads to an entirely different procedure for creating summary scores. This can be illustrated by considering an artificial example. As we noted earlier, most clinical and psychometric scales use simple summated scores (or, equivalently, averages). Suppose that two factors, pain and vomiting, are each sufficient causes for explaining a reduced QOL. For each factor an independent function can be derived expressing its relationship to overall QOL. Then a logical scoring system might be the maximum of the pain function and the vomiting function, rather than their sum which would dilute a single high score by the presence of other lower scores. In practice, most symptoms are likely to be component causes that can combine and interact in various ways, and we are confronted by the problem of how to use them in a single model, with or without indicator variables.
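As a minimal numerical sketch of this scoring contrast (our own illustration, with entirely hypothetical severity values scored between 0 and 1), consider two patients: one with severe pain alone, and one with moderate pain and moderate vomiting.

```python
import numpy as np

# Hypothetical 0-1 severity scores for two sufficient component causes (illustrative only).
patients = {
    "A (severe pain only)":           {"pain": 0.90, "vomiting": 0.00},
    "B (moderate pain and vomiting)": {"pain": 0.45, "vomiting": 0.45},
}

for label, scores in patients.items():
    averaged = np.mean(list(scores.values()))  # summation/averaging dilutes a single high score
    worst = max(scores.values())               # component-cause logic: any one severe symptom suffices
    print(f"{label}: average = {averaged:.2f}, maximum = {worst:.2f}")
```

The average rates both patients identically (0.45), whereas the maximum preserves the fact that severe pain alone may be sufficient for poor QOL.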

Recognition of the relevance of sufficient causes to QOL assessment is hardly new: Aristotle (384–322 BC) noted (Rackham, 1926) that

`When it comes to saying in what happiness consists, opinions differ… and often the same person actually changes their opinion. When falling ill, it is said to be health; when hard up, it is said to be money.'

Given this, it is all the more surprising that researchers in QOL have not taken an explicit account of sufficient causes in QOL work.

5. Psychometric and clinimetric approaches


5.1. Traditional psychometrics

The traditional psychometric approach characterizes the underlying concept as a latent variable, with known relationships to multiple observed or manifest variables. The items in a scale are usually intended to be independent parallel tests, such that the items are estimates of the same true score (corresponding to the latent variable) and have uncorrelated error terms with zero means and equal variances. Most psychometric methods remain applicable to a more generalized form of test, called τ-equivalent tests, in which the constraint of equal variance error terms is relaxed. For parallel or τ-equivalent tests the items only appear to be correlated because of their relationship with the latent variable, i.e. they are conditionally independent for any given value of the latent variable.

In the traditional approach a model is built by the use of scaling methods, or factor analysis and its generalizations, and parameters are estimated which permit the individual patients' scores on the latent factor to be determined. A combination of multiple observed variables is used because this is more reliable and less prone to random error than single-item measures are, as well as yielding superior discriminatory power. The linear nature of the models means that the estimated value of the latent variable is given as a weighted sum of the observed variables. Details of the basic models may be found in Bartholomew (1987) and Basilevsky (1994).

Clearly such models may be appropriate for situations just involving indicator variables—e.g. for depression (Dunn, Sham and Hand, 1993). By assumption, any covariances between the indicator variables are solely attributable to their relationship with the common underlying factor. However, causal variables behave differently. Covariances between causal variables may exist irrespective of their relationship with the latent variable, this time induced by the common cause of these variables. If, for example, several symptoms of a disease (indicators for the disease, but causal for QOL) are included among the items, then they are likely to be correlated through their relationship to the severity of disease, even after we condition for the value of QOL. Thus causal variables do not constitute parallel or τ-equivalent tests. Including these symptoms in a global factor analysis with the indicator variables would be inappropriate for two reasons. First, the fact that they are causal, not indicators, for QOL, would mean that they need not have the proper relationship to the underlying factor tapped by the indicator variables. And, secondly, their mutual correlation suggests another factor, distinct from that leading to the scores on the indicator variables.
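The contrast can be made concrete with a small simulation (our own sketch; the linear data-generating equations and coefficients are arbitrary assumptions, not estimates from any instrument). Indicator items generated from a common QOL variable are essentially uncorrelated once QOL is conditioned on, whereas symptom items driven by a common disease-severity variable remain correlated even after conditioning on QOL.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

disease = rng.normal(size=n)                # disease severity (external cause)
qol = -0.8 * disease + rng.normal(size=n)   # QOL partly caused by disease

# Indicator items: reflect QOL plus independent error (parallel-test style).
ind1 = qol + rng.normal(size=n)
ind2 = qol + rng.normal(size=n)

# Causal items: two symptoms, both driven by disease severity rather than by QOL itself.
sym1 = disease + rng.normal(size=n)
sym2 = disease + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    res_a = a - np.polyval(np.polyfit(c, a, 1), c)
    res_b = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(res_a, res_b)[0, 1]

print("indicators given QOL:", round(partial_corr(ind1, ind2, qol), 3))  # close to 0
print("symptoms given QOL:  ", round(partial_corr(sym1, sym2, qol), 3))  # clearly positive
```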

In summary, the basic traditional psychometric approach is only really suited to situations involving indicator variables, and indicator variables alone.

An extension of the basic factor analysis model called a MIMIC model (standing for multiple indicator, multiple cause; see, for example, Dunn, Everitt and Pickles (1993)) is, at least at first sight, more appropriate. Whereas, in the factor analysis model, all the manifest variables are influenced by the latent factor, in a MIMIC model multiple causes together influence the factor, which itself then influences the indicators. In view of our discussion about causal and indicator variables earlier, this model is certainly nearer the mark. However, even such models are not ideal. In particular, since MIMIC models are linear structural relations models, they do not handle component causes properly—the central QOL factor is taken as a weighted sum of the causes, rather than something like the maximum of them, which we have argued is more appropriate.
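As a schematic illustration of the MIMIC structure (a data-generating sketch only, with made-up weights, not a fitted model), the causes feed a weighted sum that defines the latent factor, and the factor in turn drives the indicators:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

causes = rng.normal(size=(n, 3))                  # observed causes (e.g. symptoms, side-effects)
gamma = np.array([0.6, 0.3, 0.4])                 # illustrative cause -> latent weights

# MIMIC form: the latent factor is a weighted *sum* of the causes plus a disturbance ...
latent = causes @ gamma + rng.normal(scale=0.5, size=n)

# ... and the latent factor then drives the indicator items.
loadings = np.array([0.9, 0.8, 0.7])
indicators = np.outer(latent, loadings) + rng.normal(scale=0.5, size=(n, 3))

print(np.round(np.corrcoef(indicators, rowvar=False), 2))  # indicators correlate via the latent factor
```

The limitation noted above is visible in the line defining the latent factor: it is necessarily a weighted sum of the causes, so a component-cause structure (something like the maximum of the causes) cannot be expressed within this linear form.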

5.1.1. Examples

In examples 1–3 factor analysis was used. In all three examples, items that are most plausibly causal (notably, symptoms) were either automatically dropped from the factor model irrespective of their clinical importance or, as in example 3, resulted in apparently strange factors that varied from study to study.

5.2. Item response theory

IRT is a more recent development. The early forms were based on responses to binary items. The model provides probabilities for the responses of the patients or subjects in terms of the difficulty of the items and the ability (for example) of the subjects. In an educational setting, students' ability represents the latent variable, and the items, which are examination questions or tests, are chosen to be of varying difficulty.
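For readers unfamiliar with the form of these models, the following is a minimal sketch of a standard two-parameter logistic item response function (the generic textbook form, not a model fitted to any of the instruments discussed here).

```python
import numpy as np

def item_response_prob(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic model: probability of a positive response
    given latent trait theta, item difficulty and item discrimination."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

theta = np.array([-2.0, 0.0, 2.0])                 # low, average and high levels of the trait
print(item_response_prob(theta, difficulty=0.0))   # an 'easy' item: endorsed by most subjects
print(item_response_prob(theta, difficulty=1.5))   # a 'hard' item: endorsed mainly at high theta
```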

Recent extensions have generalized this basic form (see, for example, Hambleton et al. (1991)). Item response models have not often been applied in clinical contexts, but there has recently been increasing interest, as illustrated by the examples of its use with QOL instruments given by Haley et al. (1994), Cella et al. (1996), Stucki et al. (1996), McHorney et al. (1997) and Raczek et al. (1998). As with factor analytic models, in the basic IRT model, correlation between items arises solely because the items are tapping a single underlying continuum.

As before, IRT models are not well suited to situations involving causal variables. Once again, independence of causal variables conditional on given values of the underlying QOL will not hold in general. Causal variables may be correlated by virtue of having been caused by disease or treatment. This conflicts with the assumption of local independence that is made to estimate the parameters in IRT models. Moreover, the frequency of occurrence of a symptom does not relate to its importance in leading to low QOL: pain, for example, might occur with the same frequency in the sample as some other minor symptom, and yet the impact on QOL might be much greater.

5.3. Clinimetrics

The objective of the psychometric approaches outlined above might be characterized as attempting to measure a single attribute by using multiple items. In contrast, clinimetric methods attempt to summarize multiple attributes with a single index. Now, of course, this cannot be done—a single index loses the intrinsic differences between the attributes and it also sacrifices any possibility of allowing for an interaction (in the presence of attribute A, attribute B leads to poor QOL, but in the presence of C it does not). Instead the researchers aim their strategies at choosing and suitably emphasizing the most important attributes to be included in the index (Wright and Feinstein, 1992). In fact, of course, we are choosing how to define the concept being measured by our choice of variables and the way of combining them. This is why we have used the verb `to construct' in the context of clinimetric scales, but not in the context of psychometric scales. Clinimetric scales are different from psychometric scales because, in the former, the items to be included are chosen according to what we want the scale to do, whereas in the latter they are chosen because they are thought to be related to an underlying concept which defies explicit measurement. (The issue of component causes is a further complication.) The psychometric approaches are based on models of the relationships between the underlying latent variables and the measured variables. This model permits us to infer backwards from the measured variables, to deduce the likely value of the unobserved latent variable (through factor analysis or IRT). In the psychometric case, sophisticated statistical methods are used to ensure that the items are homogeneous in measuring a single construct, and that the underlying continuum is unidimensional. It will be clear from the above characterizations that the very question of homogeneity is contrary to what the clinimetric approach is trying to achieve.

The clinimetric approaches are based on a deliberate choice of what variables to include and, in the absence of an underlying model, a deliberate choice of how those variables should be combined. In the case of component causes, outlined in Section 4, this is that any large component cause is sufficient to imply poor QOL. No data analysis is needed, nor is it appropriate, for deciding how to combine the individual constituents of the clinimetric model (but see Section 9). Any relative importance to be attributed to the individual components must come from outside the data—from the objectives of those constructing the scale. This should not be regarded as a shortcoming: it forces the constructors of a scale to decide exactly what it is they want their index to measure, and to make this public.

In summary, practical distinctions between psychometric and clinimetric scales are that the former will—or should—only contain indicator variables whereas the latter will typically include one or more causal variables. Also, psychometric scales should be unidimensional, in the sense that all items reflect a single latent variable, whereas clinimetric scales may consist of a composite index that combines items representing several distinct latent variables.

6. Validating scales


The validation of a scale is crucial. Without examination and evidence of the effectiveness of a scale for its purpose, the scale is useless. Worse, it may even be misleading. In this section we examine a few aspects of validating scales. A more comprehensive description of scale development and validation is given in Fayers and Machin (2000).

Many methods of validation rely heavily on the analysis of inter-item or interscale correlations. This possibility, which obviously only exists when scales comprise multiple items, is invaluable: `with a single measure of each variable one can remain blissfully unaware of the possibility of measurement error' (McIver and Carmines, 1981). However, caution must be exercised when using such approaches with scales containing causal variables.

Essentially, such methods seek to explain the correlation between different items in terms of a postulated underlying factor. However, we have already noted that causal items are frequently correlated by virtue of being themselves caused by another factor, external to the target concept. For example, when measuring QOL in cancer patients, chemotherapy may induce one cluster of correlated symptoms, whereas radiotherapy might result in a rather different correlation pattern; oesophageal cancer patients may have one set of eating-related problems, whereas laryngeal or gastric cancer patients may have different patterns. Thus the correlations between causal variables will frequently be due to extraneous factors, not QOL, and will vary from study to study.

Furthermore, when these items form component causes, they may have very low correlations yet still be deemed to belong to the same scale. Thus, sometimes a low correlation between two causal items may even provide evidence for the need to retain both—each may be an important independent sufficient cause for the concept with which we are concerned.

Hence those methods of validating scales that depend on an analysis of inter-item correlations will generally be highly suspect when causal items are present and may lead to misleading or meaningless results.

In the following subsections we look at various aspects of the evaluation of scales and explore the effect of the presence of causal variables.

6.1. Content validity

Content validity is concerned with the extent to which the items comprising a scale cover all aspects of the latent variable and no additional features. In educational tests, for example, it would be unreasonable to include items that are not related to the syllabus on which the students are being examined, and to do so would result in a test which is not assessing what it purports to assess. At the same time, the test should cover a wide range of relevant aspects from the syllabus, since otherwise there could be undetected differences in knowledge or ability between the students. This is more important for causal variables than for indicator variables. For example, if QOL is being evaluated through the presence of symptoms, it is essential to ensure that all the disease-specific symptoms that might affect QOL are covered. If a serious but unexpected or unanticipated side-effect occurs, the QOL may be greatly affected. In contrast, by virtue of the fact that indicator variables are tapping the same underlying concept, the omission of one is not so crucial.

6.1.1. Example

In examples 1 and 2, psychometric analyses would have resulted in the omission of items that were rated as clinically important or rated by patients as being of high impact. In these examples the authors recognized the need to retain all these items and realized that clinimetric methods resulted in superior scales.

6.2. Construct validity

Construct validity embraces a variety of techniques for assessing the degree to which an instrument measures the concept that it was designed to measure. In practical terms, it is concerned with testing dimensionality (is the assumption that there is a single latent variable supported by the evidence?), testing homogeneity (do all the items appear to be tapping into the same latent variable?) and, if there is more than one latent variable, testing the extent to which they overlap (do some items from one subscale correlate with other latent variables?).

Construct validation is best seen as a process of learning more about the joint behaviour of the items, and of making and testing new predictions about this behaviour. Factor analysis is, of course, a key technique in this process. Unfortunately, since factor analysis is based on the covariance structure between the items, it is as we have seen of limited relevance when causal variables are present. A global factor analysis of the items may give very misleading information about the latent variable of interest if causal variables are present; at best, it may reflect such constructs as clusters of disease-related symptoms (syndromes) or therapeutic side-effects, rather than constructs reflecting dimensions of QOL. However, factor analysis of subgroups of items might lead to useful ways of condensing those causal variables for QOL that are indicators of some common external influence, such as treatment. We return to this possibility in Section 9.

6.2.1. Example

In example 3, factor analysis resulted in a `symptom factor' that contained items which were unexpected in terms of their relationship with QOL. These items were merely the common side-effects of the particular form of therapy that was used in the Medical Research Council trial and were correlated through their relationship with treatment, not QOL. This also explains why other investigators had reported apparently unstable factors, differing from study to study according to the disease subgroup and the therapy being applied.

6.3. Reliability

The reliability of a scale (Dunn, 1989) is the extent to which the scale yields reproducible and consistent results. Confusingly, this word is used for two distinct aspects of validating scales. Firstly, under traditional psychometric theory, multi-item scales should be homogeneous with high internal reliability or internal consistency. Secondly, measurements and scores should be repeatable, in that, if the test is applied on two occasions to a patient whose condition is stable, the results should be similar. We discuss repeatability–reliability in the next subsection.

The most common method for assessing internal consistency is Cronbach's α (Cronbach, 1951), which is a form of intraclass correlation. It is closely related to convergent validity, i.e. the extent to which the items in a scale are all highly intercorrelated. If the items are uncorrelated then α=0, whereas if the items are identical α=1. This is appropriate for measures involving indicator variables, since any subset of those variables will be tapping the same underlying concept—if the model is correct. However, Cronbach's α is generally unsuitable for situations involving causal variables, firstly because dropping such a variable from the measure clearly distorts it substantially, and secondly because, as we have seen, the inter-item correlation structure will depend on extraneous factors. In some special cases, however, something along these lines might be possible. For example, if each causal variable is, itself, a latent variable derived from several manifest variables, then dropping components of each causal submodel might be appropriate.
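For reference, Cronbach's α is computed as k/(k−1) multiplied by 1 minus the ratio of the sum of the item variances to the variance of the total score. A minimal sketch follows (the toy scores are arbitrary):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects x k_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Toy data: five subjects answering three items intended to tap one construct.
scores = [[1, 2, 2], [3, 3, 4], [2, 2, 3], [4, 4, 4], [1, 1, 2]]
print(round(cronbach_alpha(scores), 2))
```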

6.3.1. Examples

In example 4, the authors noted that Cronbach's α was low for some scales and attributed this to the heterogeneous nature of the clinical summary index. The items concerned were invariably those that were obviously causal. Marx et al. (1999), in example 2, also found that whereas their psychometric strategy resulted in scales with high α, as expected, this was not the case when they used the clinimetric approach.

6.4. General issues of validation

In the preceding subsections we pointed out the difficulties of assessing reliability and validity when causal variables were involved. The fact that it is difficult does not detract from its importance.

Sometimes some indications of validity can be gained from adaptations of methods aimed at tapping a hypothetical gold standard. For example, known groups validity is based on the hypothesis that certain specified groups of patients are expected to score better (or worse) than others. Thus patients with advanced cancer might be expected to have a poorer QOL than those with early disease. If the scale is valid it should show differences between these groups, and the differences should be in the expected direction. If a scale cannot successfully distinguish between groups for whom there are known differences, either because it lacks sensitivity or because it yields results that are contrary to expectations, the scale is hardly likely to be of value for many other purposes.

A similar argument applies if more than one scale, purportedly assessing the same or similar concepts, is available. However, if clinimetric scales are not identical, they are defining different summary indices and should not be expected to yield identical scores. In general, even for psychometric scales, we would rarely expect two instruments to give identical scores, and when they do not there remains the question of which is the better. Problems of interpretation become simpler if one scale is regarded as the gold standard, as for example when the objective is to produce a shortened form that is almost as effective as the original.

The usefulness of a measure depends on its ability to detect clinically significant differences. This is often described in terms of sensitivity (the ability to detect differences between groups) and responsiveness (the ability to detect changes over time within a patient). Sensitivity can be investigated in a study comparing groups of patients, and responsiveness can be explored by longitudinal studies. Different scales can then be compared by using such criteria as the standardized response mean (the ratio of the mean change to the standard deviation of that change) or the effect size (the ratio of the mean change to the standard deviation of the initial measurement), as in Fayers and Machin (2000). Again, these are not affected by the inclusion of causal variables in the model.
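These two criteria are simple to compute from baseline and follow-up scores; a minimal sketch with invented 0–100 scores:

```python
import numpy as np

baseline = np.array([55.0, 60.0, 45.0, 70.0, 50.0])   # illustrative scores at entry
followup = np.array([65.0, 72.0, 50.0, 78.0, 61.0])   # illustrative scores after treatment
change = followup - baseline

srm = change.mean() / change.std(ddof=1)              # standardized response mean
effect_size = change.mean() / baseline.std(ddof=1)    # effect size (relative to baseline SD)

print(f"SRM = {srm:.2f}, effect size = {effect_size:.2f}")
```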

Test–retest studies of patients who are stable should yield consistent results—the level of agreement between the two occasions is a measure of the reliability or repeatability of the instrument. If the period of stability is sufficiently long that memory effects can be ignored, then this is an approach that can be adopted even with scales which include causal items. For example, patients with stable disease and not receiving treatment might perhaps be expected to have a stable QOL over periods of a few weeks. Of course, the determination of stability then must be made by using some external criterion.

Interrater reliability is another aspect of repeatability. Do we expect independent observers to obtain comparable results and, if so, how large is the difference? Traditional test theory defines reliability as the ratio of the variance of the true scores to that of the observed scores and uses measures such as intraclass correlations or Cohen's κ to standardize the results relative to the amount of agreement that we would expect to achieve by chance. More recent developments in generalizability theory use analysis of variance to decompose errors into different sources (Dunn, 1989). However, in QOL assessment, the rationale for `asking the patient' is that only the patient can give a meaningful assessment of their own QOL, and so the use of other raters is dubious.
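As a reminder of the chance-corrected agreement idea mentioned here, a minimal sketch of Cohen's κ for two raters (the ratings are invented):

```python
import numpy as np

def cohens_kappa(rater1, rater2, categories):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_observed = np.mean(r1 == r2)
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Invented ordinal ratings on a four-point scale for ten subjects.
r1 = [1, 2, 2, 3, 4, 4, 1, 2, 3, 3]
r2 = [1, 2, 3, 3, 4, 3, 1, 2, 3, 2]
print(round(cohens_kappa(r1, r2, categories=[1, 2, 3, 4]), 2))
```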

6.5. Item selection

In all questionnaires the selection and wording of items is of crucial importance, and each item should be carefully evaluated before it is included (Fayers and Machin, 2000). In particular, new questionnaires should be tested by using pilot studies. For scales consisting solely of indicator items, standard psychometric methods can be applied. In particular, any items that are poorly correlated with the others in the same scale might be removed, since this suggests that they are tapping into other factors. Equally, very high correlations may indicate that some items are redundant.

For causal items, however, the situation is somewhat different. They will have been chosen on the basis that they constitute an important aspect of the concept—are intrinsic to the definition of that concept. Therefore they cannot be dropped, without careful thought for what this implies about the nature of the latent variable under investigation. Breadth of coverage is particularly important here, and it is crucial to ensure that all relevant causal areas are included (Fayers et al., 1998). For QOL, usually this will be based on interviews with patients.

In practice, the optimal approach would seem to be to combine the two methods, relying on psychometric techniques for the presumably homogeneous scales that comprise items which are likely to be indicator variables; factor analysis enables the exploration of dimensionality, and internal reliability and consistency can be assessed by using standard techniques. In contrast, the items that are ostensibly causal, including in particular symptoms and side-effects, are best handled by such clinimetric techniques as asking the patients to rate their impact.

6.5.1. Examples

Given the above, it is not surprising to note that clinimetric and psychometric strategies result in scales which include different sets of items, as seen in our examples 1 and 2. In example 1, fewer than half the items were common to both approaches. Fayers et al. (1998) observed that the notion of causal variables sufficed to explain all the differences that Juniper et al. (1997) had noted in example 1: the items deemed important but excluded by the psychometric approach were without exception causal items (problems associated with cigarette smoke, having to avoid dust, weather or air pollution, etc.). However, those that would have been included if psychometric methods had been used but excluded using clinimetrics were indicator variables that reflected QOL but were regarded by patients as of less importance. Similarly, in example 2, which sought to construct a disabilities scale, the clinimetric approach ranked 12 symptom items and three psychological disability items in the top 30, whereas the psychometric approach selected almost exclusively physical disability items— these serving as indicator variables for the disabilities scale.

7. Scoring


As we have noted earlier, indicator scales are most frequently scored using simple summation—the items are scored using integers corresponding to severity, and these scores are summed. This, of course, is based on the notion that there is an underlying factor contributing to each measured item. Since different subscales may contain different numbers of items, it is common either to use an average instead of the total score or, more frequently, a score that has been standardized to lie between 0 and 100 (Ware et al., 1993; Fayers et al., 2001; de Haes et al., 1996). If the items form truly parallel tests with indicator variables, this is an eminently sensible procedure since each item is assumed to be an unbiased estimate of the same latent effect. In principle it ought to be possible to improve on this by giving different weights to each item, since it is very likely that some items are more important than others. However, various researchers have noted that in practice weighting makes little impact (Wainer, 1976; Dawes, 1979; Cox et al., 1992; Prieto et al., 1996). Equal weights may not be optimal, but they are usually adequate and it is obviously so much simpler to use equal weights. There are also implicit assumptions that all items are scored with the same number of categories, and that the categories correspond to an equal interval scale. For example, for the EORTC QLQ-C30 instrument (Aaronson et al., 1993; Fayers et al., 2001) it is assumed that the effect of moving from a score of `1 ≡ not at all' to `2 ≡ a little' is of equal magnitude to changing from `3 ≡ quite a bit' to `4 ≡ very much'.
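The summation-and-standardization convention described above can be sketched as follows (a generic illustration; the exact scoring rules of specific instruments such as the QLQ-C30 differ in detail and should be taken from their scoring manuals):

```python
import numpy as np

def standardized_score(item_scores, item_min=1, item_max=4):
    """Average the item scores and rescale linearly to the range 0-100."""
    raw = np.mean(item_scores)                      # averaging copes with differing numbers of items
    return 100.0 * (raw - item_min) / (item_max - item_min)

# Hypothetical responses on a four-point scale (1 'not at all' ... 4 'very much').
print(round(standardized_score([2, 3, 2]), 1))   # 44.4
print(round(standardized_score([4, 4, 3]), 1))   # 88.9
```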

Scales based on IRT will include items of varying difficulty and again a simple approximation might be to use a summation of scores on the grounds that the more able subjects are expected to obtain a higher proportion of `correct' answers. `Activities of daily living' scales are often used as a measure of physical functioning. These scales consist of a series of items of varying difficulty, such as being able to dress without help through to being able to carry heavy loads, to climb stairs or to take long walks. The items in such scales are often summed. However, IRT models allow an estimation of difficulties of items and hence more accurate methods of scoring—but it is still not clear whether the extra complexity of IRT scoring makes very much difference in the results (McHorney et al., 1997) although some researchers suggest that it can result in more sensitive scales (Raczek et al., 1998).

When causal variables are involved, especially component causes, both simple summation and weighted sums are less easy to justify. There is no common latent factor being tapped by each of the causal variables. They may be acting in an entirely independent way, so that any one, by itself, should lead to a poor QOL, if sufficiently severe. We noted earlier that under these circumstances an appropriate model might be to take the maximum of the causal variables. A variant on this idea has been suggested by Blalock (1982). For variables xj taking values between 0 and 1, with high values meaning that the cause is present, the value of y defined by

y = 1 − ∏j (1 − βj xj)

is large if any one xj-variable is high. This model is also equivalent to the multiplicative utility formula proposed by Torrance et al. (1996) for combining patients' preference values relating to items into an overall utility score. The classical linear models (based on factor analysis, for example), leading to simple weighted sums, are only appropriate for the indicator variables.
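To make the behaviour of this combination rule concrete, the following sketch evaluates the form given above for a few invented configurations of three causal variables (the weights βj are arbitrary choices for illustration; see Section 8):

```python
import numpy as np

def combined_score(x, beta):
    """y = 1 - prod_j(1 - beta_j * x_j): large whenever any single weighted cause is large."""
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return 1.0 - np.prod(1.0 - beta * x)

beta = np.array([0.9, 0.9, 0.5])                    # illustrative importance weights

print(combined_score([1.0, 0.0, 0.0], beta))        # one severe cause     -> 0.90
print(combined_score([0.3, 0.3, 0.3], beta))        # several mild causes  -> about 0.55
print(combined_score([1.0, 1.0, 1.0], beta))        # all causes severe    -> about 0.995
```

Unlike a weighted sum, a single severe cause is not diluted by the absence of the others, and the score remains bounded above by 1.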

Thus indicator variables may often be combined by using unweighted sums; alternatively, if preferred, data analytic methods may be used. For causal variables, in contrast, the choice of weights βj is of crucial importance and these are either part of the definition of the latent variable or are a matter of opinion, based, for example, on patients' preferences or utilities.

8. Choice of weights for causal variables


A small value of βj means that the jth factor is regarded as relatively unimportant, but there remains the question of how to choose the βj in such a model. Since these represent the clinimetric aspect of the model, they should be chosen on the basis of how we want to weight the relative components of the model—how we want to define the measuring instrument. The choice of the βj-values is thus not a data analytic problem, but one of definition.

For QOL scales, since causal variables have an effect on patients' QOL, it would seem necessary to base any scoring method on some measure of impact derived by asking the patient, with item weights derived from patients' opinions. We could base the weights on a consensus derived from community estimates or average ratings of the relative importance (not severity) of the various components from a sample of patients, on the assumption that most patients will place approximately similar importance on the various symptoms and side-effects (McKenna et al., 1981; Kaplan et al., 1993). This is a widely used approach for methods based on psychometric models, but in fact patients may differ considerably on what they regard as important (Tugwell et al., 1987; O'Boyle et al., 1992), and patients even define the components of QOL for others differently from the way in which they perceive their own QOL (Montazeri et al., 1996). A few QOL instruments allow patients either to select the things that they regard as having the greatest effect on their life or to place relative importance on the items (Tugwell et al., 1987; O'Boyle et al., 1992; Hickey et al., 1996).

Other approaches have been proposed, for example based on scenarios representing disease states that are rated by patients. The multiattribute utility approach of Torrance et al. (1996) is closest to our model and involves an empirical assessment of patients' utilities for different combinations of item states. Other forms include judgment analysis, used by Browne et al. (1997), and discrete choice conjoint analysis (Ryan and Farrar, 2000).

No form of preference analysis and weighting is without problems. One problem, of course, is separating the relative importance from the current relative severity: severe pain today may mean that pain is weighted more highly than it was yesterday.

9. Combining causal and indicator variables


Frequently there is little need or justification for combining causal and indicator variables; often they naturally group themselves into distinct subscales that assess particular aspects, or `dimensions', of QOL. The purpose of subscales, especially those involving symptoms, is usually to assess one or more aspects of QOL in greater detail, to learn about the impact of disease or therapy. In general, we recommend creating scales that avoid mixing causal and indicator variables.

Also, there is on-going debate about whether QOL is a multidimensional construct—for example, many claim that it has distinct dimensions, such as emotional, physical, social and cognitive, that are impossible to summarize by a single number. In apparent contradiction, however, most people accept that it is possible to reply to global questions such as `All things considered, how would you rate your overall quality of life?' or even, in principle (although it might be considered unethical to ask patients), `Do you feel life is worth living?', and that therefore there can be a single latent variable. As we note in the discussion, many researchers argue that, if one does require a global assessment of QOL, the simplest solution is to ask the patient directly.

However, we may at times seek to combine the two types of variable into a composite summary scale. Also, we may wish to use a modelling approach to explore the interrelationships and contributions of various items. This leads us to consider models that allow the combination of indicator and causal variables. The following is one such model, provided that we assume that we can characterize each variable as either solely causal or solely indicator. In our model we have taken QOL to be a measurable latent variable.

First, groups of indicator variables are each reduced to a score on their common single factor by a factor analysis. We shall require that the components of our model take values between 0 and 1, with 1 corresponding to poor values. If there are m factors, we write z_i for the ith factor (i = 1, …, m). Then, if the factor resulting from the ith group of indicator variables is Σ_j α_ij x_ij, taking (for illustration) arbitrary real values, a logistic transformation to

\[ z_i = \bigl\{ 1 + \exp\bigl( -\textstyle\sum_j \alpha_{ij} x_{ij} \bigr) \bigr\}^{-1} \]

will yield a component for the overall model with the requisite properties.
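A minimal sketch of this transformation follows (hypothetical loadings; the sign convention assumes that larger factor scores correspond to worse outcomes, so that values of z_i near 1 indicate poor functioning).

```python
import numpy as np

# Minimal sketch of the logistic transformation (hypothetical loadings; the
# orientation of the loadings is assumed to make larger factor scores mean
# worse outcomes, so that z_i near 1 corresponds to poor values).
def logistic_component(x, alpha):
    """Return z_i = 1 / (1 + exp(-sum_j alpha_ij x_ij)) for one indicator group."""
    return 1.0 / (1.0 + np.exp(-np.dot(alpha, x)))

print(logistic_component(x=[1.2, -0.4], alpha=[0.8, 0.5]))   # about 0.68
```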

The causal variables will frequently be single measured variables. However, sometimes they might themselves each be the result of a factor analysis of a set of related variables, as we suggested in Section 6; this could be applicable if, for example, a group of causal variables formed a stable factor representing an aspect of the disease or treatment, or if the causal variables were thought to reflect one homogeneous underlying causal factor. We shall denote these causal variables or causal factors by zm+1,…,zp (after logistic transformations, if necessary, so that they take values between 0 and 1).

The various components in the model—the factors derived from the groups of indicator variables, and the measured or derived causal variables—are then combined using a Blalock-type model:

\[ \mathrm{QOL} = 1 - \prod_{i=1}^{p} (1 - z_i)^{\beta_i} \]

In full, letting x_ij denote the jth component variable of the ith causal variable or the jth indicator variable from the ith group of indicator variables, where i = 1, …, p,

\[ \mathrm{QOL} = 1 - \prod_{i=1}^{p} \Bigl[ 1 - \bigl\{ 1 + \exp\bigl( -\textstyle\sum_j \alpha_{ij} x_{ij} \bigr) \bigr\}^{-1} \Bigr]^{\beta_i} \]

The values of βi are obtained as described in the previous section.
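Assuming the Blalock-type model takes the multiplicative form shown above, the combination step might be sketched as follows (hypothetical component values and weights).

```python
import numpy as np

# Sketch of the combination step, assuming the Blalock-type model takes the
# multiplicative form shown above: each component z_i lies in (0, 1) with 1
# meaning poor, and the overall score approaches 1 whenever any heavily
# weighted component does. The weights beta_i are supplied externally
# (Section 8); the component values below are hypothetical.
def blalock_combine(z, beta):
    """Return 1 - prod_i (1 - z_i)**beta_i."""
    z, beta = np.asarray(z, dtype=float), np.asarray(beta, dtype=float)
    return float(1.0 - np.prod((1.0 - z) ** beta))

components = [0.10, 0.85, 0.20]   # e.g. one indicator factor and two causal items
weights = [1.0, 0.7, 0.4]
print(blalock_combine(components, weights))   # roughly 0.78, driven by the 0.85 component
```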

Again we note that we are not proposing this as a complete solution, but merely a suggestion for one way forward. It does depend on the ability to decide which variables are causal and which are indicator—discussed further in the next section. In some situations other than the assessment of QOL this categorization is more straightforward. In particular, in situations where there is a chronological order, with the causal variables necessarily preceding the indicator variables, then the categorization is clear. Thus, for example, we might define a `propensity to respond well to medication' variable in terms of initial characteristics of the individual (age, sex, previous illnesses, and so on) as well as records of how well the patient has responded to other medications in the past. Here the initial characteristics will correspond to our causal variables, and the previous responses to indicator variables.

10. Identification of causal and indicator variables


Although we have labelled variables as either causal or indicator, in fields such as QOL research many items will be of intermediate type. For example, we have described pain as causal and depression as indicator. In practice, depression causes patients to become more sensitive to pain, such that for a given pain stimulus they will report greater levels of pain; conversely, pain induces further depression. Thus pain is often described as having two components (e.g. Melzack and Wall (1982)): the original sensation (due to intensity, location, etc.) and a reactive component (due to personality, emotional state, etc.). Clinicians know that treating either pain or depression will alleviate both.

Variables may even shift roles over time. For example, cytotoxic chemotherapy for cancer commonly induces nausea and vomiting. Some cancer patients who have experienced these problems after their initial course of treatment may start vomiting before the administration of a subsequent course of treatment. Here an initially apparently causal variable (for QOL) has acquired an indicator aspect—although it is arguably still causal, as anticipatory vomiting may reduce QOL yet further. Thus there may frequently be uncertainty and ambiguity about the precise role of variables in QOL assessment. Disease- or treatment-related symptom clusters are likely to be predominantly causal, but it may be less clear whether psychological and other items are mainly causal or indicator in nature. However, any QOL variable that is largely causal will be subject to treatment- or disease-related correlations, and so psychometric techniques will not be applicable. Psychometric techniques apply only to indicator variables, for which the correlations arise solely by virtue of their relationship to the latent variable, QOL.

How can we identify causal variables? We do not have an estimate of the true value for the latent variable—for that is the purpose of constructing our scale to start with—and we have postulated that the usual psychometric estimation procedures are unreliable. Fayers et al. (1997) overcame this problem by arguing that the seven-point global question `How would you rate your overall quality of life during the past week?' provided a valid surrogate assessment of the latent variable QOL and compared individual items against this. A graphical method was used to examine whether high levels of individual item responses `caused' poor QOL, and whether conversely poor QOL was associated with high levels of the same items. Evidence of an asymmetrical relationship was taken as suggestive of component causes.
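The following sketch conveys the flavour of such an asymmetry check; it is not the original graphical method, and the data frame, column names and cut points are hypothetical (higher values mean a worse symptom level and worse global QOL).

```python
import pandas as pd

# Sketch of an asymmetry check of this kind (not the original graphical
# method; the data, column names and cut points are hypothetical, with
# higher values meaning worse symptom level and worse global QOL).
def asymmetry(df, symptom_cut, qol_cut):
    severe = df["symptom"] >= symptom_cut
    poor = df["global_qol"] >= qol_cut
    p_poor_given_severe = poor[severe].mean()   # severe symptom -> poor QOL?
    p_severe_given_poor = severe[poor].mean()   # poor QOL need not imply the symptom
    return p_poor_given_severe, p_severe_given_poor

df = pd.DataFrame({"symptom":    [4, 4, 3, 1, 1, 2, 1, 4],
                   "global_qol": [6, 5, 6, 6, 2, 3, 1, 7]})
print(asymmetry(df, symptom_cut=3, qol_cut=5))   # e.g. (1.0, 0.8): an asymmetric pattern
```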

Another consequence of causal relationships that we can exploit is that inter-item correlations do not reflect the relationships with the latent variable. Thus in many situations causal variables will give rise to seemingly inexplicable factor structures.

10.1. Example

In example 3, Fayers and Hand (1997) showed that factor analysis of QOL data yielded an apparently strange combination of items as a single factor—strange in that the items did not make clinical sense as interrelated symptoms. They were merely common effects of the therapy being used on this particular group of patients. Other forms of therapy, with different side-effects, would have produced different combinations of these causal items. But it is difficult to use `strangeness' as a measure of whether variables are causal.

10.2. Tests for causal variables

For a more formal test, we can make use of the fact that causal variables are not parallel items and therefore lack conditional independence. Thus for QOL we might expect that the treatment or severity of disease would remain important determinants of the level of response for causal variables, and that this would result in correlations between the items, even after conditioning for the overall level of QOL (or a subscale of QOL, if a multidimensional model for QOL is used). This points towards the use of conditional logistic regression, or of psychometric methods for detecting differential item functioning (DIF). The concept of DIF is that the supposedly parallel items in a scale should, for any given value of the latent variable, be independent of each other and should be equally good indicators of the underlying factor. In education, for example, no single test item ought either to favour or discriminate against pupils of a particular gender or race; if it does, it is said to suffer from DIF and is a biased test item. Causal variables will exhibit DIF with respect to an extraneous factor, namely the one that they reflect—treatment or severity of disease in our case—and will not serve purely as parallel indicators of the target latent variable, QOL. A whole battery of tests has been developed for assessing DIF (Holland and Wainer, 1993), and our initial findings are that these tests provide a very effective and sensitive method for detecting variables that are wholly or partially causal in nature.
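As one concrete possibility, the sketch below uses simulated data and a logistic-regression formulation of DIF; it is an illustration only, not any specific test from the battery cited above.

```python
import numpy as np
import statsmodels.api as sm

# Sketch of a logistic-regression check for DIF (one of many possibilities;
# see Holland and Wainer, 1993). Simulated data: the item depends on the
# treatment group as well as on the latent level, so it behaves like a causal
# item. A clear group effect after conditioning on the rest-score (a proxy
# for the latent QOL level) flags possible DIF.
rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)                         # 0/1 treatment indicator
latent = rng.normal(size=n)                           # unobserved QOL level
rest_score = latent + rng.normal(scale=0.5, size=n)   # score on the remaining items
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-(latent + 1.5 * group))))

X = sm.add_constant(np.column_stack([rest_score, group]))
fit = sm.Logit(item, X).fit(disp=0)
print(fit.params, fit.pvalues)   # a small p-value for the group term suggests DIF
```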

Tests for DIF can indicate variables that are not performing well as purely indicator variables, but no such data-based test can conclusively prove that a variable is causal. Bollen (1989) acknowledged that distinguishing these variables `can be troublesome'. Sometimes temporal changes may be informative; if a change in pain levels is observed to precede changes in QOL, it might imply that pain is causative. The difficulty in detecting such temporal effects is obvious. Therefore, the most convincing approach would seem to be a `thought test'. Imagine a large increase in pain. Is it likely that QOL would be affected? Now imagine a patient starting with a good QOL. If their QOL deteriorates markedly, do we expect them necessarily to be suffering physical pain? Do we expect their pain score to reflect the change in their QOL? The answer is surely no; they might have other reasons for their poor QOL. Therefore, pain is best thought of as a causal variable. Furthermore, for many items the whole reason for including them in a QOL scale is precisely because the clinician or patient intuitively realizes that they are consequences of the treatment and are essentially causing an impact on QOL.

For QOL scales, the seemingly naïve thought test will frequently suffice to distinguish indicator variables from those that are predominantly causal. Since the fundamental assumptions for psychometric methods are violated by causal variables, we suggest that one should err on the side of caution and treat as causal those variables whose causal or indicator nature remains uncertain after the thought test.

11. Discussion


QOL is affected by causal variables and the level of QOL manifests itself in indicator variables. However, standard approaches to QOL measurement have been predominantly based on psychometric approaches, which implicitly assume only the latter kind of variables. This leads to either (Likert) summated scales or, more recently, scales based and scored on IRT principles. The principal focus of methods of construction and assessment has been the correlation structure of the data—which is inappropriate for causal variables. Gill (1995) has commented

`Quality of life, despite some promising recent efforts, currently lacks suitable strategies for its assessment. The absence of instruments suitable for measuring quality of life can be explained, in part, by two distinct but related phenomena: a slavish devotion to psychometric, as opposed to clinimetric, techniques and a failure to recognise the fundamental importance of patients' individual values and preferences.'

We suggest that an appreciation of the two types of variable, indicator and causal, will enable developers of scales to identify appropriate methods for the selection of items, the construction of scales and the validation of scales. The two principal approaches, psychometric (whether traditional or IRT) and clinimetric, both have important roles, but the choice of method should depend on the items comprising the scale. Uncritical application of one method alone, whichever is chosen, is likely to yield a very different scale from that produced by the other, and in either case the result may be inadequate. Also, it is disturbing to note that anyone developing a scale by using traditional methods would remain blissfully unaware that they may be omitting important items or including inappropriate items. Practical details of how to apply both psychometric methods and the clinimetric approach to the development and validation of a scale are contained in Fayers and Machin (2000).

Generic (non-disease-specific) instruments for assessing QOL may be less affected by causal variables than those that are intended to be disease specific. The latter by definition tend to focus specifically on the symptoms that are associated with a particular disease, and on the commoner side-effects of therapies. Thus disease-specific QOL instruments can be expected to contain a high proportion of causal items. For these instruments, or at least for their causal items and subscales, a clinimetric approach should be used. Psychometric methods, although much used in the past, are not appropriate and lead to instruments that are either suboptimal or invalid.

The distinction between indicator and causal variables seems sometimes almost to have been recognized in the QOL literature: subsets of variables are often combined to yield subscales, and clinicians intuitively prefer to keep these subscales distinct. The EORTC QLQ-C30 instrument, for example, is designed to measure aspects of QOL and thus contains five multi-item functioning scales and three multi-item symptom scales; its scoring instructions recommend that these items should not be combined to give a score for overall QOL (Fayers et al., 2001). Psychometric methods can be applied within subsets of indicator variables, as we have suggested above, but the subsets should not be combined with causal variables by using the same method.

This is all very well, but there is clearly a need to supplement the reporting of individual items and subscales with a summary measure that may be regarded as representative of overall QOL. A simple linear sum of causal variables may be useful in some limited situations—for example, a total of all serious symptoms might be regarded as a measure of `symptom burden'. However, it cannot be regarded as a measure of overall QOL. An obvious approach is, of course, simply to ask the patient. Many instruments do contain a global question such as `How would you rate your overall quality of life during the past week?' (Fayers et al., 2001), and it has been advocated that all QOL instruments should (Gill and Feinstein, 1994). Global questions allow patients to perform their own weighting and integration of aspects of QOL. The question might be asked in the context of a structured interview, which reminded the patients of the items to be taken into consideration when giving their answer. A third alternative is to devise a scale based entirely on items that are indicators of QOL and are not causal.

Finally, we have speculated about a fourth alternative, which permits the possibility of combining causal with indicator items in a proper manner. It goes without saying that this proposal has weaknesses, and many of its aspects remain to be worked out. In particular, matters are complicated by the fact that it will not always be easy to characterize variables as either uniquely indicator or uniquely causal; as we have shown, some may be partly indicator and partly causal. Despite such shortcomings, we believe that our proposal explicitly recognizes and permits the combination of the two qualitatively distinct kinds of measure in a way that appears not to have been suggested before. We also hope that it will stimulate further discussion and in doing so lead to further development of the field.

References

  • 1
    Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N. J., Filiberti, A., Flechtner, H., Fleischman, S. B., De Haes, J. C. J. M., Kaasa, S., Klee, M. C., Osoba, D., Razavi, D., Rofe, P. B., Schraub, S., Sneeuw, K., Sullivan, M. and Takeda, F. (1993) The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J. Natn. Cancer Inst., 85, 365376.
  • 2
    Apgar, V. (1953) A proposal for a new method of evaluation of the newborn infant. Anesth. Analg., 32, 260267.
  • 3
    Bartholomew, D. J. (1987) Latent Variable Models and Factor Analysis. London: Griffin.
  • 4
    Basilevsky, A. (1994) Statistical Factor Analysis and Related Methods. New York: Wiley.
  • 5
    Bjordal, K., De Graeff, A., Fayers, P. M., Hammerlid, E., Van Pottelsberghe, C., Curran, D., Ahlner-Elmqvist, M., Maher, E. J., Meyza, J. W., Brédart, A., Söderholm, A. L., Arraras, J. I., Feine, J. S., Abendstein, H., Morton, R. P., Pignon, T., Huguenin, P., Bottomley, A. and Kaasa, S. (2000) A 12-country field study of the EORTC QLQ-C30 (version 3.0) and the head and neck cancer specific module (the EORTC QLQ-H&N35) in head and neck patients. Eur. J. Cancer, 36, 17961807.
  • 6
    Bjordal, K., Hammerlid, E., Ahlner-Elmqvist, M., De Graeff, A., Boysen, M., Evensen, J. F., Björklund, A., De Leeuw, R. J., Fayers, P. M., Jannert, M., Westin, T. and Kaasa, S. (1999) Quality of life in head and neck cancer patients: validation of the EORTC Quality of Life Questionnaire-H&N35. J. Clin. Oncol., 17, 10081019.
  • 7
    Blalock, H. M. (1982) Conceptualization and Measurement in the Social Sciences. Beverley Hills: Sage.
  • 8
    Bollen, K. A. (1989) Structural Equations with Latent Variables. New York: Wiley.
  • 9
    Browne, J. P., O'Boyle, C. A., McGee, H. M., McDonald, N. J. and Joyce, C. R. B. (1997) Development of a direct weighting procedure for quality of life domains. Qual. Life Res., 6, 301309.
  • 10
    Cella, D. F., Dineen, K., Arnason, B., Reder, A., Webster, K. A., Karabatsos, G., Chang, C., Lloyd, S., Steward, J. and Stefoski, D. (1996) Validation of the functional assessment of multiple sclerosis quality of life instrument. Neurology, 47, 129139.
  • 11
    Cox, D. R., Fitzpatrick, R., Fletcher, A. E., Gore, S. M., Spiegelhalter, D. J. and Jones, D. R. (1992) Quality-of-life assessment: can we keep it simple? J. R. Statist. Soc. A, 155, 353393.
  • 12
    Cronbach, L. J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297334.
  • 13
    Dawes, R. M. (1979) The robust beauty of improper linear models. Am. Psychol., 34, 571582.
  • 14
    Dunn, G. (1989) Design and Analysis of Reliability Studies. London: Arnold.
  • 15
    Dunn, G., Everitt, B. S. and Pickles, A. (1993) Modelling Covariances and Latent Variables Using EQS. London: Chapman and Hall.
  • 16
    Dunn, G., Sham, P. C. and Hand, D. J. (1993) Statistics and the nature of depression. J. R. Statist. Soc. A, 156, 6387.
  • 17
    Fayers, P. M., Aaronson, N. K., Bjordal, K., Groenvold, M., Curran, D. and Bottomley, A. (2001) EORTC QLQ-C30 Scoring Manual, 3rd edn. Brussels: European Organization for Research and Treatment of Cancer.
  • 18
    Fayers, P. M., Groenvold, M., Hand, D. J. and Bjordal, K. (1998) Clinical impact versus factor analysis for quality of life questionnaire construction. J. Clin. Epidem., 51, 285286.
  • 19
    Fayers, P. M. and Hand, D. J. (1997) Factor analysis, causal indicators, and quality of life. Qual. Life Res., 6, 139150.
  • 20
    Fayers, P. M., Hand, D. J., Bjordal, K. and Groenvold, M. (1997) Causal indicators in quality of life research. Qual. Life Res., 6, 393406.
  • 21
    Fayers, P. M. and Machin, D. (2000) Quality of Life: Assessment, Analysis and Interpretation. Chichester: Wiley.
  • 22
    Feinstein, A. R. (1987a) Clinimetric perspectives. J. Chron. Dis., 40, 635640.
  • 23
    Feinstein, A. R. (1987b) Clinimetrics. New Haven: Yale University Press.
  • 24
    Gill, T. M. (1995) Quality of life assessment: values and pitfalls. J. R. Soc. Med., 88, 680682.
  • 25
    Gill, T. M. and Feinstein, A. R. (1994) A critical appraisal of the quality of quality-of-life measurements. J. Am. Med. Ass., 272, 619626.
  • 26
    Guttman, L. (1944) A basis for scaling quantitative data. Am. Sociol. Rev., 9, 139150.
  • 27
    De Haes, J. C. J. M., Olschewski, M., Fayers, P. M., Visser, M. R. M., Cull, A., Hopwood, P. and Sanderman, R. (1996) The Rotterdam Symptom Checklist (RSCL): a Manual. Groningen: Northern Centre for Healthcare Research.
  • 28
    Haley, S. M., McHorney, C. A. and Ware, J. E. (1994) Evaluation of the MOS SF-36 physical functioning scale(PF-10): I, Unidimensionality and reproducibility of the Rasch item scale. J. Clin. Epidem., 47, 671684.
  • 29
    Hambleton, R. K., Swaminathan, H. and Rogers, H. J. (1991) Fundamentals of Item Response Theory. Thousand Oaks: Sage.
  • 30
    Hand, D. J. (1996) Statistics and the theory of measurement (with discussion). J. R. Statist. Soc. A, 159, 445492.
  • 31
    Hermanek, P. and Sobin, L. H. (eds) (1992) TNM Classification of Malignant Tumours: UICC International Union Against Cancer, 4th edn. Berlin: Springer.
  • 32
    Hickey, A. M., Bury, G., O'Boyle, C. A., Bradley, F., O'Kelly, F. D. and Shannon, W. (1996) A new short form individual quality of life measure (SEIQoL-DW): application in a cohort of individuals with HIV/AIDS. Br. Med. J., 213, 2933.
  • 33
    Holland, P. W. and Wainer, H. (eds) (1993) Differential Item Functioning. Hillsdale: Erlbaum.
  • 34
    Juniper, E. F., Guyatt, G. H., Streiner, D. L. and King, D. R. (1997) Clinical impact versus factor analysis for quality of life questionnaire construction. J. Clin. Epidem., 50, 233238.
  • 35
    Kaplan, R. M., Feeny, D. and Revicki, D. A. (1993) Methods for assessing relative importance in preference based outcome measures. Qual. Life Res., 2, 467475.
  • 36
    Likert, R. A. (1952) A technique for the development of attitude scales. Educ. Psychol. Measmnt, 12, 313315.
  • 37
    Lord, F. M. and Novick, M. R. (1968) Statistical Theories of Mental Test Scores. Reading: Addison-Wesley.
  • 38
    Marx, R. G., Bombardier, C., Hogg-Johnson, S. and Wright, J. G. (1999) Clinimetric and psychometric strategies for development of a health measurement scale. J. Clin. Epidem., 52, 105111.
  • 39
    McHorney, C. A., Haley, S. M. and Ware, J. E. (1997) Evaluation of the MOS SF-36 physical functioning scale (PF-10): II, Comparison of relative precision using Likert and Rasch scoring methods. J. Clin. Epidem., 50, 451461.
  • 40
    McIver, J. P. and Carmines, E. G. (1981) Unidimensional Scaling. London: Sage.
  • 41
    McKenna, S. P., Hunt, S. M. and McEwen, J. (1981) Weighting the seriousness of perceived health problems using Thurstone's method of paired comparisons. Int. J. Epidem., 10, 9397.
  • 42
    Melzack, R. and Wall, P. D. (1982) Acute pain in an emergency clinic: latency of onset and descriptor patterns related to different injuries. Pain, 14, 3334.
  • 43
    Molzahn, A. E., Northcott, H. C. and Hayduck, L. (1996) Quality of life of patients with end stage renal disease: a structural equation model. Qual. Life Res., 5, 426432.
  • 44
    Montazeri, A., Milroy, R., Gillis, C. R. and McEwen, J. (1996) Quality of life: perception of lung cancer patients. Eur. J. Cancer A, 32, 22842289.
  • 45
    Nunnally, J. C. and Bernstein, L. H. (1994) Psychometric Theory, 3rd edn. New York: McGraw-Hill.
  • 46
    O'Boyle, C. A., McGee, H. M., Hickey, A. M., O'Malley, K. M. and Joyce, C. R. B. (1992) Individual quality of life in patients undergoing hip replacement. Lancet, 339, 10881091.
  • 47
    Olsen, J. (1993) Some consequences of adopting a conditional deterministic causal model in epidemiology. Eur. J. Publ. Hlth, 3, 204209.
  • 48
    Prieto, L., Alonson, J., Viladrich, M. C. and Anto, J. M. (1996) Scaling the Spanish version of the Nottingham Health Profile: evidence of limited value of item weights. J. Clin. Epidem., 49, 3138.
  • 49
    Rackham, H. (1926) Nichomachean Ethics, book 1, part iv. London: Heinemann.
  • 50
    Raczek, A. E., Ware, J. E., Bjorner, J. B., Gandek, B., Haley, S. M., Aaronson, N., Apolone, G., Bech, P., Brazier, J. E., Bullinger, M. and Sullivan, M. (1998) Comparison of Rasch and summated rating scales constructed from SF-36 physical functioning items in seven countries: results from the IQOLA project. J. Clin. Epidem., 51, 12031214.
  • 51
    Romney, D. M. and Evans, D. R. (1996) Toward a general model of health-related quality of life. Qual. Life Res., 5, 235241.
  • 52
    Romney, D. M., Jenkins, C. D. and Bynner, J. M. (1992) A structural analysis of health related quality of life dimensions. Hum. Relns, 45, 165175.
  • 53
    Rothman, K. J. (1976) Causes. Am. J. Epidem., 104, 587592.
  • 54
    Rothman, K. J. (1986) Modern Epidemiology. Boston: Little, Brown.
  • 55
    Ryan, M. and Farrar, S. (2000) Using conjoint analysis to elicit preferences for health care. Br. Med. J., 320, 15301533.
  • 56
    Spearman, C. (1904) General intelligence objectively determined and measured. Am. J. Psychol., 15, 201293.
  • 57
    Streiner, D. L. and Norman, G. R. (1995) Health Measurement Scales: a Practical Guide to Their Development and Use, 2nd edn. Oxford: Oxford University Press.
  • 58
    Stucki, G., Daltroy, L., Katz, J. N., Johannesson, M. and Liang, M. H. (1996) Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not equal the sum of the parts. J. Clin. Epidem., 49, 711717.
  • 59
    Torrance, G. W., Feeny, D. H., Furlong, W. J., Barr, R. D., Zhang, Y. and Wang, Q. (1996) Multiattribute utility function for a comprehensive health status classification system: Health Utilities Index Mark 2. Med. Care, 34, 702722.DOI: 10.1097/00005650-199607000-00004
  • 60
    Tugwell, P., Bombardier, C., Buchanon, W. W., Goldsmith, C. H., Grace, E. and Hanna, B. (1987) The MACTAR patient preference disability questionnaire—an individualised functional priority approach for assessing improvement in physical disability in clinical trials in rheumatoid arthritis. J. Rheum., 14, 446451.
  • 61
    Wainer, H. (1976) Estimating coefficients in linear models: it don't make no nevermind. Psychol. Bull., 83, 213–217.
  • 62
    Ware, Jr, J. E., Snow, K. K., Kosinski, M. and Gandek, B. (1993) SF-36 Health Survey Manual and Interpretation Guide. Boston: New England Medical Care Centre.
  • 63
    Wright, J. G. and Feinstein, A. R. (1992) A comparative contrast of clinimetric and psychometric methods for constructing indexes and rating-scales. J. Clin. Epidem., 45, 12011218.

Discussion on the paper by Fayers and Hand


D. J. Bartholomew (Stoke Ash)

Statistics depends crucially on measurement yet statisticians have not been prominent among those actually constructing measures. This has largely been left to practitioners in, for example, psychology, economics and medicine. This paper is therefore particularly welcome because it brings a very important practical measurement problem into the statistical arena. As the authors make clear, this is an extremely complex and subtle problem and I shall have to simplify things greatly to make my point. Essentially, I wish to argue that, although the distinction between indicator and causal variables is fundamental, the former are more directly relevant to constructing scales of measurement.

An idealized version of the problem of measuring quality of life (QOL) is shown in Fig. 1 where z represents QOL; its assumed relationship to y (the indicators) and w (the causal variables) is indicated by the arrows. The region above the line represents the real world of observable quantities. Here we find both the causal variables and the indicators. The region below the line represents the world constructed by statisticians and it contains unknown parameters, latent variables, normal distributions and so forth. Statistical inference, in its most general terms, involves constructing the lower region from what we observe in the upper region.

Figure 1. Assumed relationship between causal variables, indicators and quality of life

It is easy to see from Fig. 1 why factor analysis does not work when applied to a mixture of causal and indicator variables. In the factor model all arrows go upwards from z. Thus any correlation among the causal variables will be treated as arising from z whereas, in reality, it may not be related to the latent variable at all. We may therefore easily obtain `strange' factors emerging as the authors have found. Furthermore, the ys are also indicators of the ws acting indirectly through z. Even if there happens to be no correlation at all between the ws they will still appear in the factor scores for the first factor because of their correlation with the ys. (The situation is not unlike that in regression, where we often find that the weights carried by predictors in the regression equation do not match our judgment of their real importance.)

That being so, we come to the authors' question: how should we combine causal and indicator variables in a single measure? I shall adopt the perspective of the general approach to social measurement that is set out in Bartholomew (1996) and Bartholomew and Knott (1999). This argues that we must start from a statistical model embracing all observed and latent variables and then derive a measure by using the posterior distribution of the latent variable given the indicators. This is a natural approach for a statistician but closely parallels much of the traditional psychometric treatment that is mentioned in the paper.

Such a model must specify the joint distribution of y, z and w. In the absence of w this would be done using the conditional distribution of y given z and the prior distribution of z. The natural way to specify the effect of the causal variables on z would be through the conditional distribution of z given w. However, although the posterior distribution of z given y and w can be found from these three distributions, there appears to be a problem with the scaling of z. This approach therefore poses problems for combining the indicators and causal variables into a single measure. But this difficulty can be circumvented if, as I have suggested, we regard the indicators as already taking account of the causal variables through the mediating effect of the latent variable.
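In symbols, a minimal sketch of this structure, assuming a single latent variable z and the conditional distributions named above (not Bartholomew's exact notation), is

```latex
% Minimal sketch with a single latent variable z, indicators y and causal
% variables w, using the conditional distributions named above:
\[
  f(\mathbf{y}, z \mid \mathbf{w}) = f(\mathbf{y} \mid z)\, f(z \mid \mathbf{w}),
  \qquad
  f(z \mid \mathbf{y}, \mathbf{w}) \propto f(\mathbf{y} \mid z)\, f(z \mid \mathbf{w}).
\]
```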

This proposal is not as abstruse as the formal approach may suggest; it is quite familiar in other fields. Consider, for example, driving ability. Many causal variables contribute to a person's driving ability—age, eyesight, response times, physique, knowledge of the Highway Code etc.—none of these are caused by driving ability. Their effect on whether one passes a driving test is real but no direct account is taken of them. All that counts is how the driver actually performs when tested. Would those who want to include causal variables in measures of QOL also wish to include them in the driving test? I deduce from what the authors say that this conclusion will not be palatable to some working in the medical field. Nevertheless, it is where the logic of the problem seems to lead.

The paper is full of interesting ideas and perceptive observations. I hope that the authors will continue their collaboration which, I am sure, will lead to greater understanding between statisticians and those at the sharp end of empirical research. I have much pleasure in proposing a very warm vote of thanks.

David R. Jones (University of Leicester)

President, speakers, Fellows, guests: I would like to begin with the quotation

`There are difficulties inherent in almost all aspects of the definition, collection and analysis of data about quality of life in patients undergoing treatment … [but] … such data describe an important aspect of treatment and should not be ignored merely on the grounds of the practical and theoretical difficulties involved'.

Although it is not witty, I think that it is wise, but I should immediately declare an interest, since it is a product of a previous collaboration with Peter Fayers (Fayers and Jones, 1983)! He has wisely found a new partner or two since then and begun to tackle some of these difficulties.

This paper is valuable in providing us with a clear map of and lexicon for the nature of variables and scales, with particular relevance to the assessment of health-related quality of life (HRQOL). Psychometric and clinimetric approaches are distinguished, and the roles and implications of causal and indicator variables in the development and testing of instruments set out. Professor Bartholomew has expressed doubts about this formulation, but it will at least be widely agreed, I think, that a putative causal structure needs to be kept in mind throughout the process of the development and analysis of scales.

Rather less is in fact said by the authors about the implementation and use of the instruments, or about implications for analysis. I would thus like to consider briefly, using the authors' terminology, some implications for medical statisticians undertaking tasks other than the primary development of scales. These include the following.

(a) Selection of instruments for use in particular contexts: in this role we shall need to know how the scale was constructed originally, and in particular to reach a reasonably clear understanding of which variables were originally considered causal and which indicator variables, so that the relevance of the instrument in the new context, wherein both component cause and latent variables may have changed from the original, may be assessed. If this is to be achieved, the reporting of initial scale and instrument development needs to be clearer and more explicit than it often is in practice.

(b) Development and validation of modifications of instruments: the modification of existing instruments by the addition of items relevant to symptoms or other aspects of HRQOL specific to the new disease context in which it is proposed to use an existing instrument is common practice. The key need will be to identify sufficient component causal variables for this new aspect of HRQOL. Thereafter, the original scale will need to be understood, as above, so that, for example, causal and indicator variables are not unwittingly combined. Issues of the analysis of the modified scale, particularly of weighting, will also frequently arise.

(c) Analysis of HRQOL data: the choice of weights βj for causal factors may indeed be crucial. In principle, the use of weights based on individual patients' own preferences and utilities is very desirable, but it raises problems not only of elicitation but also of analysis and interpretation of heterogeneous composite measures, except possibly at an aggregated global HRQOL measure level. The resurrection of Blalock's approach as a preferable alternative to (weighted or unweighted) summation of items for a collection of causal variables is a valuable recommendation.

However, an important pragmatic reservation needs to be expressed. It will be apparent that, if the implications of the distinction between variable types are to be followed through, the identification of, say, causal variables is essential. I am somewhat disappointed, therefore, that ultimately `the seemingly naïve thought test' for distinguishing causal from indicator variables is the main proposal for this identification step, at least in HRQOL applications. If a test such as this has good properties, its simplicity is certainly no disadvantage, but both examples cited in the paper and our wider experience suggest that simply advocating that researchers ask themselves whether each variable is plausibly causal or not is not reliable. The alternative strategy of comparisons against answers to a global HRQOL question inevitably invites enquiry about the basis for validation of that question or the gain from using a more complex instrument. If the authors' formulation is to be effectively employed, more adequate operational implementations of the suggested strategies are essential.

So, not all the difficulties of HRQOL measurement initially noted have of course yet been resolved. None-the-less, this paper is a useful and stimulating contribution. I am, therefore, pleased to congratulate the authors on their paper, and to second the proposal of the vote of thanks.

The vote of thanks was passed by acclamation.

D. R. Cox (Nuffield College, Oxford)

The paper makes a valuable contribution to an important topic. It seems a pity, however, to use the word causal in the way done here; the word has such strong implications especially in a medical and epidemiological setting.

A key issue is that in, say, an educational context individual items have no intrinsic interest. An answer to a question on algebra is valuable only to throw light on the student's knowledge of algebra and to help to predict performance in related contexts. Most of the items in health-related quality-of-life studies are, though, of intrinsic concern, especially for the management of individual patients. Even for, say, the comparisons in a randomized trial, this raises the issue why introduce at all factor analysis, the closely associated Rasch model, latent variables and so on? Indeed if there were only two or three items in a particular dimension there would be no need to do so. If, however, there are many items, the factor analysis representation is a convenient way of expressing the ideas that we expect all the items in a dimension to move systematically in the same direction and that there will be positive correlation in the random variability. There is no necessity to think in terms of latent variables. It is, however, as the authors explicitly discuss, important to supplement such overall comparisons of treatment groups by checks that there are no items that are incompatible with any systematic difference found. The important recent work of Svend Kreiner on differential item response in the Rasch model, with its links to graphical Markov models, deserves special mention.

At one level, however, all that is needed is this. If there is, say, a difference in mean score favouring A over B in a particular dimension, look at the individual item mean differences to see whether any appear to favour B. Under some admittedly relatively strong assumptions, limits of error can be placed on that mean difference as some protection against choosing that item which, by chance, happens to appear most anomalous.
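A minimal sketch of this item-by-item check follows (hypothetical data; the 1.96 limits correspond to the admittedly strong assumptions mentioned).

```python
import numpy as np

# Sketch of the item-by-item check described above (hypothetical data; the
# 1.96 limits rest on the admittedly strong assumptions mentioned). 'a' and
# 'b' are (patients x items) score matrices for treatment arms A and B.
def item_differences(a, b):
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return diff, diff - 1.96 * se, diff + 1.96 * se

rng = np.random.default_rng(1)
a = rng.normal(loc=[0.5, 0.4, 0.0], scale=1.0, size=(60, 3))   # arm A, 3 items
b = rng.normal(loc=[0.0, 0.0, 0.3], scale=1.0, size=(60, 3))   # arm B
for d, low, high in zip(*item_differences(a, b)):
    print(f"{d:+.2f}  [{low:+.2f}, {high:+.2f}]")   # the third item may appear to favour B
```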

Niels Keiding (University of Copenhagen)

The authors have described many difficulties connected to combining `causal' and `indicator' variables into a single index. In the simplest situation there is one latent variable, quality of life (QOL), influenced by the causal variables and itself governing the distribution of the indicator variables. Professor Bartholomew, in proposing the vote of thanks, outlined this situation and suggested that it is in general irrelevant to include causal variables in an index capturing the latent variables: we need an expression for QOL per se and possible relationships with causal variables are for later inquiry.

Bartholomew's position is in close agreement with standard practice in observational epidemiology, where a considerable effort is being spent in characterizing variables as responses, exposures, confounders or intermediate variables (and, yes, sometimes they are both of the last two possibilities, giving rise to Robins's concept of time-dependent confounders). We also know from epidemiology that this distinction is in principle not a statistical problem, but a postulate by which the subject-matter researcher defines the point from which the data are viewed—and sometimes it is fruitful to conduct several analyses, with different postulated causal directions. The essential point of the intermediate variables is that we may never condition on these when assessing the effect of exposures, because this effect will then be mistakenly diluted.

The general advice to scale constructors on the basis of this culture would seem to be to consider carefully the causal structure from the subject-matter insight that is available and to restrict attention to constructing scales based on variables causally post QOL, called indicator variables by Fayers and Hand.

With this kind of general experience it is difficult to believe that it will ever be desirable (not to mention feasible) to delineate empirically the causal variables from the indicator variables, as Fayers and Hand seem to propose. Not that their proposals are very convincing: certainly Rothman's somewhat overexposed component cause model is sometimes helpful, and then generalized additive models fit poorly, but this is not a characterizing feature of causal variables, which often do act additively on some scale.

J. L. Hutton (University of Warwick, Coventry)

I thank the authors for a thought-provoking paper. Elucidation of different approaches to quality of life is valuable. However, the variety of factors produced for the Rotterdam symptom checklist highlights an issue that was not addressed by them. It is important to know when in a study scales and items are selected for analysis and publication. If a variety of quality-of-life measurements are assessed, and the `most significant' are reported, any subsequent attempt at a systematic review and meta-analysis will be subject to severe bias (Hutton and Williamson, 2000; Hahn et al., 2000).

David Firth (Nuffield College, Oxford)

The Blalock formula

\[ y = 1 - \prod_j (1 - x_j)^{\beta_j} \]

(Blalock (1982), page 105) for combining variables xj whose values are between 0 and 1 has what might be termed a `nil cancellation' property: a high score (close to 1) on, say, x1 results in a high value of y, regardless of the values of {x2,x3,…}. In some contexts this will be undesirable. The extent to which a high score on one variable can be mitigated by low scores on others will often need to be limited, but not nullified.

A generalization is

\[ y = \sum_j \beta_j t_j, \qquad t_j = -\lambda^{-1} \log\bigl\{ 1 - x_j (1 - \mathrm{e}^{-\lambda}) \bigr\} \]

which includes the Blalock formula at one extreme (λ=∞) and yields the direct linear combination

\[ y = \sum_j \beta_j x_j \]

in the limit as λ → 0. One interpretation is that, if the xj are each uniformly distributed on (0, 1), each xj is transformed so as to have the truncated-at-unity exponential(λ) distribution before linear combination using weights {βj}. This yields a continuum of possible cancellation properties, ranging from nil cancellation (λ=∞) to fully `linear' cancellation (λ=0). A recent application was in the construction of indices of local level deprivation in England (Noble et al., 2000) for the Department of the Environment, Transport and the Regions. There the need for explicit weighting and a desire to limit cancellation—which in that context meant the cancellation of one kind of deprivation by lack of deprivations of other kinds—led to the use of a transformation as above with λ=100/23, before weighted combination of measures for six different `domains' of deprivation.
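A minimal sketch of this transformation, as reconstructed from the verbal description above (the published formula may be presented differently), is as follows.

```python
import numpy as np

# Sketch of the transformation as reconstructed from the verbal description
# above (the published formula may be presented differently): each x_j in
# (0, 1) is mapped to the truncated-at-unity exponential(lam) scale before a
# weighted linear combination, with lam governing how far a high score on one
# variable can be cancelled by low scores on the others.
def exp_transform(x, lam):
    return -np.log(1.0 - x * (1.0 - np.exp(-lam))) / lam

def combine(x, beta, lam):
    return float(np.dot(beta, exp_transform(np.asarray(x, dtype=float), lam)))

x, beta = [0.9, 0.1, 0.2], [0.5, 0.3, 0.2]
for lam in (1e-6, 100 / 23, 20.0):   # near-linear; the Noble et al. (2000) value; strong curvature
    print(lam, combine(x, beta, lam))
```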

The following contributions were received in writing after the meeting.

Michael Dewey (Trent Institute for Health Services Research, Nottingham)

In their thought-provoking review Fayers and Hand wonder whether it might be possible or ethical to ask patients `Do you feel life is worth living?'. This question has been asked in community surveys of older people and since it predicts mortality (see Saz and Dewey (2001) for a review) clearly relates to quantity even if not quality of life. One might wonder whether at times it is ethical not to ask it.

Gabrielle Kelly (University College Dublin)

I congratulate the authors on their comprehensive coverage of their topic. I found their approach both innovative and useful.

It led me to reassess a paper that I recently co-authored—Rooney et al. (2001). There we compared two groups of drug users, one on a methadone maintenance programme and the other on a minimization of harm programme in relation to their quality of life (QOL) and psychological status. Here we were examining the type of group as being causal for QOL and psychological status. A minimization of harm programme has by its nature more chaotic clients than a methadone maintenance programme so any group differences may not be due to the services or treatment that they receive. However, the study is still worthwhile. QOL and psychological status were measured by indicator variables. For QOL the UK short form 36 scale health survey questionnaire (Jenkinson et al., 1996) was used, which is divided into scores on eight dimensions of daily life. The dimensions were summarized using effect size as described by Fayers and Hand. Clients on the minimization of harm programme had a poorer perception of their QOL than those in the methadone maintenance group. Psychological morbidity was measured using the hospital anxiety and depression scale (Zigmond et al., 1983). This scale is psychometric in nature. Some psychological morbidity was found in both groups though the minimization of harm group scored worse. The two groups also differed significantly in the severity of drug use. However, the study highlighted the fact that drug use (i.e. both groups) had an associated risk for psychological and physical comorbidity, which has major implications for the development of services and treatment programmes, provision of education programmes and in the planning of suicide preventive strategies. In addition the type of group is causal for QOL and psychological status and this indicates that services should be targeted to the group's needs. Thus in terms of intervention the identification of causal variables is crucial. This interpretation of our paper became clear after reading that of Fayers and Hand.

On a smaller point we preferred to use the term perception of QOL rather than QOL to avoid inconsistencies.

S. C. Morton (RAND Statistics Group, Santa Monica)

The primary example in this stimulating paper brings to mind a parallel, and I would venture equally as difficult, measurement task encountered in health services research. Fayers and Hand focus on measuring quality of life (QOL). A similar challenge is evaluating the quality of care (QOC) offered by health care providers.

Most QOC assessment is based on Donabedian's (1980) model of structure, process and outcome. To paraphrase Brook et al. (1996), structural data encompass physician and hospital characteristics, process variables describe the encounter between the physician and patient, and outcome data address the patient's subsequent health status. Process variables may be most easily collected, perhaps retrospectively from medical records, and thus are often used for an evaluation of QOC. An example might be `do asthmatic patients cared for at this hospital receive a yearly flu shot?'. Structural data are used in concert with process variables; perhaps the hospital organizational system facilitates the delivery of inoculation. In contrast, collecting outcome data such as hospitalization due to a severe episode of asthma brought on by flu might prove difficult and costly. Questions remain with either approach. For example, does a reliance on process measures increase the cost of medical care without an associated improvement in health? Or, can much of the variability in outcomes be explained by characteristics of the patient, which are beyond the reach of the health care system?

If process measures are to be effective as QOC criteria, then their relationship with outcomes must be demonstrated. Emphasis on evidence-based medicine (Sackett et al., 1997) has promoted establishing external models, e.g. by the analysis of prospective data, to use the terminology of Fayers and Hand. Alternatively, when an empirical model cannot be constructed to determine the predictive power of process on health outcomes, the relationship is motivated by a clinimetric approach, generally relying on expert judgment obtained via a group process (Jones and Hunter, 1995).

Several different process measures may be identified with which to assess the QOC delivered. Psychometric principles are often used to determine how to combine these different measures into a single scale. Interestingly, as Fayers and Hand point out has been done in the QOL setting, QOC researchers intuitively leave subscales, such as those containing diagnostic and therapeutic items, distinct.

The wide-ranging applicability of this paper makes it relevant to both QOC and QOL research.

Irini Moustaki (London School of Economics and Political Science)

I would like to draw attention to some latent variable models within the item response theory (IRT) framework that allow for causal effects. Sammel et al. (1997) discussed a factor analysis model for mixed binary and normal indicators that allows for causal effects on the latent variables and Moustaki (2002) has proposed an IRT model for ordinal indicators with causal effects. Furthermore, we can distinguish between covariates that are causal for the latent variables and covariates that account for the associations between the observed indicators together with the latent variables (direct effects).

Factor analysis models that allow for causal and direct effects may be defined as follows. Let us denote by a p×1 vector y the observed indicators which may be of any type (binary, nominal, ordinal or metric) and by a q×1 vector z the latent variables. We allow for a k×1 vector of observed covariates x that directly affect the observed indicators y and also for an s×1 vector w of covariates (causal variables) that affect the vector of latent variables z.

The model is fitted using an EM algorithm to obtain maximum likelihood estimators from the marginal distribution of the indicators. To apply the EM method we need the likelihood of the joint distribution of the observed indicators y and the latent variables z. For a random sample of size n the log-likelihood is written as

\[ L = \sum_{j=1}^{n} \Bigl\{ \sum_{i=1}^{p} \log g(y_{ij} \mid \mathbf{z}_j, \mathbf{x}_j) + \log f(\mathbf{z}_j \mid \mathbf{w}_j) \Bigr\} \]

A detailed discussion of the estimation can be found in Sammel et al. (1997) and Moustaki (2002).

The assumption of conditional independence embedded in the model is that the vectors z and x account for all associations between the observed indicators y. The conditional distribution g(yij | zj, xj) is allowed to take any form from the exponential family. The relationship between the latent variables z and the observed causal covariates w is expressed in a simple linear form:

\[ \mathbf{z} = \Gamma \mathbf{w} + \boldsymbol{\delta} \qquad (2) \]

where z is a q×1 vector, Γ is a q×s matrix of regression coefficients and δ is a q×1 vector of independent standard normal variables. Equation (2) allows for a shift in the location of the latent variables by Γw, but the scale of the latent variables is fixed at 1 to allow for model identification.

In addition, for identifiability, variables x must be different from variables w.

E. M. Scott, A. M. Nolan, J. Reid and J. Fitzpatrick (University of Glasgow)

The principle of the development of a scale or index for quality of life has application beyond the human case and becomes potentially even more difficult when dealing with situations where observers must be used to rate the attribute. Our interest lies in the application of and extensions to human quality of life required to develop clinical pain scales and welfare indices successfully in both companion and farm animal settings. Such scales and indices are vital in the veterinary setting given the increasing public concern about animal welfare and the trend in the clinical setting towards controlled trials, e.g. for efficacy of analgesic drugs. The potential set of items under consideration includes behavioural, physiological and biochemical markers which are quite diverse, representing the different aspects of the phenomena under study. There is no standard and in our research we have explored internal scales by using a potentially heterogeneous set of items. For the clinical acute pain setting, we have developed a questionnaire with eight categories (all behavioural, since in a previous study we had shown that the physiological parameters did not correlate with pain state (Holton et al., 1998)) and a total of 37 items (Holton et al., 2001). All these items are indicator variables and we have used Thurstone's matched pairs approach to evaluate the weights for each. For the clinical chronic pain setting, we have used a questionnaire with over 100 items, with the owner completing the questionnaire by indicating degree of agreement with each item by using a seven-point Likert scale (Wiseman et al., 2001). For the welfare scale, the variables to be incorporated include husbandry, behavioural and physiological or biological measures (Scott et al., 2001). Thus only in the welfare setting have we both causal and indicator variables (and indeed the causal variables may be sufficient cause for poor welfare, e.g. the presence of disease or inadequate husbandry). The relative importance of the individual components is assessed from an elicitation of expert opinion. In each case, we aim to create a single composite measure but it is clear that the approaches taken to achieve this in the pain and welfare arenas must differ.

Validation (including reliability) is vital and extremely difficult to demonstrate; practical utility requires acceptable interobserver reliability or generalizability over observers. In our situation, the patient cannot give an assessment of their own state, so the use of other raters is not `dubious' but is the only solution. Interestingly, there is a move to consider in some settings the use of self-selection (e.g. in providing pain relief), but this area still requires significant development before it becomes an accepted alternative to the use of raters in the veterinary setting.

The authors replied later, in writing, as follows.

We are grateful for the stimulating comments made about our paper.

David Bartholomew's model is a nice encapsulation of the structure that we are promoting and, at first glance, his driving test analogy is a compelling illustration of why only the indicator variables should be included in the summary measure. However, this example is based on the assumption that the model is well specified and, in particular, that it includes all aspects of the latent variable among the indicator variables. Although this is an ideal towards which we should strive, we suggest that it will generally be optimistic to suppose that it has been achieved.

To illustrate this point, we continue with Bartholomew's driving test analogy and imagine someone with weak eyesight and poor hand–eye co-ordination, both of which might be regarded as causal for driving ability. Let us begin with an extreme case, and suppose that nothing else has been measured—in particular, that no indicator variables have been measured. In these circumstances might we not decide that this person's driving ability would probably be poor—and perhaps construct a clinimetric driving ability score by combining eyesight and co-ordination scores in some way?

Now let us relax this extreme example. Suppose that a driving test has also been administered, but that this test consists merely of the ability to reverse into a parking slot—so that the score on this test is an indicator variable. It seems clear that this test, although providing additional information about driving ability, does not subsume all the information contained in the two `causal' variables of eyesight and co-ordination. A more elaborate driving test will begin to cover such information—but how do we decide that it has all been covered? How do we decide that the causal variables contribute nothing beyond the information provided in the test (indicator) variables? We would need to be very confident that this was the case before attributing identical driving abilities to two people who had identical driving test scores but different causal variable scores.
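One way to probe this question empirically, assuming that some external criterion of driving ability is available, is a partial F-test of whether the causal variables improve prediction once the indicator variables are already included. The sketch below illustrates the idea on invented data; the variable names (a parking-test score, eyesight and co-ordination scores, an examiner's overall rating) are hypothetical and not part of the original analogy.

```python
import numpy as np
from scipy import stats

def added_value_f_test(criterion, indicators, causal):
    """Partial F-test: do the causal variables improve prediction of the
    criterion once the indicator variables are already in the model?"""
    n = len(criterion)
    X0 = np.column_stack([np.ones(n), indicators])   # indicators only
    X1 = np.column_stack([X0, causal])               # indicators plus causal variables
    rss0 = np.sum((criterion - X0 @ np.linalg.lstsq(X0, criterion, rcond=None)[0]) ** 2)
    rss1 = np.sum((criterion - X1 @ np.linalg.lstsq(X1, criterion, rcond=None)[0]) ** 2)
    df_extra = X1.shape[1] - X0.shape[1]
    df_resid = n - X1.shape[1]
    f = ((rss0 - rss1) / df_extra) / (rss1 / df_resid)
    return f, 1 - stats.f.cdf(f, df_extra, df_resid)

# Invented data: the criterion is an examiner's overall driving rating,
# the indicator is a parking-test score, the causal variables are eyesight
# and hand-eye co-ordination scores.
rng = np.random.default_rng(1)
eyesight, coord = rng.normal(size=200), rng.normal(size=200)
parking = 0.5 * eyesight + 0.5 * coord + rng.normal(scale=0.8, size=200)
rating = eyesight + coord + 0.5 * parking + rng.normal(size=200)
print(added_value_f_test(rating, parking, np.column_stack([eyesight, coord])))
```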

Moreover, as the contributions of Professor Scott, Dr Nolan, Dr Reid and Dr Fitzpatrick, and of Dr Morton, illustrate, situations do arise where it may be difficult to tap directly into indicator variables for the attribute being measured.

We entirely endorse David Jones's appeal for clearer reporting of the process and rationale underlying the development of scales. All too often, new scales are created with little explicit justification, validation or theoretical base. Progress in the theoretical understanding of, and therapeutic attacks on, medical problems coexists with advances in measurement technology, as it does in other areas of science. Explicit reporting is especially important when existing scales are adapted for special purposes. Making the choice of structure and weights explicit is essential if others are to evaluate the choices, and perhaps to adapt them for their own purposes.

As usual, David Cox puts his finger on a key feature of the ideas, and one that may not be readily apparent. A fundamental difference between quality-of-life (QOL) measurement and the phenomena measured in psychometric and educational situations is the importance of the individual items in the former, but the lack of interest in the individual items in the latter. This is the essence of clinimetric measurements, in that they seek to summarize multiple attributes by using a single index.

We agree with Niels Keiding that we need an expression for QOL per se, so that we can then investigate possible relationships with causal variables, but our concern in this paper is how to formulate that initial expression for QOL. Perhaps it was inappropriate, as David Cox suggests, for us to use the adjective `causal' to describe those aspects of QOL which are not indicators of it. Perhaps an alternative would be the word `definitional' rather than causal: if the score on such a variable is high, then there is a higher probability that the QOL score will be low.

We are grateful to David Firth for his generalization of the Blalock model and for the illustration of its use, and to Michael Dewey for his example of a situation where a direct question relating to QOL has been asked.

We made no pretence, in this paper, of answering all of the many challenging questions surrounding QOL measurement. Our aim was, rather, to stimulate discussion, and perhaps to move the debate forwards a little. It is therefore particularly pleasing that Gabrielle Kelly found our classification of the variables into two types of practical value.

We were delighted to see that similar issues have arisen in other domains, as Sally Morton points out in her summary of quality-of-care assessment. The process variables measured there, however, seem to be used as alternatives or proxies for the outcomes. A remaining question is whether, in situations where both can be measured, such process measures contribute anything to the overall measure of quality of care beyond those aspects that are captured by the outcomes.

Irini Moustaki's model is a valuable extension of ours, though, at least in our application, it too faces the problem of determining into which of the x-, y- and w-categories each manifest variable falls.

Jane Hutton raises an important point about research studies which use multiple QOL measures, as opposed to clinical applications. We agree that this is an underrated source of potential bias.

References

Bartholomew, D. J. (1996) The Statistical Approach to Social Measurement. San Diego: Academic Press.
Bartholomew, D. J. and Knott, M. (1999) Latent Variable Models and Factor Analysis, 2nd edn. London: Arnold.
Blalock, H. M. (1982) Conceptualization and Measurement in the Social Sciences. Beverley Hills: Sage.
Brook, R. H., McGlynn, E. A. and Cleary, P. D. (1996) Quality of health care: Part 2, measuring quality of care. New Engl. J. Med., 335, 966–970.
Donabedian, A. (1980) Explorations in Quality Assessment and Monitoring, vol. 1, The Definition of Quality and Approaches to Its Assessment. Ann Arbor: Health Administration Press.
Fayers, P. M. and Jones, D. R. (1983) Measuring and analysing quality of life in cancer clinical trials. Statist. Med., 2, 429–446.
Hahn, S., Williamson, P. R., Hutton, J. L., Garner, P. and Flynn, V. (2000) Assessing the potential for bias in meta-analysis due to selective reporting of subgroup analyses within studies. Statist. Med., 19, 3325–3336.
Holton, L., Reid, J., Scott, E. M., Pawson, P. and Nolan, A. (2001) Development of a behaviour based scale to measure acute pain in dogs. Vet. Rec., 145, 525–531.
Holton, L., Scott, E. M., Nolan, A., Reid, J. and Welsh, E. M. (1998) Investigation of the relationship between physiological factors and clinical pain in dogs scored using an NRS. J. Small Anim. Pract., 39, 469–474.
Hutton, J. L. and Williamson, P. R. (2000) Bias in meta-analysis due to outcome variable selection within studies. Appl. Statist., 49, 359–370.
Jenkinson, C., Layte, R., Wright, L. and Coulter, A. (1996) The UK SF-36: an Analysis and Interpretation Manual. A Guide to Health Status Measurement with Particular Reference to the Short Form 36 Health Survey. Oxford: Health Services Research Unit, University of Oxford.
Jones, J. and Hunter, D. (1995) Consensus methods for medical and health services research. Br. Med. J., 311, 376–380.
Moustaki, I. (2002) A general class of latent variable models for ordinal manifest variables with covariate effects on the manifest and latent variables. To be published.
Noble, M., Smith, G., Penhale, B., Wright, G., Dibben, C., Owen, T. and Lloyd, M. (2000) Measuring Multiple Deprivation at the Small Area Level: the Indices of Deprivation 2000. London: Department of the Environment, Transport and the Regions.
Rooney, S., Freyne, A., Kelly, G. and O'Connor, J. (2001) A measurement of the quality of life in two groups of drug users. Ir. J. Psychol. Med., to be published.
Sackett, D. L., Richardson, W. S., Rosenberg, W. and Haynes, R. B. (1997) Evidence-based Medicine: how to Practice and Teach EBM. New York: Churchill Livingstone.
Sammel, R. D., Ryan, L. M. and Legler, J. M. (1997) Latent variable models for mixed discrete and continuous outcomes. J. R. Statist. Soc. B, 59, 667–678.
Saz, P. and Dewey, M. E. (2001) Depression, depressive symptoms and mortality in persons aged 65 and over living in the community: a systematic review of the literature. Int. J. Ger. Psychiatr., 16, 622–630.
Scott, E. M., Nolan, A. M. and Fitzpatrick, J. L. (2001) Conceptual and methodological issues related to welfare assessment: a framework for measurement. Acta Agric. Scand., 30, 5–10.
Wiseman, M. L., Nolan, A. M., Reid, J. and Scott, E. M. (2001) Preliminary study on owner reported behaviour changes associated with chronic pain in dogs. Vet. Rec., 149, 423–424.
Zigmond, A. S. and Snaith, R. P. (1983) Hospital anxiety and depression scale. Acta Psychiatr. Scand., 67, 361–370.