The relationship between self-report and performance-related measures: Questioning the content validity of timed tests

Authors

  • Paul W. Stratford,

    Corresponding author
    1. McMaster University, Hamilton, Ontario, Canada
    • School of Rehabilitation Science, Institute for Applied Health Sciences, McMaster University, 1400 Main Street West, Hamilton, ON, Canada L8S 1C7.
  • Deborah Kennedy,

    1. McMaster University, Hamilton, Ontario, Sunnybrook and Women's College Health Sciences Centre, Toronto, and University of Toronto, Toronto, Ontario, Canada
  • Sonia M. C. Pagura,

    1. University of Toronto, Toronto, Ontario, Canada, and Sunnybrook and Women's College Health Sciences Centre, CITY, Ontario, Canada
  • Jeffrey D. Gollish

    1. Sunnybrook and Women's College Health Sciences Centre, CITY, and University of Toronto, Toronto, Ontario, Canada

Abstract

Objective

To examine the determinants of the modest correlation between self-report and performance-related measures in patients with osteoarthritis of the hip or knee.

Methods

Measures included the Lower Extremity Functional Scale (LEFS), the self paced walk, timed up-and-go, and stair test. Each performance measure consisted of 3 domains: time, pain (visual analog scale), and exertion (Borg scale). Activity specificity was assessed by examining correlations between the LEFS with single activity and multiple activity time scores. Domain specificity was examined by comparing correlations between the LEFS and single and multiple domain scores. The impact of measurement error was considered.

Results

Increasing the number of activity time scores had no effect. Forming a composite performance score based on time, pain, and exertion substantially increased the correlation from 0.44 (composite timed score) to 0.61 (pooled domain and activity score) (P = 0.009).

Conclusion

Performance scores based on time alone appear to inadequately represent the breadth of health concepts associated with functional status.

INTRODUCTION

Choosing an appropriate functional outcome measure for either clinical practice or research investigations involves a number of decisions. Not only does a clinician have to choose measures with sound measurement properties but also, depending on the nature of the study, there is the associated dilemma of deciding which type of measure to use. These measures generally fall into 2 broad categories, self-report measures and performance measures (1).

A self-report measure asks the patient's opinion on selected items; a performance measure is one in which an individual is asked to perform a specific task that is evaluated in a standardized manner using predetermined criteria, such as counting repetitions or timing the activity (2). Although both methods are applied frequently by clinicians and researchers to assess patients' outcomes, a recurring conundrum is that of only a modest correlation between the results of these assessment methods (3–6). Suggestions for this magnitude of correlation include measurement error, the extent to which performance measures adequately sample the domain of interest, and the relationship between physical performance measures and the true demands associated with activities of daily living (1, 7).

Physical performance measures have often not been as thoroughly developed and tested; hence, information on their measurement properties does not exist to the same extent as for self-report measures. Performance measures tend to assess only a single attribute from the domain of interest. For example, assessing a patient's lower extremity functional status on stairs does not provide a comprehensive evaluation of the many activities associated with functional status (e.g., ability to sit, roll over in bed, dress, drive a car, perform household or recreational activities) (8). In contrast, self-report measures are capable of evaluating a number of aspects of function in a single test, possess good content validity, and often have well-established measurement properties. The usefulness of these measures, however, has been challenged because the memory or judgment of those with impaired cognitive function, and the willingness and ability of subjects to answer accurately, may be in question (2, 9).

Given the consistent modest correlation between time-based performance measures and self-report functional status measures, the goal of this study was to better understand the determinants of this relationship in patients with a diagnosis of osteoarthritis awaiting hip and knee arthroplasty. To address content specificity and the impact of measurement error, a number of hypotheses were formed using 3 timed performance measures (40-meter fast self paced walk [SPW], timed up-and-go [TUG], and 10-step stair test [ST]) and 1 self-report measure (Lower Extremity Functional Scale [LEFS]). In addition, pain and exertion ratings were collected for each performance measure. The hypotheses were as follows:

Hypothesis 1. The LEFS assesses aspects of performance that are not restricted to the time to complete the activity. The rationale for this hypothesis is that the LEFS inquires about difficulty that patients may interpret broadly.

Hypothesis 2. The correlation between the LEFS and a composite performance score based on the sum of the 3 standardized time scores (SPW, TUG, and ST) will be greater than the correlation between the LEFS and the individual time components of each of the 3 tests. It was hypothesized that increasing the number of activities should increase the activity aspect of content validity and reduce the measurement error associated with the assessment (internal consistency is a function of the number of items composing a test).

Hypothesis 3. The correlation between the LEFS and a composite performance score based on the sum of the standardized time, pain, and exertion scores for the 3 physical performance tests will be greater than the correlation between the LEFS and the time components alone for each of the 3 tests. The rationale for this hypothesis is that by adding pain and exertion domains, the content validity of the performance battery of tests will be further enhanced.

Hypothesis 4. The correlation between the sum of the 3 LEFS walk item scores and a composite performance SPW score based on the sum of the standardized time, pain, and exertion walk components will be greater than the correlation between the sum of the LEFS walk items and the time component for the SPW test. The rationale is that by restricting the activity content, the impact of insufficient activity sampling will be substantially reduced.

PATIENTS AND METHODS

Patients.

The study sample represented consecutive patients fulfilling the eligibility criteria. Patients awaiting total hip or knee arthroplasty were eligible for this study if they were able to speak and comprehend written English, provided informed consent, and were diagnosed as having osteoarthritis of the hip or knee. Data were collected at a single point in time at patients' presurgical orientation visits. These visits were scheduled approximately 2 weeks prior to surgery. One hundred four patients participated in this study. Eleven patients were unable to perform the stair test and these patients were excluded from the analyses. Accordingly, data for 93 patients were included in the analyses.

Performance-related measures.

Each patient completed a timed SPW, TUG, and ST. Time was measured on a stopwatch to the nearest 1/100 of a second. Perceived pain using a 10-cm horizontal visual analog scale (10) and a rating of perceived exertion using Borg's modified scale (11) were recorded immediately following each of the performance measures. The anchors for each of these scales were as follows: pain, 0 for no pain to 10 for pain as bad as it can be; and exertion, 0 for nothing at all to 10 for maximal. Therefore, each performance test received 3 scores: the time to complete the test, the pain associated with the test, and the exertion experienced during the test.

The SPW test was applied first, the TUG was applied second because it posed less physical challenge, and the ST was applied third. Patients were given the opportunity to rest between tests as needed.

The guidelines for these performance measures have been reported previously on a similar patient population (12). In terms of the fast SPW, patients walked 2 lengths of a 20-meter indoor course in response to the instruction: “walk as quickly as you can without overexerting yourself.” To complete the TUG, patients were required to rise from a standard arm chair, walk at a safe and comfortable pace to a mark 3 meters away, then return to a sitting position in the chair. The ST required patients to ascend and descend 10 stairs (step height 20 cm) in their usual manner, at a safe and comfortable pace. Patients were able to use their walking aids during the administration of each of these measures.

Self-report functional status measure.

The LEFS (see Appendix A) is a 20-item condition-specific functional status measure applicable to a wide spectrum of patients with lower extremity conditions of musculoskeletal origin (13). Each LEFS item is scored on a Likert scale from 0 to 4 with higher scores representing higher functional levels, the maximum score being 80. Most patients can complete the LEFS unassisted in less than 3 minutes, and clinicians can score it without the use of computational aids in about 20 seconds.
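The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `score_lefs` is hypothetical, and the choice of items d, k, and l as the 3 walking items is an assumption based on the item wording in Appendix A, not something the text states explicitly.

```python
# Sketch of LEFS scoring: 20 items, each rated 0 (extreme difficulty) to 4 (no difficulty).
LEFS_ITEMS = "abcdefghijklmnopqrst"  # item letters as listed in Appendix A

# Assumption: the 3 walking items are d, k, and l
# (walking between rooms, walking 2 blocks, walking a mile).
WALK_ITEMS = ("d", "k", "l")


def score_lefs(responses):
    """Return (total score out of 80, walk-item subscore out of 12)."""
    if set(responses) != set(LEFS_ITEMS):
        raise ValueError("expected one response for each of the 20 items a-t")
    for item, value in responses.items():
        if value not in (0, 1, 2, 3, 4):
            raise ValueError(f"item {item}: response must be an integer from 0 to 4")
    total = sum(responses.values())
    walk = sum(responses[item] for item in WALK_ITEMS)
    return total, walk


# Hypothetical patient who circles "moderate difficulty" (2) on every item.
example = {item: 2 for item in LEFS_ITEMS}
print(score_lefs(example))  # (40, 6)
```

Higher totals indicate better function, so the walk subscore ranges from 0 to 12 under the item assumption above.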

The LEFS underwent extensive development to ensure content validity and sound psychometric properties (13). Several investigations have shown that the LEFS has high levels of internal consistency (0.96), test-retest reliability (sample-dependent estimates range from 0.85 to 0.94), convergent and discriminant validity, and sensitivity to change (8, 13). It has been validated on patients with osteoarthritis of the hip or knee who underwent total joint arthroplasty (8). Of additional interest to the present study is that the LEFS has 3 items concerning walking.

Statistical analyses

The mean and standard deviation were applied as measures of central tendency and dispersion for all continuous measures. To test the first hypothesis, we applied an exploratory factor analysis with varimax and oblique rotations. Variables included in the factor analysis were as follows: SPW time, SPW pain, SPW exertion, TUG time, TUG pain, TUG exertion, ST time, ST pain, ST exertion, and LEFS score.

Pearson's correlation coefficients were applied to the subsequent analyses. Two-sided 95% confidence intervals were calculated for all correlation coefficients. Because the correlation coefficients being compared were obtained on the same patient sample, we applied an analysis that took into account dependent observations (14). The critical 2-sided overall P value for rejecting the null hypothesis of no difference between correlation coefficients was set at the 0.05 probability level.
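The text does not say how the 95% confidence intervals were computed. A common choice is the Fisher z transformation, sketched below as an assumption; the function name `pearson_ci` is hypothetical. With r = 0.44 and n = 93 it reproduces the interval reported in Table 4 for the SPW time and the LEFS.

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Two-sided confidence interval for a Pearson r via the Fisher z transformation.

    Assumption: this is one standard method; the paper does not name the one it used.
    """
    z = math.atanh(r)              # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = pearson_ci(0.44, 93)
print(f"{low:.2f} to {high:.2f}")  # 0.26 to 0.59
```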

Data were pooled into a single score to explore hypotheses 2, 3, and 4. Pooled scores were obtained by first converting each measure's score to a common metric. This was accomplished by transforming each measure's observed score to a Z score. Thus, the transformed scores for each measure had a mean of 0 and a standard deviation of 1.
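The pooling step can be illustrated with a short sketch. The numbers below are made up for 5 hypothetical patients and are not study data; the helper names `z_scores` and `pooled_score` are also hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify which form was used.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pooled_score(*measures):
    """Sum the z scores of several measures, patient by patient."""
    standardized = [z_scores(m) for m in measures]
    return [sum(patient) for patient in zip(*standardized)]


# Hypothetical times in seconds for 5 patients (not data from the study).
spw = [38.0, 45.5, 52.1, 41.0, 60.3]
tug = [9.8, 12.4, 15.0, 11.1, 19.7]
stair = [15.2, 22.0, 30.5, 18.9, 41.3]

composite_time = pooled_score(spw, tug, stair)
print([round(c, 2) for c in composite_time])
```

Standardizing first keeps a measure with a large raw spread, such as stair time, from dominating the composite.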

To test the second hypothesis, Z scores were summed for the 3 timed tests, and this value was correlated with the LEFS total score. The time scores for each of the 3 measures were also correlated with the LEFS total score.

The third hypothesis was tested by comparing the correlation between the LEFS and the sum of the 3 timed tests with the correlation between the LEFS and the sum of the time, pain, and exertion components for each of the 3 timed tests (i.e., a total of 9 components).
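The method used to compare correlations from the same sample is cited as reference 14 and is not described further here. As a stand-in, the sketch below uses the z test of Meng, Rosenthal, and Rubin for two correlations that share one variable; this is a swapped-in technique and may differ from the authors' procedure. The function name is hypothetical, and the correlation between the two performance scores (`r12`) is not reported in the paper, so the value used in the example is invented.

```python
import math


def compare_dependent_correlations(r1, r2, r12, n):
    """Meng-Rosenthal-Rubin z test for two correlations sharing one variable.

    r1, r2: correlations of the shared variable (e.g., the LEFS) with scores 1 and 2.
    r12:    correlation between scores 1 and 2.
    n:      number of patients.
    Returns (z, two-sided P value).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    mean_r_sq = (r1 ** 2 + r2 ** 2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - mean_r_sq)), 1.0)  # f is capped at 1
    h = (1.0 - f * mean_r_sq) / (1.0 - mean_r_sq)
    z = (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p


# r12 = 0.70 is a hypothetical value chosen only for illustration.
z, p = compare_dependent_correlations(0.61, 0.44, 0.70, 93)
print(f"z = {z:.2f}, P = {p:.3f}")
```

Because the two scores being compared overlap heavily in content, `r12` is likely to be high, and the test becomes more sensitive as `r12` approaches 1.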

The fourth hypothesis further examined the extent to which content specificity affected the size of the various correlation coefficients of interest. This was examined by calculating the correlation between the LEFS 3 walk items and the performance measures.

Finally, to illustrate the extent to which measurement error potentially influenced the size of the various correlation coefficients, we calculated the theoretical maximum correlations assuming perfect reliability. This correction for attenuation was accomplished by dividing the observed correlation coefficient by the square root of the product of the reliabilities of the 2 measures being correlated (15). Coefficient alpha, a measure of internal consistency, served as the reliability coefficient. Coefficient alpha was calculated for all pooled performance-related scores, the LEFS total score, and the sum of the LEFS 3 walk items. For example, the observed correlation between the pooled SPW components and the LEFS walk items is 0.66. The internal consistency of the SPW components is 0.65 and the internal consistency of the LEFS walk items is 0.78. Thus, the corrected correlation between the pooled SPW components and the LEFS walk items is 0.93 (i.e., 0.66/√(0.65 × 0.78)). The corrected correlation coefficients indicate the theoretical correlation adjusting for the reliability of the 2 measures being correlated. When the corrected coefficient is similar to the actual coefficient, the interpretation is that the magnitude of the actual coefficient cannot be attributed to poor reliability.
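The correction for attenuation and coefficient alpha can both be written out briefly. The sketch below uses hypothetical function names; `coefficient_alpha` implements the standard Cronbach formula with sample variances, which is an assumption about the exact computation used. The worked values are the ones given in the text and in Table 4.

```python
import math
from statistics import variance


def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Disattenuated correlation: r divided by sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(reliability_x * reliability_y)


def coefficient_alpha(items):
    """Cronbach's alpha.

    `items` is a list of per-item score lists (same patients, same order).
    """
    k = len(items)
    totals = [sum(patient) for patient in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1.0 - item_variance_sum / variance(totals))


# Pooled SPW components vs. LEFS walk items (values from the text).
print(round(correct_for_attenuation(0.66, 0.65, 0.78), 2))  # 0.93
# Pooled time score vs. LEFS total score (values from Table 4).
print(round(correct_for_attenuation(0.44, 0.92, 0.92), 2))  # 0.48
```

When both reliabilities are near 1, as in the second call, the corrected value barely moves from the observed one.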

RESULTS

Table 1 displays a summary of the patients' characteristics, including sex, joint distribution, age, weight, and height. Table 2 provides the descriptive statistics for the performance-related measures. The mean ± SD LEFS score was 28.8 ± 12.6.

Table 1. Patient characteristics*

                              Total          Female          Male
                              n = 93         n = 46 (49%)    n = 47 (51%)
Age, years, mean ± SD         63.2 ± 11.3    64.3 ± 11.9     62.3 ± 10.8
Weight, kg, mean ± SD         85.7 ± 20.3    77.3 ± 20.1     93.8 ± 17.0
Height, meters, mean ± SD     1.67 ± 0.10    1.60 ± 0.08     1.74 ± 0.05
BMI, mean ± SD                30.6 ± 6.2     30.2 ± 7.3      30.9 ± 5.0
Hip, n (%)                    42 (45)        18 (43)         24 (57)
Knee, n (%)                   51 (55)        28 (55)         23 (45)

* BMI = body mass index.

Table 2. Performance-related measures, mean ± SD*

          Time (seconds)    Pain (cm)    Exertion (cm)
SPW       43.7 ± 10.8       3.0 ± 2.4    2.1 ± 1.5
TUG       12.3 ± 5.3        2.6 ± 2.5    1.9 ± 1.5
Stair     23.5 ± 14.1       3.3 ± 2.6    2.5 ± 1.5

* SPW = self paced walk; TUG = timed up-and-go; Stair = stair test.

The results of the varimax and oblique rotations provided similar findings. Table 3 reports the results from the exploratory factor analysis using a varimax rotation. Three factors that accounted for 78% of the variance were identified using an eigenvalue greater than 1 rule. The predominant factor themes appear to be time, pain, and exertion. The LEFS demonstrated factor complexity by loading on all 3 factors. This finding supports our first hypothesis stating that the LEFS assesses aspects of performance that are not restricted to the time to complete the activity.

Table 3. Factor analysis with varimax rotation*

                   Factor 1    Factor 2    Factor 3
Stair time         0.93        0.10        0.13
TUG time           0.88        0.30        0.06
SPW time           0.87        0.15        0.22
SPW pain           0.12        0.85        0.23
TUG pain           0.16        0.08        0.01
Stair pain         0.23        0.81        0.28
LEFS               0.35        0.44        0.41
TUG exertion       0.16        0.22        0.85
SPW exertion       0.01        0.24        0.84
Stair exertion     0.31        0.21        0.83

* Stair = stair test; TUG = timed up-and-go; SPW = self paced walk; LEFS = Lower Extremity Functional Scale.

Table 4 provides the correlation coefficients between the performance-related measures and the LEFS total and LEFS walk items. This table also reports the coefficient alpha values used to calculate the corrected correlation coefficients. The correlation between the LEFS and the SPW time, the highest of the 3 timed tests, is identical to the correlation between the LEFS and the composite performance score based on the sum of the standardized time scores for the 3 tests (0.44 in both cases). This finding does not support our second hypothesis.

Table 4. Correlation coefficients between performance measures and LEFS (95% CI)*

                                      LEFS (α = 0.92)     LEFS walk items (α = 0.78)
SPW time                              0.44 (0.26–0.59)    0.47 (0.29–0.61)
TUG time                              0.42 (0.24–0.57)    0.37 (0.18–0.53)
Stair time                            0.37 (0.18–0.53)    0.36 (0.17–0.53)
Pooled time: SPW, TUG, Stair
  Observed                            0.44 (0.26–0.59)    0.43 (0.25–0.58)
  Corrected (α = 0.92)                0.48                0.51
Pooled all performance measures
  Observed                            0.61 (0.46–0.72)    0.64 (0.50–0.75)
  Corrected (α = 0.88)                0.68                0.77
Pooled SPW: time, pain, exertion
  Observed                            0.59 (0.44–0.71)    0.66 (0.53–0.75)
  Corrected (α = 0.65)                0.76                0.93

* LEFS = Lower Extremity Functional Scale; 95% CI = 95% confidence interval; SPW = self paced walk; TUG = timed up-and-go; Stair = stair test. α = coefficient alpha.

The correlation between the LEFS and a composite performance score based on the sum of the standardized time, pain, and exertion scores for the 3 tests (i.e., 9 components) is significantly greater (z = 2.61, P = 0.009) than the correlations between the LEFS and the individual time components for the walk test. This finding supports our third hypothesis.

The correlation between the sum of the 3 LEFS walk item scores and the composite SPW score based on the sum of the standardized time, pain, and exertion walk scores is significantly greater (z = 3.04, P = 0.002) than the correlation between the sum of the LEFS walk items and the time component for the SPW test. This finding supports our fourth hypothesis.

Inspection of the actual and corrected coefficients demonstrates the following: 1) a small gain when the time components of the 3 performance tests are correlated with the LEFS total score (0.44 to 0.48); 2) a modest gain when the 3 components of the performance tests are correlated with the LEFS (0.61 to 0.68); and 3) a substantial gain when the 3 components of the SPW are correlated with the sum of the 3 LEFS walk items (0.66 to 0.93). Because the corrected coefficients represent conceptual values, no statistical comparisons were made with the actual correlation coefficients.

DISCUSSION

Having accepted the consistent finding of a modest correlation between performance measures and self-report functional status measures, investigators have speculated on potential explanations for this finding. Measurement error and the notion that performance tasks represent simplifications of the demands associated with activities of daily life are popular suggestions for the modest correlation; however, there is a paucity of studies that formally investigate these claims (1). The intent of the current investigation was to examine these speculations in the context of a hypothesis testing study.

Typically, investigations of performance measures have restricted the attribute of interest to a single concept. For the lower extremity, the concept is often the time to complete the task; variations on this outcome include distance, speed, endurance, and strength. Our study employed a battery of 3 lower extremity activities reported frequently in the literature: SPW, TUG, and ST. Like previous studies, we found a modest relationship between the time component of these tests and the self-report functional status measure (3, 16, 17).

The next aspect of our study explored the extent to which measurement error in the form of compromised reliability influenced the magnitude of correlation between the timed tests and LEFS. The observed correlation between the LEFS and the composite score of the timed components of the 3 performance tests did not differ from the correlation between the SPW and the LEFS. Moreover, the corrected correlation between composite time score (0.48) and the LEFS did not increase substantially from the observed correlation (0.44). The principal reason for this small gain is the high reliability, as represented by coefficient alpha, of the composite performance time score (0.92) and the high reliability of the LEFS (0.92). Accordingly, measurement error in the form of compromised reliability did not play a prominent role in accounting for the modest correlation between the composite time score of the 3 performance measures and the LEFS.

The second aspect of our study examined the extent to which the modest correlation between performance tests and self-report functional status measures could be accounted for by the breadth of attributes being assessed by the performance tests. Part of the design for this aspect of the study overlapped with the reliability study. Specifically, increasing the number of timed activities from 1 (SPW) to 3 (SPW, TUG, ST) did not affect the size of the correlation. A potential explanation for this finding is that the 3 activities all had a strong underlying theme, perhaps that of ambulation, and that this theme was not diverse enough to adequately represent the extent of lower extremity functional status. Thus, although the composite performance time score had excellent reliability, it is possible that it did not possess a high level of content validity and did not sample comprehensively the domains that comprise lower extremity function.

In addition to investigating time, our study also examined the extent to which lower extremity functional status, as measured by the LEFS, was associated with pain and exertion. There were 3 aspects to this part of the study. First, we performed a factor analysis. The results demonstrated that the LEFS displayed factor complexity by loading on time, pain, and exertion. This finding is informative because the LEFS inquires about difficulty and does not specifically identify any of these attributes. The second part of this aspect of the study examined the correlation between the LEFS and a composite score based on the 3 components (time, pain, and exertion) of the performance measures. This correlation (0.61) was significantly greater than the correlation with SPW (0.44) alone and the composite time score from the 3 tests (0.44). This suggests that by increasing the breadth of health concepts (i.e., time, pain, and exertion) associated with the performance score, greater correlation is achieved with the self-report measure. The final part of this aspect of inquiry focused on the activity of walking. The idea was that by restricting the content to a single activity, one could obtain a clearer picture of the association between the performance measure and the self-report measure without the worry of adequately sampling from the breadth of activities associated with lower extremity functional status. The correlation between the SPW time and the sum of the LEFS 3 walk items (0.47) was not appreciably different from the correlation between the SPW time and the LEFS total score (0.44). However, the correlation between the LEFS walk items and the composite SPW score based on time, pain, and exertion (0.66) was significantly greater than the correlation between the LEFS walk items and the SPW time. Moreover, the corrected correlation coefficient of 0.93 suggests that the potential for a higher correlation exists if the reliability of the 2 measures can be increased.

The results from this study suggest several points. The first is that the time to complete a performance test does not appear to adequately capture the breadth of health concepts associated with a self-report functional status measure, even when the focus of the functional status measure is restricted to a single activity (e.g., walking). The second is that the specific phrasing associated with the self-report measure appears to influence the breadth of domains perceived as being relevant by the patient. For example, the LEFS inquires about difficulty, and the factor analysis indicated that patients interpret this as time to complete the task, pain, and exertion. Although patients may assign more meanings to the term difficulty, our study investigated only these 3. A third point is the finding that a battery of tests that draws strongly on the same concept, in our case ambulation, does not result in an appreciable gain in the correlation with a self-report measure.

In terms of deciding which measure to select for clinical and research purposes, the final choice must be directed by the measure's measurement properties, desired outcome of interest, and the goal of the investigation. In the osteoarthritis population, depending on the measures used, studies can be found that support using self-report measures alone or in combination with physical performance measures (3, 16, 18–20). Self-report measures do offer an efficient and cost effective method of comprehensively sampling from the domain of interest. However, there are situations where the choice of a physical performance measure is preferable. For example, if the goal were to determine whether a patient is able to cross an intersection in the time allocated by a traffic signal, then a timed walk test would be the measure of interest. Clearly, other examples favoring a performance measure can be found in the field of rehabilitation. In these cases, using a self-report measure might not truly capture the degree of disability specific to the desired task.

There are several potential limitations associated with the current work. First, the study sample is specific to patients awaiting total joint replacement. Accordingly, the extent to which our findings are generalizable to patients with less severe osteoarthritis of the hip or knee is unknown. A second potential limitation is that one cannot rule out a carryover effect in the reporting of exertion and pain. However, we do not believe this is likely because the mean pain and exertion scores do not show a monotonic increase related to the order of testing, but rather demonstrate a hierarchy consistent with the physiologic difficulty associated with the activities.

In summary, the results of our cross-sectional study suggest 2 reasons for the modest correlation between performance measures and self-report functional status measures: the reliance on time alone and insufficient sampling from the domain of potential concepts (e.g., pain and exertion) associated with functional status; and not achieving a high reliability when there is an adequate representation from the activity domain.

Acknowledgements

The authors thank Julie Richardson PT, MSc, PhD (candidate) for her suggestions concerning an earlier version of this manuscript.

APPENDIX A

LOWER EXTREMITY FUNCTIONAL SCALE

We are interested in knowing whether you are having any difficulty at all with the activities listed below because of your lower limb problem for which you are currently seeking attention. Please provide an answer for each activity.

Today, do you or would you have any difficulty at all with:

(Circle one number on each line)

  0 = Extreme difficulty or unable to perform activity
  1 = Quite a bit of difficulty
  2 = Moderate difficulty
  3 = A little bit of difficulty
  4 = No difficulty

ACTIVITIES
a. Any of your usual work, housework or school activities.     0  1  2  3  4
b. Your usual hobbies, recreational or sporting activities.    0  1  2  3  4
c. Getting into or out of the bath.                            0  1  2  3  4
d. Walking between rooms.                                      0  1  2  3  4
e. Putting on your shoes or socks.                             0  1  2  3  4
f. Squatting.                                                  0  1  2  3  4
g. Lifting an object, like a bag of groceries from the floor.  0  1  2  3  4
h. Performing light activities around your home.               0  1  2  3  4
i. Performing heavy activities around your home.               0  1  2  3  4
j. Getting into or out of a car.                               0  1  2  3  4
k. Walking 2 blocks.                                           0  1  2  3  4
l. Walking a mile.                                             0  1  2  3  4
m. Going up or down 10 stairs (about 1 flight of stairs).      0  1  2  3  4
n. Standing for 1 hour.                                        0  1  2  3  4
o. Sitting for 1 hour.                                         0  1  2  3  4
p. Running on even ground.                                     0  1  2  3  4
q. Running on uneven ground.                                   0  1  2  3  4
r. Making sharp turns while running fast.                      0  1  2  3  4
s. Hopping.                                                    0  1  2  3  4
t. Rolling over in bed.                                        0  1  2  3  4

Column Totals:
Score: _____/80
