Responsiveness of five outcome measurement instruments in total elbow arthroplasty




To quantify and compare the sensitivity to change of 5 outcome instruments for the elbow joint.


In a prospective cohort study (n = 65), outcome was measured by the Short Form 36 (SF-36), the Disabilities of the Arm, Shoulder, and Hand questionnaire (DASH), the modified American Shoulder and Elbow Surgeons questionnaire for the elbow, patient part (pmASES) and examiner/clinical part (cmASES), and the Patient-Rated Elbow Evaluation form (PREE). Responsiveness was quantified by the effect size (ES) and standardized response mean (SRM) before and 6 months after total elbow arthroplasty. Receiver operating characteristic (ROC) curves were used to determine the instruments' ability to classify effects into global health change assessment categories.


For the total scores, the ES were 1.50 for the PREE, 1.32 for the pmASES, 0.86 for the cmASES, 0.56 for the DASH, and 0.11 for the SF-36 (P ≤ 0.002 for all differences, except the cmASES and DASH). The same order was found within the subdomains of pain/symptoms and function and remained consistent when using the SRM and in ROC curve analysis. PREE total (area under the ROC curve 0.68), DASH function, and pmASES total and pain (area under the ROC curve range 0.64–0.67) discriminated best between “much better” and the other categories.


The PREE was the most responsive instrument and can be recommended for every set of measures for elbow joint disorders. The pmASES was slightly less responsive but is a valid alternative. The examiner-assessed cmASES is affected by concerns regarding validity and was relatively less responsive. The DASH for comprehensive measurement of the entire upper extremity and the SF-36 for chronic pain conditions complete the assessment set.


Outcome measurement before and after interventions has become a matter of course in clinical reality, and development of standardized instruments for this purpose has reached a peak in the past 2 decades. In particular, self-assessment and its validity have become increasingly important, since the patient's self-perceived symptoms and disabilities are the driving force that leads to medical interventions and costs (1–3). This concept has been incorporated into the World Health Organization's International Classification of Functioning, Disability and Health and the recommendations of the leading rheumatology associations (4, 5).

The properties of patient-rated outcomes are well defined and have to be tested in a standardized way (6, 7). One of the key properties in measurement of patients' health status (e.g., before and after interventions) is the sensitivity to change, i.e., the responsiveness that is part of the validity for longitudinal measurement (7). Although there are many data on responsiveness for the shoulder and, to a lesser extent, for the wrist/hand, few studies have examined the elbow in the context of upper extremity disorders (3, 8). Data sampling for elbow diseases is much more time consuming than for the shoulder and wrist because the elbow is more rarely affected by late-stage arthritis (3, 9). Cross-sectional validity of the 5 instruments examined here was tested in our previous study (9). Longitudinal validity, to which responsiveness is sometimes attributed, has been examined for the shoulder joint in our previous study (10) and is determined for the elbow joint in this analogous study.

This study aimed to examine and compare the responsiveness of the Short Form 36 (SF-36), the Disabilities of the Arm, Shoulder, and Hand questionnaire (DASH), the modified American Shoulder and Elbow Surgeons questionnaire for the elbow, patient-rated part (pmASES) and examiner-rated, clinical part (cmASES; as a contrast to the self-report instruments), and the Patient-Rated Elbow Evaluation form (PREE) in patients undergoing elbow replacement surgery. Together with previous findings on feasibility, reliability, and cross-sectional validity, this should help to characterize instruments that 1) allow standardized comprehensive functional staging as well as sensitive and specific measurement and comparison of patient groups, 2) provide benchmarks for the patients' benefit, and 3) can be used to monitor the course of individual patients.

Significance & Innovations

  • This is the first report comparing 5 well-established elbow outcome instruments with a focus on their responsiveness in a longitudinal study.

  • The elbow-specific Patient-Rated Elbow Evaluation form and patient-modified American Shoulder and Elbow Surgeons (mASES) questionnaire for the elbow were most responsive, followed by the examiner-based, elbow-specific mASES, the arm-specific Disabilities of the Arm, Shoulder, and Hand questionnaire, and the generic Short Form 36.

  • This dose (i.e., joint specificity)-response relationship that was heuristically expected has been proven by empirical data. Conclusions and consequences for use of the tools in the clinic and in research are discussed.



A consecutive patient cohort was recruited from those attending an ongoing outcome assessment at the Department of Upper Extremity and Hand Surgery, Schulthess Klinik, Zurich, Switzerland. Inclusion criteria were: 1) primary unilateral total elbow arthroplasty (revision and second contralateral arthroplasties were excluded); 2) ability to complete self-assessment questionnaires, especially sufficient German language skills and psychointellectual abilities, i.e., absence of a severe somatic or psychiatric illness that would prevent participation (e.g., terminal cancer, dementia); and 3) written informed consent. Fresh fractures were also excluded in order to obtain a homogeneous cohort. The study's protocol was approved by the local ethics committee of the Canton Zurich, Switzerland.


The SF-36 is the most widespread generic instrument and comprehensively measures physical, mental, and psychosocial health by 36 Likert-scaled items composing 8 scales and 2 component summary scales (11, 12). We used version 1.0. The DASH is the most widely used region-specific tool for the upper extremity and measures symptoms (6 items, of which 3 are about pain) and mainly function (24 items) (13, 14). Originally, the DASH was described as a 1-dimensional instrument (total scale by 30 items, all scaled on 5 Likert levels). The division into subscales for symptoms and function is obvious when looking at the items' content and has been examined in our previous study (9). However, psychometric testing of the concept using the 2 subscales is ongoing (13). The joint-specific mASES for the elbow consists of a patient self-rated part and an examiner-based part (15, 16). The joint-specific pmASES consists of 3 scales: a scale each for pain (5 visual analog scale [VAS] items), function (12 Likert-scaled items with 4 levels), and satisfaction (1 VAS scale) (15, 16). The examiner-rated cmASES has 5 subscores and a total score and was previously described in detail (9). The elbow-specific PREE has 5 items for pain and 15 for function (10 for specific, 5 for usual activities), scaled on an 11-point (range 0–10) numerical rating scale (17, 18). For the pmASES and the PREE, the average of the pain and function scores forms the total scores. An overview of the domains and the (sub)scales of the instruments is shown in Table 1. Of all of the instruments, the validated German versions were used. Further detailed descriptions of all of the instruments with references to the original literature can be found in the study by Angst et al (9).

Table 1. Descriptive and responsiveness data*

| Scale | N | Baseline, mean ± SD | Followup, mean ± SD | Difference, mean ± SD | ES | SRM | MID (ES) | MID (SRM) | Reliability, ICC | AUC (95% CI) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SF-36 physical functioning | 64 | 48.7 ± 23.9 | 50.4 ± 24.8 | 1.7 ± … | … | … | … | … | … | … (0.32–0.57) | 0.28 | 0.80 |
| SF-36 role physical | 61 | 34.8 ± 40.9 | 37.7 ± 39.2 | 2.9 ± … | … | … | … | … | … | … (0.36–0.61) | 0.26 | 0.93 |
| SF-36 bodily pain | 62 | 36.8 ± 23.7 | 50.5 ± 24.2 | 13.7 ± 27.3 | 0.58 | 0.50 | 0.41 | 0.36 | 0.83 | 0.42 (0.29–0.56) | 0.20 | 0.88 |
| SF-36 general health | 62 | 66.0 ± 22.2 | 64.4 ± 22.9 | −1.6 ± 16.3 | −0.07 | −0.10 | 0.46 | 0.63 | 0.79 | 0.41 (0.27–0.55) | 0.18 | 0.76 |
| SF-36 vitality | 62 | 51.8 ± 20.6 | 58.0 ± 21.3 | 6.2 ± 17.1 | 0.30 | 0.36 | 0.35 | 0.42 | 0.88 | 0.41 (0.29–0.53) | 0.24 | 0.76 |
| SF-36 social functioning | 63 | 77.0 ± 27.0 | 83.3 ± 24.0 | 6.3 ± … | … | … | … | … | … | … (0.47–0.76) | 0.82 | 0.47 |
| SF-36 role emotional | 58 | 81.6 ± 35.4 | 82.2 ± 36.5 | 0.6 ± … | … | … | … | … | … | … (0.36–0.67) | 0.86 | 0.25 |
| SF-36 mental health | 62 | 71.4 ± 18.3 | 77.1 ± 16.9 | 5.7 ± 16.8 | 0.31 | 0.34 | 0.32 | 0.34 | 0.90 | 0.61 (0.46–0.75) | 0.78 | 0.47 |
| SF-36 PCS | 57 | 33.0 ± 9.0 | 34.0 ± 10.3 | 1.0 ± … | … | … | … | … | … | … (0.30–0.55) | 0.26 | 0.87 |
| SF-36 MCS | 57 | 54.5 ± 10.8 | 56.7 ± 10.3 | 2.2 ± … | … | … | … | … | … | … (0.41–0.74) | 0.86 | 0.47 |
| DASH symptoms | 63 | 52.7 ± 20.6 | 68.7 ± 22.4 | 16.0 ± 22.8 | 0.77 | 0.70 | NA | NA | NA | 0.46 (0.32–0.60) | 0.96 | 0.18 |
| DASH function | 59 | 43.1 ± 18.0 | 50.2 ± 19.5 | 7.1 ± 17.1 | 0.39 | 0.41 | NA | NA | NA | 0.65 (0.52–0.77) | 0.69 | 0.59 |
| DASH total | 60 | 44.9 ± 15.9 | 53.8 ± 18.4 | 8.9 ± 16.2 | 0.56 | 0.55 | 0.20 | 0.20 | 0.96 | 0.61 (0.48–0.74) | 0.59 | 0.71 |
| PREE pain | 63 | 39.2 ± 18.4 | 70.5 ± 27.4 | 31.3 ± 24.6 | 1.70 | 1.27 | 0.52 | 0.39 | 0.73 | 0.65 (0.51–0.78) | 0.67 | 0.71 |
| PREE function | 63 | 39.1 ± 23.2 | 62.0 ± 25.1 | 22.9 ± 23.7 | 0.99 | 0.97 | 0.42 | 0.42 | 0.82 | 0.62 (0.48–0.75) | 0.83 | 0.47 |
| PREE total | 63 | 39.2 ± 18.1 | 66.2 ± 23.1 | 27.1 ± 19.8 | 1.50 | 1.37 | 0.45 | 0.41 | 0.80 | 0.68 (0.57–0.81) | 0.63 | 0.71 |
| pmASES pain | 62 | 44.2 ± 18.1 | 72.3 ± 26.6 | 28.1 ± 24.4 | 1.55 | 1.15 | 0.32 | 0.23 | 0.90 | 0.64 (0.51–0.78) | 0.64 | 0.71 |
| pmASES function | 61 | 36.1 ± 24.5 | 54.9 ± 25.3 | 18.8 ± 25.0 | 0.77 | 0.75 | 0.36 | 0.35 | 0.87 | 0.62 (0.47–0.78) | 0.58 | 0.63 |
| pmASES satisfaction | 63 | NA | 84.4 ± 22.0 | NA | NA | NA | 0.56 | NA | 0.69 | NA | NA | NA |
| pmASES total | 59 | 40.5 ± 17.8 | 64.0 ± 21.9 | 23.5 ± 20.1 | 1.32 | 1.… | … | … | … | … (0.53–0.80) | 0.65 | 0.69 |
| cmASES symptoms | 60 | 72.7 ± 21.3 | 92.0 ± 14.3 | 19.3 ± 22.6 | 0.91 | 0.86 | | | | 0.51 (0.33–0.70) | 0.91 | 0.31 |
| cmASES motion | 60 | 55.4 ± 17.1 | 69.2 ± 11.2 | 13.9 ± 17.3 | 0.81 | 0.80 | | | | 0.45 (0.32–0.59) | 0.37 | 0.69 |
| cmASES stability | 60 | 78.7 ± 32.3 | 96.9 ± 7.7 | 18.1 ± 31.3 | 0.56 | 0.58 | | | | 0.60 (0.46–0.74) | 0.49 | 0.69 |
| cmASES strength | 59 | 78.2 ± 21.7 | 86.3 ± 25.4 | 8.1 ± 33.5 | 0.37 | 0.24 | | | | 0.50 (0.34–0.68) | 0.41 | 0.76 |
| cmASES grip strength, kg | 55 | 12.0 ± 9.7 | 15.7 ± 8.8 | 3.7 ± 5.7 | 0.38 | 0.65 | | | | 0.53 (0.39–0.68) | 0.46 | 0.73 |
| cmASES total | 54 | 61.2 ± 14.8 | 73.9 ± 6.9 | 12.7 ± 13.9 | 0.86 | 0.91 | | | | 0.56 (0.40–0.72) | 0.92 | 0.33 |

  • * Scores range from 0–100 for all scales, where 0 = worst and 100 = no pain/best function/health. N = number of complete data sets; ES = effect size; SRM = standardized response mean; MID = minimum important difference; ICC = intraclass correlation coefficient (10–18); AUC = area under the receiver operating characteristic curve; 95% CI = 95% confidence interval; SF-36 = Short Form 36; PCS = physical component summary; MCS = mental component summary; DASH = Disabilities of the Arm, Shoulder, and Hand questionnaire; NA = not available; PREE = Patient-Rated Elbow Evaluation form; pmASES = patient-modified American Shoulder and Elbow Surgeons questionnaire for the elbow; cmASES = clinical modified ASES.

  • Significant.

Statistical analysis.

Assessments were performed on the day before arthroplasty, i.e., upon entry to the clinic just before arthroplasty (baseline), and 6 months later (followup). At both time points, a trained study nurse (SD) distributed and collected the self-assessment instruments as well as the form to self-rate the global health change (transition item; see below) and performed clinical examination for the cmASES. Data had to be completed according to the instrument's missing rules to determine each of the scores, i.e., at least 50% completed items for the SF-36, 90% for the DASH, and 67% (2 of 3) for all other scores (9). All of the scores were scaled from 0–100, where 0 = maximal pain/no function/worst health and 100 = no pain/full function/best health, to be comparable to each other (9). This is the original scoring of the SF-36. The original scores of the DASH, pmASES, and PREE (where 0 = best and 100 = worst) were reversed. All analyses were performed using the statistical software package SPSS for Windows, version 20.0.
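The missing-data rules and the reversal of scores described above can be illustrated by a minimal sketch. It is not the scoring algorithm of any of the instruments: the function names and the dictionary are hypothetical, and the thresholds are simply the fractions stated in the text (50% for the SF-36, 90% for the DASH, and 2 of 3 for all other scores).

```python
from fractions import Fraction

# Minimum fraction of answered items required to compute a score,
# taken from the text above. Names are illustrative assumptions.
MIN_ANSWERED = {
    "SF-36": Fraction(1, 2),
    "DASH": Fraction(9, 10),
    "other": Fraction(2, 3),
}


def is_scorable(instrument, n_answered, n_items):
    """Return True if enough items were answered to compute the score."""
    threshold = MIN_ANSWERED.get(instrument, MIN_ANSWERED["other"])
    # Exact fractions avoid rounding problems such as 2/3 versus 0.67.
    return Fraction(n_answered, n_items) >= threshold


def reverse_score(score):
    """Convert a 0-100 score with 0 = best into one with 100 = best."""
    return 100.0 - score
```

For example, under these assumptions a 30-item DASH form with 27 answered items would be scorable, whereas one with 26 answered items would not.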

The effect size (ES) equals the score difference between followup and baseline divided by the SD of the group's baseline scores, as originally introduced as Glass's delta (19, 20). The score difference (followup − baseline) divided by the SD of the group's score differences determines the standardized response mean (SRM), originally described as Hedges' g for one sample (20). The ES and, to a lesser extent, the SRM are the most common parametric measures of responsiveness (6). The SRMs are shown because many studies report only the SRM and not the ES. A positive ES/SRM reflects (standardized) improvement of health or function in the number of SDs of baseline scores (for ES) or of the score difference between baseline and followup (for SRM). An ES ≥0.80 is regarded as large, 0.50–0.79 as moderate, and <0.50 as small (19).
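The two definitions above can be written out as a short sketch. The analyses in this study were run in SPSS; the Python functions below and the example data are illustrative assumptions only, and the sample SD (n − 1 denominator) is assumed.

```python
from statistics import mean, stdev


def effect_size(baseline, followup):
    """ES: mean score change divided by the SD of the baseline scores."""
    diffs = [f - b for b, f in zip(baseline, followup)]
    return mean(diffs) / stdev(baseline)


def standardized_response_mean(baseline, followup):
    """SRM: mean score change divided by the SD of the score changes."""
    diffs = [f - b for b, f in zip(baseline, followup)]
    return mean(diffs) / stdev(diffs)


# Hypothetical paired scores on the 0-100 scale (100 = best).
baseline = [30.0, 40.0, 50.0, 60.0]
followup = [50.0, 45.0, 75.0, 70.0]

print(round(effect_size(baseline, followup), 2))
print(round(standardized_response_mean(baseline, followup), 2))
```

The same arithmetic can be checked against the group-level values in Table 1: for the PREE total score, 27.1/18.1 ≈ 1.50 (ES) and 27.1/19.8 ≈ 1.37 (SRM).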

The ES or SRM of 2 scales (i.e., of 2 different instruments) measuring the same construct domain (e.g., pain or function) within the same patient group was compared by the modified jackknife test to test their differences for statistical significance (10, 21). In multiple pairwise testing of (at least partly) nonindependent scores (e.g., within the patient rating of pain), the significance level must be divided by the number of pairwise comparisons among the k tested scores, i.e., P = 0.050/(k!/[(k − 2)! × 2!]). This is well known as the Bonferroni correction (22). Therefore, the significance level for the Type I error was a P value of 0.050/10 (0.005) for the comparisons of the k = 5 tools. Construct overlap between the instruments was quantified by the nonparametric Spearman's correlation coefficients of the effects (the ES) (Table 2). The levels of the responsiveness measures should run parallel to the correlation levels. This means that a joint-specific score is supposed to show a high ES/SRM as well as a high correlation to another joint-specific measure. This is regarded as convergent construct validity and, in contrast, low correlation and low ES/SRM as divergent construct validity (6).
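The corrected significance level for pairwise testing described above can be reproduced with a few lines. This is only an arithmetic sketch; the function name is a hypothetical choice and not part of the original analysis.

```python
from math import comb


def bonferroni_alpha(k, alpha=0.05):
    """Return (corrected alpha, number of pairwise comparisons) for k scores."""
    n_pairs = comb(k, 2)  # k! / ((k - 2)! * 2!)
    return alpha / n_pairs, n_pairs


corrected_alpha, n_pairs = bonferroni_alpha(5)
print(n_pairs, corrected_alpha)
```

With k = 5 instruments there are 10 pairwise comparisons, which gives the level of 0.005 used in this study.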

Table 2. Pairwise Spearman's correlation coefficients of the effect sizes between the total scores*
  • *

    SF-36 = Short Form 36; PCS = physical component summary; MCS = mental component summary; DASH = Disabilities of the Arm, Shoulder, and Hand questionnaire; PREE = Patient-Rated Elbow Evaluation form; pmASES = patient-modified American Shoulder and Elbow Surgeons questionnaire for the elbow; cmASES = clinical modified ASES.

  • Correlation coefficient >0.00 with P < 0.050.

SF-36 MCS−0.28    

Another comparative rating of ES is provided by comparison to minimum clinically important differences (MCIDs) (23), i.e., the difference that can be, on average, subjectively perceived as improvement (or deterioration) by the patients. The MCID can be estimated by the calculated minimum important difference (MID), for which several concepts exist (23–25). One is given by the square root of 1 − reliability in units of ES (10, 23). As a reliability measure, intraclass correlation coefficients (ICCs) were obtained by reviewing the literature (Table 1). A responsive instrument should measure score changes, i.e., effects (ES) larger than MIDs.
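The reliability-based MID estimate mentioned above can be sketched as follows. The function name is hypothetical, and the example uses the ICC reported for the SF-36 bodily pain scale in Table 1.

```python
from math import sqrt


def minimum_important_difference(icc):
    """MID in units of ES, estimated as the square root of (1 - reliability)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return sqrt(1.0 - icc)


# SF-36 bodily pain: ICC 0.83 in Table 1.
mid = minimum_important_difference(0.83)
print(round(mid, 2))
```

For an ICC of 0.83 this gives approximately 0.41, which agrees with the MID listed in Table 1; the observed ES of 0.58 on that scale is therefore larger than the MID.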

Sample size was estimated using the Student's t-test as an approximation of the modified jackknife test. The t-test is the special case if the “centered” ES equals zero (y-axis intercept) in the regression of the modified jackknife test (10, 21). In this case, the t-test tends to overestimate the sample size and gives a conservative estimate. Empirical score SDs of 23–28 points of the scale of 0–100 (9) revealed a necessary sample size of 60–65 cases for a statistically significant difference of an a priori ES difference of 0.30. This value was chosen because it is in the middle of the a priori calculated MIDs, i.e., a difference with clinical importance (Table 1).

Another assessment of responsiveness is provided by the so-called sensitivity analysis, i.e., the ability to differentiate global health assessment categories (6, 7, 10). The patients had to rate their global health change (at followup compared to that at baseline) with respect to the affected elbow as a response to the so-called “transition” question as an external criterion (25). The possible response categories were “much worse,” “slightly worse,” “equal,” “slightly better,” and “much better.” If an instrument corresponds to the supposed construct (elbow joint affection), it should measure better health by higher scores. This means that the tool (e.g., the PREE), on average, has to measure higher improvements for patients who have assessed themselves as more highly improved than the other subjects as rated by the transition question. Due to the number of responses within the transition item categories (see Results), the scores of the patients who responded “much better” were compared to those of the 4 other categories (slightly better, equal, slightly worse, or much worse). Data of these 4 categories were collapsed into a single category to obtain a sufficiently high number of responses. This ability to discriminate was tested by sensitivity analysis of the receiver operating characteristic (ROC) curve (10, 24). The ROC curve is the plot of sensitivity (true-positive rate) against 1 − specificity (false-positive rate). An area under the ROC curve (AUC) of 1.00 represents perfect differentiation by the score with 100% sensitivity and 100% specificity; an AUC of 0.50 means no ability of the model to differentiate, i.e., no better than chance; an AUC of 0.70 is usually considered moderate; and an AUC of 0.80 is high and indicates that the instrument classifies and performs well.
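The AUC described above can be illustrated without any statistics package by using its rank interpretation: it is the probability that a randomly chosen “much better” patient shows a larger score change than a randomly chosen patient from the collapsed category. The sketch below is not the SPSS procedure used in this study, and the function names, the data, and the cutoff are hypothetical.

```python
def roc_auc(improved, others):
    """AUC as the share of (improved, other) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for x in improved:
        for y in others:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved) * len(others))


def sensitivity_specificity(improved, others, cutoff):
    """Sensitivity and specificity when a change >= cutoff is classified as improved."""
    true_pos = sum(1 for x in improved if x >= cutoff)
    true_neg = sum(1 for y in others if y < cutoff)
    return true_pos / len(improved), true_neg / len(others)


# Hypothetical score changes (followup - baseline) in the two groups.
much_better = [35.0, 20.0, 28.0, 10.0]
other_categories = [5.0, 12.0, 25.0, 8.0]

print(roc_auc(much_better, other_categories))
print(sensitivity_specificity(much_better, other_categories, cutoff=15.0))
```

Each point of the ROC curve corresponds to one such cutoff; plotting sensitivity against 1 − specificity over all cutoffs and integrating gives the same area as the pairwise count.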


Patient demographics.

Seventy-nine patients who had received total elbow arthroplasty between January 2005 and April 2011 were recruited at baseline. Between baseline and the 6-month followup, 1 patient died, 5 could not be traced because of severe illness or moving abroad, and 2 refused further participation; in addition, 2 elbows had to be revised and 4 had a fresh fracture (exclusion criteria). Therefore, 65 patients (82%) with unilateral, primary elbow joint arthroplasty were available for final examination.

Age ranged from 34.4–87.2 years (mean ± SD 61.9 ± 13.0 years), and 46 patients (71%) were women. Rheumatoid arthritis (RA) was the disease underlying joint destruction in 35 cases (54%), and primary or secondary (mostly after old fracture) osteoarthritis in 30 cases (46%). The right elbow was affected in 37 patients (57%). A Gschwend-Scheier-Bähler III endoprosthesis (Zimmer) was implanted in 56 joints (86%); 6 (9%) received a Coonrad/Morrey endoprosthesis (Zimmer) and 3 (5%) received a Discovery endoprosthesis (Biomet).

Responsiveness analysis.

Complete descriptive data provided by the baseline and followup scores, together with the ES and the SRM, are shown in Table 1 to provide an overview of health and quality of life over the observation period. The number of patients indicates the number of complete baseline and followup data sets according to the missing rules of the questionnaires (see Methods).

Preoperatively (at baseline), patients reported considerable pain/symptoms (mean scores between 36.8 and 52.7, where 0 = maximal pain/symptoms), whereas the examiner-rated symptoms on the cmASES were less severe (mean score 72.7, where 100 = best). Function was impaired on all of the scales (mean range 36.1–48.7, cmASES mean 55.4, where 0 = no function). The mental and psychosocial scales of the SF-36 showed relatively good well-being. At the 6-month followup, satisfaction with the intervention was high (pmASES mean 84.4, where 100 = best).

All pain/symptoms, function, and total scores were around the middle or in the lower half of the possible scale (range 0–100), which allowed determination of the effect data without major floor or ceiling problems. At followup, low effects (ES <0.50) showed up on the SF-36 scales (except SF-36 bodily pain), DASH function, and cmASES strength and grip strength. High effects (ES ≥0.80) were observed on all PREE and pmASES scales and on the cmASES symptoms, motion, and total scores. All other effects were moderate. All high and moderate effects, i.e., those on all PREE and pmASES scales, the DASH total scale, and the SF-36 bodily pain scale, exceeded the estimates for the MID (Table 1); all other SF-36 effects did not. All of these findings remained consistent when using the SRMs.

Construct convergence and divergence were quantified by the correlations of the ES of the total scores and are shown in Table 2. The corresponding data of the baseline scores can be found in a previously published cross-sectional study (9). Elbow joint specificity is expected to be highest on the PREE and pmASES, moderate on the DASH, low on the SF-36 physical component summary (PCS) score, and almost zero on the SF-36 mental component summary (MCS) score, reflecting construct divergence. For example, taking the PREE as a reference, the order of the correlation levels reflected quite well the amount of elbow joint specificity: the PREE correlated with itself by 1.00, to the pmASES by 0.93, to the cmASES by 0.47, to the DASH by 0.54, to the SF-36 PCS by 0.14, and to the SF-36 MCS by 0.15. The examiner-rated cmASES correlated moderately with the elbow-specific self-assessments, the PREE, the pmASES, and the SF-36 MCS. It was compared to these because no other standardized examiner-based tool was available. Among the subscales of the cmASES, motion correlated best with the corresponding pmASES function, and was therefore chosen for comparison of the construct of function (r = 0.33; other data not shown in detail).

Effect comparison of the scores within the 3 construct domains is shown in Figure 1. The instruments are listed by level of ES. Even after the Bonferroni correction (Type I error P < 0.005 is necessary; see Methods), pairwise modified jackknife testing of the ES between the scores showed highly statistically significant differences (all P ≤ 0.002; data not shown in detail), with the following exceptions: SF-36 versus DASH (P = 0.180) and DASH versus cmASES (P = 0.348) in pain/symptoms. In function, the differences between the SF-36 and the DASH (P = 0.009) and all differences between the cmASES and the joint-specific tools (DASH P = 0.007, PREE P = 0.208, and pmASES P = 0.690) were not significant. This may be because the differences of the ES of the cmASES compared to the 3 tools had a high variance that was possibly caused by low construct overlap. For the total scores, the DASH versus the cmASES (P = 0.020) was the only nonsignificant comparison.

Figure 1.

Effect comparison within the 3 constructs. ES = effect size; PREE = Patient-Rated Elbow Evaluation form; pmASES = patient-modified American Shoulder and Elbow Surgeons questionnaire for the elbow; cmASES = clinical ASES; DASH = Disabilities of the Arm, Shoulder, and Hand questionnaire; SF-36 = Short Form 36.

This means that the PREE was significantly more responsive than the pmASES, cmASES, DASH, and SF-36 in all 3 construct domains (pain, function, and total) except when compared to the cmASES in function (Figure 1). The pmASES showed significantly higher responsiveness than the cmASES, DASH, and SF-36 in all 3 constructs except when compared to the cmASES in function. The cmASES was significantly more responsive than the SF-36 in the function and total scores, but its higher responsiveness did not reach statistical significance when compared to the DASH in any of the 3 domains. The DASH total score was significantly more responsive than the SF-36 PCS.

Sensitivity analysis.

At followup 6 months after arthroplasty, health was rated as much better in 47 cases (72%). The categories slightly better (13 cases), equal (0 cases), slightly worse (1 case), and much worse (3 cases) were collapsed into 1 group (17 cases [26%]) to obtain sufficient size for comparison by ROC curve analysis (1 response to the transition item was missing).

The PREE total score discriminated best between much better and the other categories with an AUC of 0.68, which means an almost moderate ability to differentiate (Table 1). It was followed by the pmASES total (AUC 0.67), DASH function (AUC 0.65), PREE pain (AUC 0.65), and pmASES pain (AUC 0.64) scores. The 95% confidence intervals of the other scores covered 0.50, which means that their ability to differentiate was not statistically better than by chance. The PREE function and DASH total scores just missed these criteria.


We tested the ability of 5 elbow outcome measurement instruments to identify and quantify changes of health before and after 65 unilateral, primary total elbow arthroplasties. The PREE was the most responsive instrument in pain, function, and the total score, followed by the pmASES, the cmASES, the DASH, and the SF-36. Accordingly, responsiveness correlated with elbow joint specificity as follows: the elbow-specific PREE and pmASES were followed by the examiner-based, elbow-specific cmASES, the arm-specific DASH, and the generic SF-36. This dose (i.e., joint specificity)–response relationship was heuristically expected and now has been proven by empirical data.

To the best of our knowledge, only 2 longitudinal studies exist that have examined the responsiveness of elbow instruments, but neither of them used the PREE or the mASES (26, 27). The first compared the Hospital for Special Surgery (HSS) form, the Mayo Clinic Elbow Performance Index (MEPI), and the Elbow Function Assessment (EFA) scale in the strata of 18 improved and 6 unchanged patients 10 years ago (26). The EFA scale was the most responsive, followed by the MEPI and the HSS form. Although published data on the EFA scale and the HSS form remained sparse in the following years, the MEPI was used in several studies. In the selection of instruments for our original setting, the MEPI was not included because all MEPI items are covered by the DASH and the PREE, and the binary response options (present/absent) of the MEPI are likely to cause psychometric problems, as analogously shown in shoulder instruments (8). Additionally, a cross-sectional study stated that “both the DASH and the mASES performed as well or better than the elbow-scoring systems” (including the MEPI) (9, 28). The second longitudinal elbow study concluded that the 12-item Oxford Elbow Score, introduced in 2008, was more responsive than the DASH (27). Unfortunately, the authors did not compare the ES using statistical tests.

Based on our present data, we recommend the PREE as the most responsive measurement of elbow-specific symptoms and function. It may be preferred to the pmASES, although the administrative burden is slightly higher because it has 15 function items (the mASES has 12 items); both tools have 5 pain items each. The cross-sectional (r = 0.92) and longitudinal (r = 0.93) construct overlap between the 2 tools was very high (10). In our cross-sectional factor analysis, the PREE loaded slightly higher on the factor “physical specific” (r = 0.81) than the pmASES (r = 0.77) (10). The loadings were 0.43 (PREE) and 0.55 (pmASES) on “physical unspecific.” Together with the present responsiveness data, this is consistent with the statement that the PREE is somewhat more elbow specific than the pmASES. A disadvantage of the PREE was the relatively low test–retest reliability (see ICCs in Table 1) of the German version when compared to the pmASES, especially in pain. This affects the precision of measurement on the individual level. Normative data obtained by population surveys are lacking for both tools.

If examiner-based data are desired, the cmASES is the only known option for a standardized assessment. Its responsiveness was significantly lower than that of the PREE and the pmASES, but slightly (although not significantly) higher than that of the DASH. The cmASES motion subscale reflects most closely the construct of function. It is composed of 6 range of motion items (flexion, extension, flex–extension arc, pronation, supination, and prosupination arc) and is easy to measure with high interexaminer reliability (a presumption only, because no empirical data are available). In contrast, the cmASES symptoms subscale consists of 18 items that are suspected of causing important clinimetric problems (9, 15). For example, the “cubital tunnel stretch test” is expected to vary widely because examination is difficult and, consequently, it is highly examiner dependent. This sign and the sign “impingement pain in flexion/extension” as well as various tenderness signs are impossible to assess in fresh elbow fractures or painful terminal osteoarthritis. Many other signs, e.g., grip strength, depend on involvement of the wrist and/or shoulder adjoining the elbow, as is typical in RA, and are not elbow specific. The cmASES symptoms score was affected by high ceiling effects in the postoperative assessment (9). In summary, the cmASES can only be recommended with substantial reservations regarding its validity.

In systemic diseases such as RA or chronic pain due to polyosteoarthritis, the arm-specific DASH and the holistic SF-36 should complete the set of measurement instruments. Both are best tested on psychometric properties and, furthermore, data on many other conditions exist as well as normative population data that allow comparison with our own results as already outlined for shoulder conditions (10). Consistent with comparisons in shoulder arthroplasty, the DASH function subscale showed high discrimination ability in the sensitivity analysis using the ROC curve over the transition item, which may be based on the relatively high number of 24 items (10). The gap between the relatively high responsiveness of the DASH symptoms subscale versus the function subscale (ES 0.77 versus 0.39) may be caused by the small number of elbow-specific function items of the DASH. This is consistent with the statement that “the 24 function items of the DASH thus seemed not to be particularly sensitive to elbow-specific disabilities” in the cross-sectional study (9).

The SF-36 physical functioning subscale asks about ambulation in 6 of 10 items, and only 2 items ask about arm-related function (bathing/dressing, lifting or carrying groceries). This unspecific assessment was not able to detect, on average, functional improvement after elbow arthroplasty (ES 0.07), whereas the 2 pain items measured a moderate mean effect (ES 0.58). In addition, the mental and psychosocial SF-36 scales are important in chronic pain syndrome, which may also include elbow pain (29). Our sample was not affected by substantial problems in these domains because the pre- and postoperative MCS mean scores were 54.5 and 56.7, respectively (US population norm 50.0, where higher scores indicate better mental health). In contrast to the SF-36 subscales, the PCS and MCS scales need cautious interpretation due to their construct as a linear combination with positive (PCS physical scales) and negative (PCS psychosocial scales) coefficients (10). A patient with identical baseline and followup scores on the 4 physical scales but improvement on (some of) the 4 mental/psychosocial scales will show deterioration on the PCS (and analogously vice versa in the MCS): a paradoxical result.

One weakness of this study is the limited generalizability of the findings because of the specific setting and because other comparative studies are lacking. The results are valid for osteoarthritis and RA treated by primary total elbow arthroplasty. They may be different for epicondylitis humeri, for example. Second, sensitivity analysis was only able to test between the category much better and all other categories combined due to the small numbers of patients in the other categories. A strength of the study is the fact that it is the first longitudinal comparison of psychometric properties of 5 of the most often used measurement instruments for elbow conditions. The sample size is relatively high, given that elbow arthroplasties are rarely performed, and it is high enough to detect a difference of 0.30 between 2 ES, which is an average clinically meaningful difference.

In conclusion, only the PREE can be recommended as a minimal, short, but elbow joint–specific assessment set. The pmASES is a valid alternative to the PREE. The DASH should be added for polyarticular diseases such as RA and the SF-36 for chronic pain conditions. The cmASES will provide standardized clinical measurement with some weaknesses.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Angst had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Angst, Goldhahn, Aeschlimann, Simmen, Schwyzer.

Acquisition of data. Angst, Goldhahn, Drerup, Kolling, Simmen, Schwyzer.

Analysis and interpretation of data. Angst, Goldhahn, Drerup, Simmen.


The authors thank all of the patients for their participation in the study, Franziska Kohler for continuing data acquisition, and Joy Buchanan for her English editing.