Reproducibility, validity, and responsiveness of the hip outcome score in patients with end-stage hip osteoarthritis




Objective.

To evaluate reproducibility, validity, and responsiveness of the Hip Outcome Score (HOS) in patients with end-stage hip osteoarthritis.


Methods.

In a cohort of 157 consecutive patients (mean age 66 years; 79 women) undergoing total hip replacement, the HOS was tested for the following measurement properties: feasibility (percentage of evaluable questionnaires), reproducibility (intraclass correlation coefficient [ICC] and standard error of measurement [SEM]), construct validity (correlation with the Western Ontario and McMaster Universities Osteoarthritis Index [WOMAC], Oxford Hip Score [OHS], Short Form 12 health survey, and University of California, Los Angeles activity scale), internal consistency (Cronbach's alpha), factorial validity (factor analysis), floor and ceiling effects, and internal and external responsiveness at 6 months after surgery (standardized response mean and change score correlations).


Results.

Missing items occurred frequently. Five percent to 6% of the HOS activities of daily living (ADL) subscales and 20–32% of the sport subscales could not be scored. ICCs were 0.92 for both subscales. SEMs were 1.8 points (ADL subscale) and 2.3 points (sport subscale). Highest correlations were found with the OHS (r = 0.81 for ADL subscale and r = 0.58 for sport subscale) and the WOMAC physical function subscale (r = 0.83 for ADL subscale and r = 0.56 for sport subscale). Cronbach's alpha was 0.93 and 0.88 for the ADL and sport subscales, respectively. Neither unidimensionality of the subscales nor the 2-factor structure was supported by factor analysis. Both subscales showed good internal and external responsiveness.


Conclusion.

The HOS is reproducible and responsive when assessing patients with end-stage hip osteoarthritis in whom the items are relevant. However, based on the large proportion of missing data and the findings of the factor analysis, we cannot recommend this questionnaire for routine use in this target group.


Hip osteoarthritis (OA) is one of the most common causes of disabling pain and activity limitation in the general adult population and represents a major economic burden (1). Total hip arthroplasty (THA) is the treatment of choice for end-stage OA and results in a significantly improved quality of life (2). To assess and compare treatment outcomes, patient-oriented measures are commonly used (3). The disease-specific Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) questionnaire is the most widely used instrument to evaluate pain and function in patients with hip OA and has been in use since 1988 (4). However, in view of ongoing refinements in surgical techniques and implants and the trend for an increasing number of younger patients undergoing THA, expectations of surgical outcomes have changed over the past 20 years (5–7). This raises the question of whether the old “gold standard” questionnaires, such as the WOMAC, are still the most relevant and useful for today's THA patients.

Many different measures have been developed in recent years, including the Hip Outcome Score (HOS) (8–11). The HOS was developed and validated for patients with labral lesions and femoroacetabular impingement. It comprises 2 subscales: 1 related to activities of daily living and 1 related to more demanding physical functioning and sports ability. Any questionnaire used to determine the functional status of patients should be relevant, reliable, valid, and responsive in the given target group (12). Considering the continuum of care provided for hip pathologies ranging from hip-preserving surgery, such as hip arthroscopy, to revision hip arthroplasty, it would be desirable to have a questionnaire that possesses the necessary measurement properties in a broad range of patient groups. This would be advantageous when assessing and comparing outcomes between different patient populations and treatment regimens, and would also reduce the administrative burden for a given center. We therefore questioned whether the HOS would be valuable for patients with advanced hip OA and determined the questionnaire's reproducibility, validity, and responsiveness in this target group.

Significance & Innovations

  • First-time comprehensive performance testing of a modern hip-specific questionnaire (Hip Outcome Score [HOS]) in patients with end-stage osteoarthritis (OA).

  • The questionnaire was reproducible and responsive in patients who were able to complete the required number of items.

  • Neither unidimensionality of the 2 subscales nor the 2-factor structure of the HOS could be confirmed, and a large proportion of data were missing for the sport subscale.

  • Based on these findings, we do not consider the HOS to be suitable for routine use in patients with end-stage hip OA.

Materials and methods

Patient cohort and study design.

This prospective observational study (Schulthess Clinic, Zurich, Switzerland) included 157 consecutive patients (mean ± SD age 65.7 ± 1.8 years; n = 79 females) with end-stage hip OA (clear radiographic and clinical evidence of OA, and having exhausted conservative measures) scheduled for THA. Inclusion criteria were a good understanding of the German language and a willingness to participate. A set of questionnaires accompanied by an explanatory letter was sent to the patients 1 week before admission for surgery. Patients were asked to complete the questionnaire set and return it on the day of admission. A subset of 43 patients volunteered to complete the HOS a second time before the day of surgery for reproducibility assessment (i.e., test–retest 1 week apart). This sample size was calculated according to Walter et al (13), setting α = 0.05, β = 0.20, and assuming an acceptable intraclass correlation coefficient (ICC) of 0.80 with an expected ICC >0.90 (10, 11). Six months after THA, the HOS was completed again for responsiveness assessment. The study was approved by the local ethical committee and all patients provided written informed consent to participate.

Outcome tools.

The questionnaire set consisted of the German versions of the HOS, WOMAC, Oxford Hip Score (OHS), Short Form 12 (SF-12) health survey, and the University of California, Los Angeles (UCLA) activity scale. The HOS comprises 2 separately scored subscales: the activities of daily living (ADL) subscale with 19 items and the sport subscale with 9 items (8–11). Subscale scores can range from 0–100, with higher scores representing better function. Details on data sampling and scoring have been published previously (8–11). According to the developers, questionnaires can be scored if at least 14 items have been completed on the ADL subscale and 7 completed on the sport subscale (8).
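The missing-data rule described above can be illustrated with a minimal sketch. The thresholds (14 ADL items, 7 sport items) come from the text; the function name, the 0 to 4 item coding, and the percentage-of-attainable-points formula are assumptions made for illustration only, since the scoring details are given in the cited publications rather than here.

```python
from typing import Optional, Sequence

# Minimum number of answered items needed to score each subscale (from the text).
MIN_ANSWERED = {"adl": 14, "sport": 7}


def hos_subscale_score(responses: Sequence[Optional[int]], subscale: str) -> Optional[float]:
    """Return a 0-100 subscale score, or None if the subscale cannot be scored.

    `responses` holds one entry per item; None marks an item that was left
    blank or marked "not applicable".
    ASSUMPTION: answered items are coded 0 (worst) to 4 (best) and the score is
    the percentage of the maximum points attainable on the answered items.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < MIN_ANSWERED[subscale]:
        return None  # too few answered items under the missing-data rule
    return 100.0 * sum(answered) / (4 * len(answered))


# Hypothetical sport subscale with 3 of 9 items marked "not applicable".
print(hos_subscale_score([2, 3, None, None, None, 1, 2, 2, 3], "sport"))  # None
```

Under this rule, a subscale with many "not applicable" responses yields no score at all, which is how item-level missingness translates into unscorable questionnaires.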

Construct validity.

The association between the HOS and the other questionnaires was determined by Pearson's correlation coefficients. According to Cohen (14), correlation coefficients can be considered small (r = <0.30), moderate (r = 0.31–0.50), or large (r = >0.50). If 2 instruments are measuring the same/similar attributes, correlation coefficients should be between 0.4 and 0.8 (15). Based on these assumptions and as an indicator of convergent validity, we expected large correlations (r = >0.50) between the HOS ADL subscale and each of the following: WOMAC subscales, OHS, SF-12 physical component scale (PCS), and UCLA activity scale. Apart from the UCLA activity scale, none of the aforementioned instruments specifically inquires about sport activities, and we therefore expected only moderate correlations (r = >0.4) between the HOS sport subscale and the WOMAC subscales, the OHS, and the SF-12 PCS. The correlation between the HOS sport subscale and the UCLA activity scale was expected to be large (r = >0.50). We expected a low correlation (r = <0.30) between both the HOS subscales and the SF-12 mental component scale (divergent validity).

Reproducibility and internal consistency.

Reproducibility was determined using the ICC (two-way mixed, single-measure model: ICC3,1) and agreement using the standard error of measurement (SEM) (12). We expected ICC values ≥0.90 for each subscale (10, 11). Internal consistency was determined using Cronbach's alpha (12). We expected Cronbach's alpha values ≥0.85 for each subscale (12).
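As an illustration of these statistics, the following sketch computes ICC(3,1) from two-way ANOVA mean squares, a SEM, and Cronbach's alpha. It is not the SPSS procedure used in the study. The function names are hypothetical, and the SEM formula (SD × √(1 − ICC)) is an assumption, because the text does not state which SEM definition was applied.

```python
from math import sqrt
from statistics import mean, variance


def icc_3_1(test, retest):
    """ICC(3,1): two-way mixed model, single measure, consistency definition."""
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = mean(list(test) + list(retest))
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)            # between subjects
    ss_cols = n * sum((mean(c) - grand) ** 2 for c in (test, retest))  # between sessions
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def sem_from_icc(sd, icc):
    """ASSUMPTION: SEM = SD * sqrt(1 - ICC), one commonly used definition."""
    return sd * sqrt(1 - icc)


def cronbach_alpha(item_rows):
    """Cronbach's alpha; `item_rows` holds one list of item scores per respondent."""
    k = len(item_rows[0])
    item_vars = [variance([row[j] for row in item_rows]) for j in range(k)]
    total_var = variance([sum(row) for row in item_rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Because ICC(3,1) uses the consistency definition, a constant shift between test and retest does not lower the coefficient; only changes in the rank order or spacing of patients do.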

Factorial validity.

Structural validity of the HOS was analyzed using confirmatory factor analysis (CFA) with maximum likelihood estimation. Based on previous investigations and results (8, 9), we tested a 2-factor structure (ADL and sport) with the 2 factors being intercorrelated, each observed variable loading on 1 factor only, and with the errors of measurement associated with each observed variable uncorrelated. Model fit was evaluated according to the Comparative Fit Index (CFI; values >0.96 indicating good fit), Tucker-Lewis Index (TLI; values >0.96 indicating good fit), standardized root mean square residual (SRMR; values <0.08 indicating good fit), and root mean square error of approximation (RMSEA; values <0.07 indicating good fit) (16, 17). We tested the unidimensionality of each factor separately using exploratory factor analysis (EFA). EFA was conducted using the maximum likelihood extraction method with oblique rotation (direct oblimin), and the number of components was determined using the scree test on the sedimentation graph and Kaiser's criterion.

Floor and ceiling effects.

Floor and ceiling effects were calculated as the proportion of patients with worst and best values for the instrument. The “worst” value was defined as the actual worst end-anchor score (0 points) plus the minimum detectable change (MDC) and vice versa for the definition of the “best” value (18). We expected a proportion of floor and ceiling effects <15% for both subscales (12).
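The definition above can be sketched as follows. The function name is hypothetical, and treating scores exactly at the boundary as floor or ceiling cases is an assumption, since the text does not specify whether the limits are inclusive.

```python
def floor_ceiling_proportions(scores, mdc, worst=0.0, best=100.0):
    """Proportion of scores within one MDC of the worst and best end anchors.

    ASSUMPTION: boundaries are inclusive (score <= worst + MDC counts as floor,
    score >= best - MDC counts as ceiling).
    """
    n = len(scores)
    floor = sum(s <= worst + mdc for s in scores) / n
    ceiling = sum(s >= best - mdc for s in scores) / n
    return floor, ceiling


# Hypothetical subscale scores; setting mdc=0 counts only the exact end anchors.
example = [0, 3, 50, 96, 100]
print(floor_ceiling_proportions(example, mdc=5))  # (0.4, 0.4)
print(floor_ceiling_proportions(example, mdc=0))  # (0.2, 0.2)
```

Widening the end anchors by the MDC always yields proportions at least as large as counting only scores of exactly 0 or 100, because scores that cannot be distinguished from the extremes within measurement error are included.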


Responsiveness.

Internal responsiveness (6 months) was given by the standardized response mean (SRM) (19). We expected SRM values ≥1.0 for each subscale (10, 19). External responsiveness was calculated by correlating the change scores (pre- to postoperatively) of the HOS subscales with those of the reference questionnaires. We expected large correlations (r = >0.50).
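Both responsiveness measures reduce to simple operations on change scores. The sketch below uses hypothetical function names and made-up data; the SRM is the mean change divided by the standard deviation of the change, and external responsiveness is a Pearson correlation between the change scores of two instruments.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_response_mean(pre, post):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [b - a for a, b in zip(pre, post)]
    return mean(changes) / stdev(changes)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical pre- and postoperative scores for three patients.
pre, post = [40, 50, 60], [60, 80, 100]
print(standardized_response_mean(pre, post))  # 3.0
```

For external responsiveness, `pearson_r` would be applied to the change scores of an HOS subscale and the change scores of a reference questionnaire from the same patients.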

Statistical analysis.

All statistical analyses were performed using SPSS software, version 17, with the exception of CFA and EFA, which were performed with AMOS (SPSS). For all tests, P values less than 0.05 were considered to be statistically significant. All data are presented as the mean ± SD unless otherwise stated. Subscales were scored according to the developers' missing data rules (8); subscales for which no score could be calculated were excluded from the respective analyses.



Results

Feasibility.

Missing items occurred frequently (Table 1). Using published missing data rules (8–11), no score could be calculated for the ADL subscale in 8 (5%) and 10 (6%) questionnaires, and for the sport subscale in 32 (20%) and 50 (32%) questionnaires, at baseline and 6 months, respectively. All the other questionnaires had completion rates from 98–100%, both at baseline and followup.

Table 1. Missing items and items marked as not applicable for the ADL and sport subscales of the HOS*

HOS subscale                     Baseline, no. (%) of total items   6 months, no. (%) of total items
ADL total items (n = 2,983)
  ADL marked not applicable      42 (1)                             122 (4)
  ADL left blank                 67 (2)                             21 (1)
Sport total items (n = 1,413)
  Sport marked not applicable    208 (14)                           329 (23)
  Sport left blank               52 (4)                             38 (3)

* ADL = activities of daily living; HOS = Hip Outcome Score.


Reproducibility.

ICCs were 0.92 (95% confidence interval [95% CI] 0.86, 0.96) for the ADL subscale and 0.92 (95% CI 0.83, 0.96) for the sport subscale. The SEM was 1.8 points (95% CI −2.1, 2.0 points) for the ADL subscale and 2.3 points (95% CI −1.1, 5.0 points) for the sport subscale. This gave MDCs for the ADL and sport subscales of 5.1 and 6.4 points, respectively.
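The step from SEM to MDC can be checked with a short calculation. The text does not state the formula; the sketch assumes the commonly used 95% definition, MDC = 1.96 × √2 × SEM, and the function name is hypothetical.

```python
from math import sqrt


def mdc95(sem):
    """ASSUMPTION: MDC at the 95% level = 1.96 * sqrt(2) * SEM."""
    return 1.96 * sqrt(2) * sem


print(round(mdc95(1.8), 1))  # 5.0 (ADL subscale, from the rounded SEM)
print(round(mdc95(2.3), 1))  # 6.4 (sport subscale, from the rounded SEM)
```

Under this assumption, the rounded SEMs give about 5.0 and 6.4 points, close to the reported 5.1 and 6.4 points; the small difference for the ADL subscale may reflect rounding of the SEM before publication.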

Internal consistency and factorial validity.

Internal consistency was confirmed by Cronbach's alpha values of 0.93 (ADL subscale) and 0.88 (sport subscale). Removal of any given item did not improve the Cronbach's alpha values of the subscales.

Using preoperative HOS data, the CFA for the 2-factor structure showed a CFI of 0.76 and a TLI of 0.74, each of which was below the cutoff value for a good fit. The SRMR was 0.082 and the RMSEA was 0.10, each of which was above the maximum recommended values for an acceptable fit. Overall, the goodness-of-fit statistics indicated that the observed data showed a borderline acceptable fit (SRMR) or poor fit (CFI, TLI, and RMSEA) to the 2-factor structure (Table 2). In the original publication, items 3 (putting on socks and shoes) and 11 (sitting for 15 minutes) were excluded from the final questionnaire (8). We also performed the CFA omitting these 2 items, but it did not change the results substantially (0.78 for CFI, 0.76 for TLI, 0.078 for SRMR, and 0.10 for RMSEA). The 6-month followup data showed an acceptable fit to the hypothesized model according to the SRMR (0.071), but still not according to the other goodness-of-fit parameters (0.82 for CFI, 0.81 for TLI, and 0.12 for RMSEA). Excluding items 3 and 11 resulted in largely similar results.
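The comparison of the reported fit indices with the cutoffs given in the methods can be written out as a small sketch. The cutoffs and index values are taken from the text; the dictionary layout and function name are illustrative assumptions, and the strict inequalities mean that a "borderline" value such as an SRMR of 0.082 is simply flagged as not meeting the cutoff.

```python
# Cutoffs for good fit as stated in the methods: (direction, threshold).
CUTOFFS = {
    "CFI": (">", 0.96),
    "TLI": (">", 0.96),
    "SRMR": ("<", 0.08),
    "RMSEA": ("<", 0.07),
}


def evaluate_fit(indices):
    """Return, for each fit index, whether it meets the stated cutoff."""
    result = {}
    for name, value in indices.items():
        direction, threshold = CUTOFFS[name]
        result[name] = value > threshold if direction == ">" else value < threshold
    return result


baseline = {"CFI": 0.76, "TLI": 0.74, "SRMR": 0.082, "RMSEA": 0.10}
six_months = {"CFI": 0.82, "TLI": 0.81, "SRMR": 0.071, "RMSEA": 0.12}
print(evaluate_fit(baseline))    # no index meets its cutoff
print(evaluate_fit(six_months))  # only SRMR meets its cutoff
```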

Table 2. Confirmatory factor analysis with standardized loading matrix of the 28 HOS items at baseline and 6 months after surgery*

HOS item                                                                  Preoperatively   6-month followup
Factor 1 (ADL subscale)
  Standing for 15 minutes                                                 0.58             0.74
  Getting into/out of an average car                                      0.60             0.71
  Putting on socks and shoes                                              0.48             0.70
  Walking up steep hills                                                  0.73             0.82
  Walking down steep hills                                                0.71             0.80
  Going up 1 flight of stairs                                             0.76             0.85
  Going down 1 flight of stairs                                           0.69             0.83
  Going up and down curbs                                                 0.66             0.77
  Deep squatting                                                          0.55             0.76
  Getting into/out of a bath                                              0.62             0.71
  Sitting for 15 minutes                                                  0.52             0.67
  Walking initially                                                       0.44             0.71
  Walking for ∼10 minutes                                                 0.44             0.61
  Walking for ≥15 minutes                                                 0.69             0.81
  Twisting/pivoting on involved leg                                       0.62             0.78
  Rolling over in bed                                                     0.55             0.76
  Light to moderate work (standing/walking)                               0.73             0.89
  Heavy work (pushing/pulling, climbing, carrying)                        0.73             0.85
  Recreational activities                                                 0.64             0.81
Factor 2 (sport subscale)
  Running 1 mile                                                          0.67             0.83
  Jumping                                                                 0.69             0.80
  Swinging objects (e.g., golf club)                                      0.56             0.74
  Landing                                                                 0.39             0.79
  Starting and stopping quickly                                           0.73             0.88
  Cutting/lateral movements                                               0.59             0.85
  Low-impact activities (e.g., fast walking)                              0.76             0.83
  Ability to perform activity with your normal technique                  0.71             0.76
  Ability to participate in your desired sport as long as you would like  0.56             0.76
Intercorrelation between factors                                          0.75             0.76

* Each item loads on 1 factor only. Factor 1 corresponds to the activities of daily living subscale, and factor 2 corresponds to the sport subscale. HOS = Hip Outcome Score.

Using EFA on each scale separately revealed 4 factors, explaining 56% of variance (χ2 = 197.6, P < 0.001) in the preoperative ADL data. For the preoperative sport data, 2 factors were extracted, explaining 49% of variance (χ2 = 35.7, P < 0.01). In the 6-month data, 2 factors were extracted for ADL, explaining 65% of variance (χ2 = 390.8, P < 0.001), whereas unidimensionality was confirmed for the sport scale (1 factor explaining 66% of variance; χ2 = 169.5, P < 0.001).


Construct validity.

Correlation coefficients were determined using the 147 ADL subscales and 125 sport subscales for which a score could be calculated. Each of the HOS subscales correlated to the expected extent with the other instruments except for the relationship between the HOS sport and WOMAC stiffness subscales, which was lower than hypothesized (Table 3).

Table 3. Correlation coefficients between the HOS ADL and sport subscales and the scale and subscales of the other instruments used*

Questionnaire        HOS ADL, r (95% CI)    HOS sport, r (95% CI)
WOMAC pain           0.68 (0.58, 0.76)      0.46 (0.30, 0.59)
WOMAC stiffness      0.45 (0.31, 0.57)      0.29 (0.12, 0.44)
WOMAC function       0.83 (0.77, 0.86)      0.56 (0.43, 0.67)
WOMAC total score    0.81 (0.75, 0.86)      0.56 (0.43, 0.67)
OHS                  0.81 (0.75, 0.86)      0.58 (0.45, 0.69)
SF-12 PCS            0.61 (0.50, 0.70)      0.58 (0.45, 0.68)
SF-12 MCS            0.29 (0.13, 0.43)      0.29 (0.12, 0.44)
UCLA activity        0.61 (0.50, 0.70)      0.54 (0.40, 0.65)
HOS ADL              1.0                    0.73 (0.63, 0.80)

* All correlations are significant at P < 0.005. HOS = Hip Outcome Score; ADL = activities of daily living; 95% CI = 95% confidence interval; WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index; OHS = Oxford Hip Score; SF-12 = Short Form 12 health survey; PCS = physical component scale; MCS = mental component scale; UCLA = University of California, Los Angeles.
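The text does not say how the 95% CIs for the correlation coefficients were derived. One standard approach is the Fisher z transformation, sketched below with hypothetical function names; the classification thresholds follow the Cohen categories given in the methods, with the unassigned value r = 0.30 treated as moderate by assumption.

```python
from math import atanh, sqrt, tanh


def pearson_ci(r, n, z_crit=1.96):
    """Approximate CI for a Pearson r via the Fisher z transformation.

    ASSUMPTION: this is one standard method, not necessarily the one used
    to produce the intervals in Table 3.
    """
    z = atanh(r)
    se = 1 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


def cohen_label(r):
    """Classify |r| as small (<0.30), moderate (0.30-0.50), or large (>0.50)."""
    a = abs(r)
    if a < 0.30:
        return "small"
    if a <= 0.50:
        return "moderate"
    return "large"


# HOS ADL versus WOMAC pain: r = 0.68 with the 147 scorable ADL subscales.
lo, hi = pearson_ci(0.68, 147)
print(round(lo, 2), round(hi, 2), cohen_label(0.68))  # 0.58 0.76 large
```

With r = 0.68 and n = 147, this approach reproduces the interval of 0.58 to 0.76 reported for the HOS ADL subscale and WOMAC pain.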

Floor and ceiling effects.

At baseline, there were no floor or ceiling effects for the ADL subscale, but there was a floor effect of 20.5% for the sport subscale. After 6 months, ceiling effects were seen in 31.3% of ADL subscales (13.9% if the MDC was not taken into account) and 13.9% of sport subscales (10.1% if the MDC was not taken into account). Floor effects occurred in 4.4% and 2.2% after 6 months, respectively.

Internal and external responsiveness.

Both subscales were highly responsive with SRMs of 1.9 (ADL) and 1.5 (sport). High internal responsiveness was also found for the OHS, the SF-12 PCS, and the WOMAC and its subscales (Figure 1). External responsiveness was supported by large correlations between the change scores of the HOS subscales and those of the other questionnaires, ranging from r = 0.54 (HOS sport and SF-12 PCS) to r = 0.73 (HOS ADL and WOMAC total score).

Figure 1.

Internal responsiveness for all measures used, determined 6 months after surgery. HOS = Hip Outcome Score; ADL = activities of daily living; UCLA = University of California, Los Angeles; SF-12 = Short Form 12 health survey; MCS = mental component scale; PCS = physical component scale; OHS = Oxford Hip Score; WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index.


Discussion

Using classical test theory to examine the psychometric properties, this study investigated whether the HOS, a modern questionnaire recently developed and validated for patients with labral tears or femoroacetabular impingement (FAI), might also be valuable for patients with end-stage hip OA.

The present results offer evidence that both HOS subscales (ADL and sport) are highly reproducible, internally consistent, and responsive to change in those patients who are able to complete the requisite number of items. The ICC and Cronbach's alpha values found are in line with those reported by the developers and those found for the German version, which has been recently validated in FAI patients (10, 11). SRM values >1 indicate high responsiveness and, therefore, both subscales can be considered very responsive (19). When investigating patients with end-stage hip OA, the WOMAC and the OHS can be considered as disease-specific and joint-specific gold standards, respectively. The HOS ADL subscale was moderately to highly correlated with these measures and the coefficients of r = 0.45–0.83 compare well with values reported for the association between the OHS and the WOMAC itself (20). In contrast, the sport subscale showed weaker associations with the WOMAC subscales and the OHS (r = 0.29–0.58; Table 3). The ability to participate in sports-related activities and in ADLs represents different constructs, explaining the different behavior of the 2 subscales.

Most of the content of the OHS or WOMAC is also represented by the HOS ADL subscale items, but the items in the sport subscale are more distinctive. The lower correlations found between the scores of the HOS sport subscale and those of the OHS, WOMAC, and SF-12 PCS were as hypothesized. In contrast, the rather weak coefficient between the sport subscale and the UCLA activity scale (r = 0.54; Table 3) was unexpected, considering that both the sport subscale and the UCLA scale are related to physical activities. Furthermore, the sport subscale showed an unacceptably high proportion of items marked “not applicable” or left blank, with the consequence that almost one-third of the questionnaires could not be scored. All other questionnaires used (OHS, WOMAC, SF-12, UCLA) had completion rates between 98% and 100%. The issue of missing items in the sport subscale has been recognized before (8, 11), although in these studies, which involved patients in their 30s, scoring of the subscale was still possible in up to 99%. Although some patients undergoing THA are still very active and participate in sports, the content of the sport subscale appears to be too sports-oriented for the general end-stage hip OA cohort. Accordingly, items 1 to 6, referring to activities such as “running one mile,” “jumping,” or “cutting movements” were frequently marked “not applicable” or were left blank. Although missing responses always occur to a certain degree in the clinical setting, we consider a proportion of >30% to be unacceptable for cross-sectional study purposes or for routine assessment.

A further problem concerns the structural validity of the HOS when used in OA patients. The questionnaire was designed to measure 2 different constructs (ADL and sport) (8). We tested this proposed 2-factor structure using CFA, and, preoperatively, the data generally did not show an acceptable fit to the hypothesized model. At 6 months, the CFA showed an improvement in some of the goodness-of-fit indices, and the EFA confirmed unidimensionality for the sport subscale. However, for the ADL subscale, 4 and 2 factors were extracted in the baseline and 6-months data, respectively. We therefore speculate that the 2-factor HOS addresses hip-related complaints of a less severe nature than those seen in end-stage OA, such as those displayed in THA patients 6 months after surgery or in patients experiencing FAI, for whom this questionnaire was originally developed. This should be investigated further.

Although the HOS offers evidence of reproducibility and responsiveness in those patients with advanced hip OA for whom the items are relevant, the large proportion of missing data occurring particularly in the sport subscale makes this questionnaire unfeasible for routine use in this target group. Additionally, the 2-factor structure proposed for FAI patients could not be confirmed with certainty in OA patients, especially using preoperative data. It might be that in patients with less severe hip OA, i.e., not undergoing THA, the HOS would perform better; however, this remains to be investigated in future studies.


Author contributions

All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Naal had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Naal, Impellizzeri, von Eisenhart-Rothe, Mannion, Leunig.

Acquisition of data. Naal, Impellizzeri.

Analysis and interpretation of data. Naal, Impellizzeri, von Eisenhart-Rothe, Mannion, Leunig.