• instrument psychometrics;
  • research methods;
  • systematic review;
  • meta-analysis

To assess the inter-rater reliability, validity, and inter-instrument agreement of three quality rating instruments for observational studies. Inter-rater reliability, criterion validity, and inter-instrument agreement were assessed for three quality rating scales, the Downs and Black (D&B), the Newcastle–Ottawa Scale (NOS), and the Scottish Intercollegiate Guidelines Network (SIGN), using a sample of 23 observational studies of musculoskeletal health outcomes. Inter-rater reliability was moderate to good for the D&B (intraclass correlation coefficient [ICC] = 0.73; CI = 0.47–0.88) and the NOS (ICC = 0.52; CI = 0.14–0.76), and poor for the SIGN (κ = 0.09; CI = −0.22–0.40). The NOS was not statistically valid (p = 0.35), whereas the SIGN was statistically valid (p < 0.05), with medium to large effect sizes (f² = 0.29–0.47). Inter-instrument agreement estimates were κ = 0.34 (CI = 0.05–0.62) for D&B versus SIGN, κ = 0.26 (CI = 0.00–0.52) for SIGN versus NOS, and κ = 0.43 (CI = 0.09–0.78) for D&B versus NOS. Reliability and validity vary considerably across the quality rating scales used to assess observational studies in systematic reviews. Copyright © 2011 John Wiley & Sons, Ltd.
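The κ values reported above are chance-corrected agreement coefficients. As background only, here is a minimal sketch of Cohen's kappa for two raters; the function name and the sample ratings are illustrative and are not data from the study:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: proportion of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters grading 10 studies as "high"/"low" quality.
r1 = ["high", "high", "low", "low", "high", "low", "high", "high", "low", "low"]
r2 = ["high", "low", "low", "low", "high", "low", "high", "high", "high", "low"]
print(round(cohens_kappa(r1, r2), 2))  # 0.6
```

A κ near 0 (as for the SIGN) indicates agreement barely better than chance; values around 0.4–0.7 are conventionally read as moderate.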