Summary. We assessed the reliability and construct validity of the Compatible MRI scale for evaluation of elbows, and compared the diagnostic performance of MRI and radiographs for assessment of these joints. Twenty-nine MR examinations of elbows from 27 boys with haemophilia A or B [age range, 5–17 years (mean, 11.5 years)] were independently read by four blinded radiologists on two occasions. Three centres participated in the study (Toronto, n = 24 examinations; Atlanta, n = 3; Cuiaba, n = 2). The number of previous joint bleeds and the severity of haemophilia served as reference standard measures. The inter-reader reliability of MRI scores was substantial (ICC = 0.73) for the additive (A)-scale and excellent (ICC = 0.83) for the progressive (P)-scale. The intra-reader reliability was excellent for both P-scores (ICC = 0.91) and A-scores (ICC = 0.93). The total P- and A-scores correlated poorly (r = 0.36) or moderately (r = 0.54), but positively, with clinical-laboratory measurements. The total MRI scores demonstrated high accuracy for discrimination of presence or absence of arthropathy [P-scale, area-under-the-curve (AUC) = 0.94 ± 0.05; A-scale, AUC = 0.89 ± 0.06], as did the soft tissue scores of both scales (P-scale, AUC = 0.90 ± 0.06; A-scale, AUC = 0.86 ± 0.06). For discrimination of severe disease, both P-MRI scores (AUC = 0.83 ± 0.09) and A-MRI scores (AUC = 0.87 ± 0.09) demonstrated high accuracy, but their ability to discriminate mild disease was non-diagnostic. Similar results were noted for radiographic scales. In conclusion, both MRI scales demonstrated substantial to excellent reliability, and high accuracy for discrimination of presence/absence of arthropathy and of severe/non-severe disease, but poor to moderate convergent validity for total scores and non-diagnostic discriminant validity for mild/non-mild disease. Compared with radiographic scores, MRI scales did not perform better for discrimination of severity of arthropathy.