A huge number of risk assessment tools have been developed. Far from all have been validated in external studies, many lack methodologically sound and transparent evidence, and few are integrated in national guidelines. Therefore, we performed a systematic review to provide an overview of existing valid and reliable risk assessment tools for prediction of osteoporotic fractures. Additionally, we aimed to determine if the performance of each tool was sufficient for practical use, and last, to examine whether the complexity of the tools influenced their discriminative power. We searched PubMed, Embase, and Cochrane databases for papers and evaluated these with respect to methodological quality using the Quality Assessment Tool for Diagnostic Accuracy Studies (QUADAS) checklist. A total of 48 tools were identified; 20 had been externally validated, but only six tools had been tested more than once in a population-based setting with acceptable methodological quality. None of the tools performed consistently better than the others, and simple tools (i.e., the Osteoporosis Self-assessment Tool [OST], Osteoporosis Risk Assessment Instrument [ORAI], and Garvan Fracture Risk Calculator [Garvan]) often did as well as or better than more complex tools (i.e., Simple Calculated Risk Estimation Score [SCORE], WHO Fracture Risk Assessment Tool [FRAX], and Qfracture). No studies determined the effectiveness of tools in selecting patients for therapy and thus improving fracture outcomes. High-quality studies in randomized design with population-based cohorts with different case mixes are needed.
Osteoporosis is defined as a systemic skeletal disease characterized by decreased bone strength leading to increased risk of fracture. The disease can cause significant physical disability and is associated with increased mortality. Osteoporotic fractures are a major and increasing cause of morbidity and pose a considerable burden to health services. The number of fractures in the elderly and the associated economic burden will continue to rise due to the aging of the world's population.
Osteoporosis is asymptomatic until fracture occurs but may be diagnosed using dual-energy X-ray absorptiometry (DXA). Most countries, including Denmark, have adopted a case-finding strategy whereby persons with one or more risk factors for osteoporosis may be referred for a DXA scan. This strategy, however, does not perform well, because osteoporosis remains underdiagnosed and undertreated in Denmark and elsewhere.[5, 6] Danish studies have shown that many resources are used to examine women with low risk of fracture, whereas only a few high-risk patients are referred; thus, 10% of women above the age of 40 years without risk factors and only 36% of women with three or more risk factors had received a DXA scan. Population screening could remedy this and, for instance, the U.S. Preventive Services Task Force (USPSTF) and the National Osteoporosis Foundation (NOF) are recommending that women aged 65 years and older be routinely screened for osteoporosis. We are unaware of studies investigating the performance of this strategy. At present, there is no universally accepted policy for population screening in Europe to identify patients with osteoporosis or those at high risk of fracture.
Numerous risk factors for osteoporosis and fractures have been identified and several tools have been developed to integrate risk factors into a single estimate of fracture risk for individuals. Recently developed prediction tools, such as the WHO Fracture Risk Assessment Tool (FRAX) algorithm, Qfracture algorithm, and Garvan Fracture Risk Calculator (Garvan),[11, 12] are aimed at assisting clinicians in the management of their patients through the calculation of the patient's 5-year or 10-year risk of fracture based on a combination of known risk factors. Many other tools exist and differ according to the type and number of risk factors included. Common to all these tools is the ability to identify women at increased risk of osteoporotic fracture and to stratify them into risk categories for osteoporosis or fracture. Several studies[13-17] have compared various tools for their ability to identify women at highest risk of fracture. Most of these studies concluded that the simpler tools perform as well as the more complex tools.
Targeting individuals with increased risk of osteoporotic fracture is an important challenge in the field of osteoporosis. Risk assessment tools may contribute to health care decision-making by identifying which patients would benefit most from DXA scanning or treatment.
Except for the FRAX algorithm, which has been incorporated into several national guidelines, these tools have not yet found broad acceptance; however, it is unknown why some prediction tools are in common use whereas others are not. Recent reviews of clinical prediction models in other areas such as cancer,[19, 20] diabetes, and traumatic brain injury have consistently highlighted design problems, methodological weaknesses, and deficiencies in reporting as reasons for the lack of uptake. These same reasons could apply to the osteoporosis risk assessment tools.
Therefore, the aim of this systematic review was to provide an overview of existing valid and reliable risk assessment tools for prediction of osteoporotic fracture. We investigated which tools had been validated in a population-based setting in studies with properly documented methodology. We also aimed to determine if the performance was sufficient for practical use, and last, to examine whether the complexity of the tools influenced their discriminative power.
We hypothesized that complex tools would have a better predictive accuracy than simple tools to identify women at increased risk of osteoporotic fracture.
We followed the current analytical methods and standards established by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) group for systematic reviews and meta-analyses.
We included studies of externally validated tools that were developed to identify women with an increased risk of osteoporotic fracture. The studies should have reported performance characteristics of risk assessment tools that had fracture or bone mineral density (BMD) measured by DXA as the outcome. Moreover, studies had to provide a description of participant recruitment and the methods used for selecting the variables included in the tool.
The Population, Intervention, Comparison, and Outcome (PICO) format was used to define the inclusion criteria. The population was defined as women aged 40 years and over. The “intervention” was a risk assessment tool for the identification of women with increased risk of osteoporotic fracture. The comparison component was fracture risk and the main outcome should be osteoporotic fracture or BMD (measured by DXA).
Risk assessment tools were included if they were derived from an initial population and then validated in a separate population in a new or different setting from the initial one.
Studies were excluded if they focused on secondary osteoporosis or targeted specific patient groups being treated for osteoporosis or related conditions.
We searched for relevant papers in the PubMed, Embase, and Cochrane databases. The following search terms were used: osteoporosis, osteoporotic fractures, risk assessment, risk factors, comparison, prediction, screening, and tools. Synonyms were searched for the keyword “tool”; i.e., algorithm(s), model(s), instrument(s) and questionnaire(s).
The search string used in PubMed was as follows: ((((tool OR tools OR algorithm OR algorithms OR questionnaire OR questionnaires OR models OR instruments OR instrument)) AND (mass screening OR screening OR comparison OR comparisons OR prediction OR predictions OR predictive)) AND (risk factors OR risk assessment)) AND (Osteoporosis, Postmenopausal OR Osteoporotic Fractures) AND (Humans[Mesh] AND English[lang] AND Female[MeSH Terms]).
The final electronic search was undertaken on August 18, 2012. All citations were exported to Reference Manager 12. Titles and abstracts were first screened based on the inclusion criteria. If it was uncertain whether or not a study fulfilled the inclusion criteria, it was included in the second reviewing of full-text versions. Articles were limited to English language, full-text, and published in a peer-reviewed journal. Reviews were excluded.
Full-text papers were retrieved and reviewed for quality and relevance. We then used the papers already found to make a further search in Web of Science (final search was undertaken on October 26, 2012). We assumed that validation studies of an existing risk assessment tool would cite the primary study.
Data extraction, analysis, and reporting
Two independent reviewers (KHR and TFH) screened the full-text articles for inclusion in the final review. The level of agreement between the two reviewers was 98% (184/188 full-text articles). In the four cases of disagreement, studies were included in the review.
For each paper, details on study design, site and setting, baseline BMD, risk factors, statistical methods, data collection, number of participants, inclusion and exclusion criteria, outcome, follow-up period, and results were recorded.
To assess the methodological quality of the studies we applied the Quality Assessment Tool for Diagnostic Accuracy Studies (QUADAS) checklist as recommended by the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Some modification of the checklist was necessary, including the addition of new items specific to this review topic as suggested by the developers of QUADAS. The modified QUADAS checklists are provided online (Supporting Table S3 for tools to predict low BMD and Supporting Table S4 for tools to predict fractures). In papers reporting both a development and a validation cohort in the same paper, the methodological quality was only assessed on the development cohort. Any discrepancies between the reviewers' assessments were resolved through discussion.
We evaluated the ability of the tools to differentiate between individuals with low and normal BMD or their ability to predict fracture by comparing the area under the receiver operating characteristic (ROC) curve (AUC).
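As an illustration of the AUC used for these comparisons, the following minimal sketch computes it as the proportion of case/non-case pairs in which the case receives the higher tool score, with ties counted as one-half. The function name `auc` and the scores are hypothetical and are not taken from any of the reviewed studies.

```python
from itertools import product


def auc(scores_cases, scores_noncases):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen case (e.g., a woman with
    low BMD or an incident fracture) has a higher tool score than a randomly
    chosen non-case; tied scores contribute 0.5.
    """
    if not scores_cases or not scores_noncases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for case, noncase in product(scores_cases, scores_noncases):
        if case > noncase:
            wins += 1.0
        elif case == noncase:
            wins += 0.5
    return wins / (len(scores_cases) * len(scores_noncases))


# Hypothetical tool scores (higher = higher predicted risk).
low_bmd = [7, 9, 6, 8]
normal_bmd = [3, 6, 5, 2]
example_auc = auc(low_bmd, normal_bmd)
print(example_auc)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from non-cases.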
Finally, we evaluated tools validated more than once in a population-based setting and with a methodological quality above 60% (as assessed by QUADAS) to assess whether performance was sufficient for clinical use.
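The 60% quality criterion can be sketched as follows. This assumes that the quality score is the share of checklist items a study complied with, which is an interpretation rather than a definition given in the text; the function names and the example item counts are hypothetical.

```python
def quadas_fraction(items_met, items_total):
    """Share of QUADAS checklist items that a study complied with."""
    if items_total <= 0 or not 0 <= items_met <= items_total:
        raise ValueError("items_met must lie between 0 and items_total")
    return items_met / items_total


def passes_quality_threshold(items_met, items_total, threshold=0.60):
    """True if the study's quality score is strictly above the threshold."""
    return quadas_fraction(items_met, items_total) > threshold


# Hypothetical studies scored on a 19-item and a 13-item checklist.
print(passes_quality_threshold(12, 19))  # 12/19 is about 0.63
print(passes_quality_threshold(11, 19))  # 11/19 is about 0.58
print(passes_quality_threshold(8, 13))   # 8/13 is about 0.62
```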
A total of 991 papers were identified after removing duplicates. The screening of title and abstract excluded 803 papers, leaving 188 potentially relevant papers for full-text screening. We further excluded 100 papers due to irrelevant intervention (n = 62), irrelevant outcome (n = 34), irrelevant population (n = 3), or inability to retrieve full text (n = 1). A hand search (n = 7) and a search in Web of Science (n = 9) identified 16 additional papers that met the inclusion criteria. A further 28 papers were then excluded because the tool in the studies was developed for a single study and was either internally validated or not validated at all. Thus, 76 papers were included in the initial review (Fig. 1).
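The selection flow above can be checked arithmetically. The sketch below uses only the counts reported in this paragraph; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
identified = 991
excluded_on_title_abstract = 803
full_text_screened = identified - excluded_on_title_abstract

excluded_on_full_text = {
    "irrelevant intervention": 62,
    "irrelevant outcome": 34,
    "irrelevant population": 3,
    "full text not retrievable": 1,
}
added_later = {"hand search": 7, "Web of Science": 9}
excluded_not_externally_validated = 28

included = (
    full_text_screened
    - sum(excluded_on_full_text.values())
    + sum(added_later.values())
    - excluded_not_externally_validated
)

print(full_text_screened)  # papers read in full text
print(included)            # papers in the initial review
```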
Detailed descriptions of the 76 included papers are provided in Supporting Tables S1 and S2. In all, 20 tools were externally validated; of these, eight tools (counting Osteoporosis Self-Assessment Tool [OST] and Osteoporosis Screening Tool for Asians [OSTA] together) were developed to predict low BMD (Table 1) and 12 tools to predict fractures (Table 2). Five of the tools (Age Body Size No Estrogen [ABONE], Body Weight Criterion [BWC], OST, Osteoporosis Risk Assessment Instrument [ORAI], and Simple Calculated Risk Estimation Score [SCORE]) developed to predict low BMD were also validated regarding their ability to predict fractures.[28-30]
|Tool (reference)||Validation studies n (references)||Risk factors in the tool||Enrolled women total n (mean); rangea||Age (years) mean (range)a||AUC (range)a|
|ABONE||4[29, 68, 72, 78, 80]||Age, weight, estrogen use||4346 (1086); 135–2365||63.6 (59.4–68.4)||0.67–0.72|
|BWC||8[68, 70, 72, 73, 76, 78, 80, 86]||Weight||5088 (636); 135–2365||61 (54.2–66.4)||0.13–0.79|
|NOF||3[73, 77, 80]||Age, weight, previous fracture, parental fracture, smoking||3241 (1080); 351–2365||63.9 (57.3–68)||0.60–0.70|
|OPERA||1||Age, weight, previous fracture, glucocorticoid use, early menopause||1522||63.1||0.81|
|ORAI||16[68-81, 86, 90]||Age, weight, current estrogen use||57,625 (3602); 135–32,513||60.8 (50.5–68.4)||0.32–0.84|
|OSIRIS||7[70, 71, 76, 90, 92-94]||Age, weight, previous fracture, current estrogen use||6840 (977); 207–4035||59.8 (54.2–62.7)||0.63–0.80|
|OST||13[30, 68-71, 73-76, 78, 79, 86, 90]||Age, weight||62,825 (4833); 207–32,513||58.5 (50.5–62.4)||0.32–0.82|
|OSTA||7[68, 72, 96-100]||Age, weight||5937 (848); 135–1597||61.5 (59.1–68.4)||0.65–0.85|
|SCORE||19[69, 71, 72, 74-81, 90, 102-108]||Age, weight, previous fracture, estrogen use, RA, race||61,314 (3227); 117–32,513||62.2 (50.5–72.5)||0.65–0.87|
|Tool (references)||Validation studies n (references)||Risk factors in the tool n (factors)||Fracture n (%)a||Fracture measurea||Follow-up (years) mean/rangea||Enrolled womena||Age (years) mean (range)a||AUC (range)a|
|ABONE||1||3 (age, weight, estrogen use)||Total fx: 70 (15%)||Self-reported||No||469||69||0.63|
|BWC||1||1 (weight)||Total fx: 70 (15%)||Self-reported||No||469||69||0.60|
|EPESE||1||8 (age, BMI, race, previous stroke, ADL or vision impairment, antiepileptic drug use, ≤1 Rosow-Breslau impairment)||Total fx: 276 (13%); Hip fx: 106 (5%)||Self-reported||6–10||2124||74.5||Any fx: 0.57–0.75|
|Ettinger and colleagues||1||6 (age, weight, prior fx, current smoking, hip fx in mother or sister, BMD)||NS||NS||3.7||NS||No mean (45–79)||No AUC|
|FRACTURE index (SOF)||1||7 (age, weight, prior fx, maternal hip fx, smoking, use of arms to stand up from a chair, BMD)||Hip fx: 261 (4%)||Radiographic reports for all reported fx||4||6679||80.5||No AUC|
|FRAMO||1||4 (age, weight, prior fx, impaired rise-up ability)||Fragility fx: 14 (5%); Hip fx: 7 (2.5%)||Radiographic film reports||2||285||79||No AUC|
|FRAX||18[9, 13-17, 32-35, 46, 49, 65-67, 112-114]||11 (age, BMI, prior fx, family history of fx, glucocorticoids, smoking, alcohol use, secondary OP, RA, sex, BMD)||OP fx: 29,603 (8%)||NS (2), self-reported (3), radiographs (7), self-reported and confirmed (6)||2–13.4||391,334||63.3 (54.4–74.0)||MOP fx: 0.54–0.78; Hip fx: 0.65–0.81|
|FRISC||2[14, 31]||8 (age, weight, prior fx, menopausal, secondary OP, back pain, dementia, BMD)||Total fx: 250 (21%)||Vertebral fx confirmed by radiographs||5.1–5.3||1165||61.4 (59.5–63.3)||Vertebral fx: 0.70; Vertebral and long bone fx: 0.69; Any Fx: 0.72|
|Garvan[11, 12]||4[13, 16, 17, 51]||5 (age, prior fx, falls, BMD, sex)||OP fx: 1,307 (5%)||Self-reported (2), radiographs (1), self-reported and confirmed (1)||1.7–8.8||25,108||70.8 (68–74)||0.76–0.84|
|OC||1||5 (age, prior fx, sex, systemic corticosteroid, BMD)||1.6/3.8/8.4 (% in low, moderate, and high risk group)||Verified||3.1||16,205||65||No AUC|
|ORAI||1||3 (age, weight, current estrogen use)||Total fx: 70 (15%)||Self-reported||No||469||69||0.65|
|OST||1||2 (age, weight)||Total fx: 225 (3%)||Confirmed in health service records||3.3||8254||52.7||0.56|
|Qfracture||3[10, 15, 64]||18 (age, sex, BMI, smoking, alcohol, diabetes type 2, parental history of hip fx/OP, falls, asthma, cardiovascular disease, chronic liver disease, RA, gastrointestinal malabsorption, use of tricyclic antidepressants, HRT or corticosteroids, endocrine problems, menopausal symptoms)||OP fx: 21,253 (1.2%); Hip fx: 14,589 (1%)||Recorded on the GP computer records||5.98–6.8||1,760,719||57.8 (48–68)||OP fx: 0.67–0.82; Hip fx: 0.64–0.89|
|Qfracture updated||1||31 (the 18 risk factors from the 2009 Qfracture plus 13 new: race, previous fx, chronic obstructive pulmonary disease, antidepressants, epilepsy, anticonvulsant use, dementia, Parkinson's, any cancer, lupus erythematosus, chronic renal disease, type 1 diabetes, care or nursing home residence)||OP fx: 21,677 (3%); Hip fx: 7089 (1%)||Recorded on the GP computer records||7.4||804,563||50||OP fx: 0.79; Hip fx: 0.89|
|SCORE||1||6 (age, weight, previous fx, estrogen use, RA, race)||116 (20%)||Self-reported||2.8||576||69.3||No AUC|
|SOF||2[28, 115]||14 (age, weight loss, height, any fx, maternal history of hip fx, self-rated health, physical inactivity, benzodiazepine use, anticonvulsant drug use, pulse >80 beats/min, caffeine, unable to rise from chair without help, previous hyperthyroidism, BMD)||Total fx: 263 (7.3%)||Self-reported (1), radiographic reports (1)||2.8–5||3586||65.5 (60–84)||No AUC|
|Van Staa and colleagues||1||9 (age, race, BMI, history of stroke, cognitive impairments, ADL impairments, one or more Rosow-Breslau impairments, anti-epileptic drug use, sex)||Hip fx rates were higher than development study (1.8%)||Verified||5.6||32,728||≥50||No AUC|
|WHI||2[38, 117]||11 (age, height, weight, fx after age 55 years, race, self-reported health, physical activity, current smoking, parental hip fx, corticosteroid use, and hypoglycemic agent use)||Hip fx: 913 (1.1%)||Self-reported and then confirmed by medical records||5–8||81,485||61.9 (61–62.7)||Hip fx: 0.80–0.82|
Risk assessment tools predicting BMD
Eight tools predicting low BMD were externally validated in a total of 31 studies (Table 1). There were large variations in how many times each tool had been validated; e.g., the Osteoporosis Prescreening Risk Assessment (OPERA) tool was externally validated once, whereas the ORAI, SCORE, and OST/OSTA tools were externally validated in 16, 19, and 19 studies, respectively. A number of tools were validated in studies comparing several tools. All tools were developed between 1996 and 2005; BWC and SCORE were the first and OPERA the most recent.
The definition of low BMD in the nine tools varied slightly: SCORE and ORAI used T-scores ≤ −2.0 as the cutoff whereas OST/OSTA, BWC, Osteoporosis Index of Risk (OSIRIS), OPERA, National Osteoporosis Foundation (NOF), and ABONE used T-scores ≤ −2.5. The tools were primarily developed and validated for postmenopausal white/Caucasian women, except for six studies that validated the OSTA tool for Asian women and one study that validated OST, SCORE, ORAI, ABONE, and BWC for African-American women.
The age of women included in the 31 studies was 40 to 98 years with a mean age between 51 and 73 years. The women were youngest in the validation studies of OSIRIS and OST.
In total, 78,588 women (mean, 2535; range, 117–32,513) were enrolled in the 31 studies. The OST, SCORE, and ORAI tools were validated on the highest numbers of subjects. The tools included between one and six clinical risk factors in their algorithms (Tables 1 and 3). The simplest were BWC (including weight) and OST/OSTA (including age and weight). The study design was cross-sectional except in one retrospective cohort study. Most of the studies used regression analysis and reported sensitivity, specificity, and ROC analyses. The reported AUC estimates ranged from 0.13 to 0.87, with most between 0.60 and 0.80. Some tools had higher AUC estimates in selected studies, but none demonstrated high estimates in several studies and none performed consistently better than others.
|Risk factors||Tools predicting fractures||Tools predicting low BMD|
|EPESE||Ettinger||Fracture index||FRAMO||FRAX||FRISC||Garvan||OC||Qfracture||Qfracture updated||SOF||Van Staa||WHI||ABONE||BWC||OPERA||ORAI||OSIRIS||OST/OSTA||NOF||SCORE|
|Previous low-energy fx||X||X||X||X||X||X||X||X||X||X||X||X||X||X||X|
|Parental (hip) fx/family history of fx or OP||X||X||X||X||X||X||X||X|
|Secondary osteoporosis (as defined in FRAX)||X||X||X||X||X|
|Type 2 diabetes||X||X|
|History of stroke||X|
|Chronic obstructive pulmonary disease||X|
|Systemic lupus erythematosus||X|
|Chronic renal disease||X|
|Chronic disease with GP visit/hospitalization||X|
|Hypoglycemic agent use||X|
|Anticonvulsant drug use||X||X|
|Central nervous system medication||X|
|Care or nursing home residence||X|
|ADL or vision impairment||X|
|Pulse >80 beats/min||X|
|Use arms to stand up from a chair||X||X|
|Impaired rise-up ability||X|
Most studies (29/31) reported sensitivity and specificity at the cutoffs suggested by the developers. Sensitivity ranged from 50% to 100%, with most between 80% and 90%. Specificity ranged from 10% to 88%, with most around 50%. Several studies reported analyses with other cutoffs finding better results.
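The dependence of sensitivity and specificity on the chosen cutoff can be illustrated with a minimal sketch. It assumes a tool in which a higher score means a higher predicted risk and a score at or above the cutoff counts as a positive test; the scores, outcomes, and cutoffs are hypothetical and do not come from the reviewed studies.

```python
def sensitivity_specificity(scores, has_low_bmd, cutoff):
    """Sensitivity and specificity of a risk score at a given cutoff.

    A score >= cutoff counts as a positive test (assumption: higher score
    means higher risk; some tools use the opposite direction).
    """
    tp = fn = tn = fp = 0
    for score, diseased in zip(scores, has_low_bmd):
        positive = score >= cutoff
        if diseased and positive:
            tp += 1
        elif diseased:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one woman with and one without low BMD")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores and DXA results for ten women.
scores = [2, 5, 7, 9, 4, 8, 3, 6, 10, 1]
low_bmd = [False, True, True, True, False, True, False, False, True, False]

sens_6, spec_6 = sensitivity_specificity(scores, low_bmd, cutoff=6)
sens_4, spec_4 = sensitivity_specificity(scores, low_bmd, cutoff=4)
print(sens_6, spec_6)
print(sens_4, spec_4)
```

In this toy example, lowering the cutoff raises sensitivity at the expense of specificity, which is the trade-off behind the alternative cutoffs explored in several of the studies.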
All primary papers stated the final algorithm.
OPERA and OSIRIS were validated in selected populations whereas the other tools were validated in at least one general population setting.
Five tools developed to predict low BMD were validated in studies with fracture outcome.[28-30] In the study by Morin and colleagues, OST and BWC only had AUC values of about 0.55 in predicting fractures. Wei and Jackson concluded that ABONE, ORAI, and BWC only moderately correlated with clinical fractures. The SCORE tool had the highest age-adjusted fracture rates compared with two other screening strategies in the study by LaCroix and colleagues.
Methodological quality of the studies describing tools to predict low BMD
According to our assessment, the 39 studies developing or validating tools to predict low BMD complied with a mean of 11.6 items (range, 6–14) out of the possible 19 QUADAS items (our QUADAS scoring of the individual studies is reported in Supporting Table S3).
Only eight studies could be considered population-based (i.e., included unselected women from the general population) (Fig. 2). Only two studies accounted for uninterpretable test results and presented results for all subjects who were described as having been entered into the study. Furthermore, only seven studies accounted for the whole study population using, e.g., flowcharts. Most of the studies (n = 23) were relatively small, with study samples below 1000 subjects, but 16 studies had more than 100 events (women with low BMD). In 16 studies, data on the risk factors included in the tools were collected by interview rather than solely by self-report through questionnaires. Missing data were accounted for in 19 studies. The remainder of the studies either applied the tools without knowledge about some of the information or the procedure regarding missing values was unclear. Although no study was explicit about blinding, we scored five studies with “yes” because the data on risk factors were obtained from the general practitioner or medical reports before the patient was DXA scanned or else the questionnaire was sent to the subject before the DXA scan. For 14 studies, we scored “yes” to the item of whether the DXA scan was analyzed blind to the tool results, because the study population was referred to a DXA scan before entering the study. In 31 studies, the DXA was described in sufficient detail to permit its replication.
Risk assessment tools predicting fractures
Twelve tools and an updated version of Qfracture developed to predict fractures had been subject to external validation in 33 studies with fracture as an outcome (Table 2).
Six tools to predict fractures had been externally validated once, three tools (Study of Osteoporotic Fractures [SOF], Women's Health Initiative [WHI], and Fracture and Immobilization Score [FRISC]) twice, Qfracture three times, Garvan four times, and FRAX 18 times. Most of the tools were validated in studies in which the authors compared several tools or compared a tool with simpler models or a single risk factor.
The tools were developed between 1995 (SOF) and 2010 (FRISC) and were primarily developed and validated for postmenopausal white/Caucasian women. However, the FRISC was both derived and validated for Asian women, and FRAX was externally validated twice in an Asian population.[32, 33] Only one study had a population with a high proportion of African-American women (38%).
The age of the women included in the 33 studies ranged from 30 to 100 years (mean, 48–81 years). The women were youngest in the validation study of the updated Qfracture (mean, 50 years) and oldest in the study of the FRACTURE index (mean, 81 years).
A total of 3,105,136 women (mean, 94,095; range, 200–1,117,982) were enrolled in the 33 studies. Qfracture (including its updated version) and FRAX were validated on the most women (n = 1,760,719 for Qfracture; n = 804,563 for the updated version of Qfracture; and n = 391,334 for FRAX). The tools included between one and 31 risk factors in their algorithms (Tables 2 and 3). The most complex were FRAX (n = 11 risk factors), WHI (n = 14), SOF (n = 14), Qfracture (n = 18), and Qfracture updated version (n = 31). The tools for predicting fractures generally included more risk factors (mean, 8.4 risk factors) than the tools for predicting low BMD. All the fracture-predicting tools included age; thereafter, the four most frequently included risk factors in the final models were weight, prior fracture, BMD, and maternal/parental history of fracture. All of the algorithms were available in the literature except for the FRAX tool, which is the only algorithm still not in the public domain.
Incident fracture rates in the validation studies varied from 1% to 21%. Most studies used a prospective design, whereas five studies were retrospective.[15, 17, 34-36] The mean follow-up period varied between 1.7 and 13.4 years.
Most of the studies used ROC, regression (mostly Cox regression), and survival analysis. The reported AUC estimates ranged from 0.54 to 0.84 for major osteoporotic fractures and 0.64 to 0.89 for hip fractures, with most between 0.60 and 0.80. As with the tools predicting BMD, some fracture-predicting tools had a higher AUC in selected studies, but none demonstrated high estimates in several studies and none performed consistently better than the others.
The Osteoporosis Canada (OC), WHI, and the tool developed by Ettinger and colleagues were validated in subgroups of the population. The other tools were validated at least once in a general population setting.
Methodological quality of the studies describing tools to predict fractures
According to our assessment, the 37 studies of tools predicting fractures complied with a mean of eight items (range, 4–11) out of the possible 13 QUADAS items (our QUADAS scoring of the individual studies is reported in Supporting Table S4).
About one-half of the studies included subjects from the general population (Fig. 3). Only seven studies included a flowchart or a description of the whole study population and none of the studies clearly reported intermediate or uninterpretable test results or accounted for all subjects who were potentially included in the study.
In approximately one-half of the studies, the data on risk factors were obtained from interview, and only six studies accounted for missing data in the calculation of the tools.
Most of the studies included over 1000 subjects and had over 100 major osteoporotic fractures during follow-up. In 29 studies, fractures were verified by radiographic reports or medical records; in eight of these 29 studies, fractures were self-reported but then confirmed by radiographic reports or medical records afterward.
Only 15 studies used a follow-up time corresponding to the period for which the tool had been intended (5 or 10 years for all subjects included in the study, depending on the outcome period of the tools).
In one case, the tool did not have a final model, and for another tool the final model was not clear in the papers but could be found on a website.[10, 40] One study was not clear about which version of FRAX was used. The rest of the developmental studies described the final model for the tool, and the validation studies described the included tool or tools in detail.
We found that 26 studies were carried out in population-based settings with an acceptable methodological quality. In these studies, only six tools (ORAI, SCORE, OST, FRAX, Garvan, and Qfracture) were validated more than once with an acceptable sample size for events (“yes” to item 19 in QUADAS; 5, 4, 4, 4, 2, and 2 times, respectively, in 14 studies). The total number of women in these validation studies was 14,124 for ORAI, 12,653 for SCORE, 18,637 for OST, 18,452 for FRAX, 5322 for Garvan, and 1,760,135 for Qfracture (Table 4). Figures 4 and 5 show the ability of these tools to predict low BMD or fractures (i.e., AUC values) in relation to the number of risk factors included in the tool and the sample size of the validation studies.
|Author (reference)||Tool||Women (n)||AUC (low BMD)||AUC (MOP fracture)|
|Machado and Da Silva||OST||588||0.65|
|Gourlay and colleagues||OST||7779||0.72|
|Rud and colleagues||OST||2016||0.68|
|Morin and colleagues||OST||8254||0.77|
|Machado and Da Silva||ORAI||588||0.67|
|Gourlay and colleagues||ORAI||7779||0.70|
|Rud and colleagues||ORAI||2016||0.64|
|Cadarette and colleagues||ORAI||2365||0.79|
|Cadarette and colleagues||ORAI||1376||0.79|
|Gourlay and colleagues||SCORE||7779||0.71|
|Rud and colleagues||SCORE||2016||0.68|
|Cadarette and colleagues||SCORE||2365||0.80|
|Cadarette and colleagues||SCORE||493||0.71|
|Langsetmo and colleagues||Garvan||4152||0.69|
|Bolland and colleagues||Garvan||1170||0.63|
|Bolland and colleagues||FRAX||1170||0.62|
|Hillier and colleagues||FRAX||6252||0.62|
|Fraser and colleagues||FRAX||4778||0.69|
|Ensrud and colleagues||FRAX||6252||0.67|
|Collins and colleagues||Qfracture||1,117,982||0.82|
|Hippisley-Cox and Coupland||Qfracture||642,153||0.79|
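The per-tool totals and validation counts quoted in the text can be reproduced from the rows of Table 4. The sketch below copies the tool and sample-size columns from the table; the variable names are illustrative.

```python
from collections import defaultdict

# (tool, number of women) for each validation listed in Table 4.
table4_rows = [
    ("OST", 588), ("OST", 7779), ("OST", 2016), ("OST", 8254),
    ("ORAI", 588), ("ORAI", 7779), ("ORAI", 2016), ("ORAI", 2365), ("ORAI", 1376),
    ("SCORE", 7779), ("SCORE", 2016), ("SCORE", 2365), ("SCORE", 493),
    ("Garvan", 4152), ("Garvan", 1170),
    ("FRAX", 1170), ("FRAX", 6252), ("FRAX", 4778), ("FRAX", 6252),
    ("Qfracture", 1117982), ("Qfracture", 642153),
]

women_per_tool = defaultdict(int)
validations_per_tool = defaultdict(int)
for tool, women in table4_rows:
    women_per_tool[tool] += women
    validations_per_tool[tool] += 1

for tool in women_per_tool:
    print(tool, validations_per_tool[tool], women_per_tool[tool])
```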
AUC estimates ranged from 0.64 to 0.80 and from 0.62 to 0.82 in the validation studies of tools to predict low BMD and tools to predict fractures, respectively. The simplest of these tools were OST, ORAI, and Garvan, with 2, 3, and 5 risk factors included in the algorithm, respectively.
Of these six tools, only ORAI, Garvan, and Qfracture were also developed in studies with a population-based setting and applying proper methodology.
A total of 13 tools were compared and validated against each other in 31 studies (Table 5). Thus, 12 studies compared different tools with fracture outcome, 16 studies compared different tools with low BMD as outcome, and three studies compared tools with BMD outcome against fracture. The overall conclusion of these comparative studies was that simple tools with fewer risk factors or simpler models often did as well as or better than more complex tools with more risk factors.
Risk-assessment tools in clinical practice
We identified three studies evaluating the use of risk assessment tools in clinical practice in a randomized design.[28, 41, 42]
LaCroix and colleagues evaluated three screening strategies involving either BMD testing alone or evaluation by risk assessment tools followed by BMD testing if the tool indicated high risk. Of the 9268 women invited, 3167 participated. They were allocated to (1) DXA scan, (2) SCORE followed by DXA if SCORE was ≥7, or (3) SOF followed by DXA for ≥5 hip fracture risk factors. After 33 months of follow-up, the screening program was shown to have influenced total fracture rates, but there were no intergroup differences in osteoporosis treatment rates.
Dargent-Molina and colleagues (EPIDOS study) compared four different screening strategies among 5910 of the 7575 invited women: (1) DXA alone, (2) quantitative ultrasound (QUS) alone, (3) QUS followed by DXA among subjects with medium-low QUS, and (4) DXA among subjects <59 kg followed by clinical evaluation. After 3.7 ± 0.8 years of follow-up, strategies 3 and 4 had the same discriminant value as strategy 1 but required fewer than 50% of the BMD examinations. A combination of strategies 3 and 4 would allow an increased number of women to be identified as being at high risk.
Barr and colleagues used a combination of known risk factors for fractures and a QUS heel scan. Of the 5306 women invited, 2515 participated and were randomly assigned to screening or control groups. After 1 to 3 years of follow-up, the risk of fracture was reduced in the screening group.
In the last decades many different risk assessment tools have been developed, but only 20 of 48 identified risk assessment tools for prediction of osteoporotic fractures were externally validated; eight tools were developed to identify individuals at risk of low BMD and 12 tools were developed to identify individuals with an increased risk of fractures. Only six tools (OST, ORAI, SCORE, Garvan, FRAX, and Qfracture) were externally validated in a population-based setting with proper methodological quality. None of the tools performed consistently better than others when tested in external validation studies, and simple tools with fewer risk factors often did as well or better (i.e., OST, ORAI, Garvan) than more complex tools with more risk factors (i.e., SCORE, FRAX, Qfracture).
We did not identify studies that prospectively tested tools in a population-based setting with a randomized design or determined their effectiveness in selecting patients for therapy. On the other hand, three studies evaluated the use of risk assessment tools in clinical practice in a randomized design, but none of the studies demonstrated improved fracture outcome when using risk assessment tools in clinical practice to identify individuals for screening for osteoporosis or treatment.
Scoring the methodology by the QUADAS checklist, we found that most of the included studies had several shortcomings. Similar results were reported in previous reviews of clinical prediction tools regarding osteoporosis,[7, 43, 44] but also cancer,[19, 20] diabetes, and traumatic brain injury. Nevertheless, all studies in this review had participant selection criteria described; the tools were well described and the final model/algorithm was presented with few exceptions. The clinical risk factors included in the tools were possible to collect in clinical practice and in most studies the participants were adequately described. However, these attributes were overshadowed by issues such as poor study design, inappropriate study samples, and incomplete reporting, especially of subject withdrawals, data collection methods (self-report or assisted interview), treatment of missing data, and uninterpretable test results.
In many cases, inadequate reporting made it difficult for us to judge the quality of the studies, thus resulting in several items classified as unclear. Other authors reported similar experiences and noted their disappointment that many risk assessment tools have been developed and used without greater consideration of methodological shortcomings.[18, 43]
The development, validation, and performance assessment of prognostic tools are complex processes, and rely on a good understanding of statistical methods and their appropriate application. The lack of transparency, inadequate reporting, and inappropriate methodological choices may partly explain why so few risk assessment tools are used in clinical practice.[18, 19]
A single tool, i.e., FRAX, has been incorporated into several national guidelines; however, FRAX was scored low against QUADAS. Other authors have raised similar criticism regarding both the original development and validation cohorts and the insufficient information available for objective evaluation of FRAX.
Item 1 in the QUADAS checklist (concerning the representativeness of the sample) opens up the question of generalizability. Most validation studies of tools to predict low BMD were evaluated in selected populations rather than the general population. Even among the 26 studies evaluating tools to predict fractures that we characterized as population-based, few resulted in truly generalizable findings because of the high proportion of subject withdrawals and nonrespondents. In one of the reviewed studies, for instance, only 18% of the subjects from the affiliates of a large insurance company participated, even though they were randomly selected. Nonresponse at random is rare and participants and nonparticipants often differ, thus limiting the external validity of the studies. Some authors[11, 12, 16, 49-52] commented, for example, that the participants were healthier, had lower mortality rates, and were older than the general population. Thus, results can hardly be generalized from the studies with very low participation rates,[9, 11, 28, 46, 51] because the extent to which a cohort sample is representative of the total reference population depends on the completeness of the population frame available to the investigator.
Furthermore, the omission of uninterpretable test results and subject withdrawals from the analysis usually leads to an overestimation of the accuracy/differentiation of the tool.
Data collection and quality of data were also issues in several studies because it was uncertain whether the data were self-reported or collected with assistance, and whether the participants answered all questions or if there were some missing items. Few studies had complete data on the entire study population. Some of the authors handled missing data by multiple imputation, but most did not account for the missing data. Thus, the estimates of performance may be inaccurate and biased in the studies that did not account for missing data.
Sample sizes were generally—and appropriately—higher in studies validating tools to predict fractures than in studies of tools to predict low BMD. Thus 30 of 37 studies of tools to predict fractures had a sample size of at least 1000 women, compared to 16 of 39 studies of tools to predict low BMD. No firm guidelines exist on sample size requirements for development or validation studies.[18, 54]
In terms of whether the studies could be underpowered, it is worth emphasizing that the effective sample size is not driven by the number of subjects in the cohort, but the number of events (i.e., fractures or low BMD). Empirical simulations have found that at least 100 events are required to validate a prediction tool. Thus, 10 of the reviewed studies of tools to predict low BMD and eight of the studies of tools to predict fractures were underpowered.
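The events-based criterion can be illustrated with a minimal sketch. The function name and the example cohorts are hypothetical; only the threshold of 100 events comes from the text above.

```python
MIN_EVENTS = 100  # threshold cited in the text for validating a prediction tool


def is_underpowered(n_events: int, min_events: int = MIN_EVENTS) -> bool:
    """Return True when a validation study has fewer events than required."""
    return n_events < min_events


# Hypothetical cohorts: (label, number of subjects, number of events).
cohorts = [
    ("large cohort, rare outcome", 5000, 60),
    ("small cohort, common outcome", 800, 240),
]

# The number of subjects is deliberately ignored: only events count.
underpowered = [label for label, _subjects, events in cohorts
                if is_underpowered(events)]
print(underpowered)
```

In this illustration the larger cohort is the underpowered one, because the judgment rests on the number of fractures or low-BMD cases rather than on the number of participants.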
Assessing performance of a predictive tool
It is important that osteoporosis risk assessment tools are accurate and effective in determining those at risk of disease.[55-57] Ideally, they should include an appropriate number and range of risk factors and these should be easy to record in clinical practice.
In nearly all the studies, performance of the tools was evaluated using ROC analysis. The AUC estimates mostly ranged between 0.6 and 0.8, indicating that the tools are modest predictors of low BMD or fractures. In general, an AUC of at least 0.80 is considered necessary for a diagnostic test to be effective. None of the tools produced AUC estimates above 0.80 in more than one validation study.
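As a rough illustration of what an AUC estimate measures, the sketch below computes it as the proportion of case/non-case pairs in which the case has the higher score, with ties counted as one half. The scores are invented for illustration and do not come from any of the reviewed studies.

```python
def auc(scores_cases, scores_noncases):
    """Share of (case, non-case) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for case_score in scores_cases:
        for noncase_score in scores_noncases:
            if case_score > noncase_score:
                wins += 1.0
            elif case_score == noncase_score:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_noncases))


# Hypothetical risk scores for women with and without a fracture.
fracture = [7, 5, 4, 3]
no_fracture = [2, 4, 5, 1, 6]

value = auc(fracture, no_fracture)
meets_threshold = value >= 0.80  # the 0.80 level mentioned in the text
print(value, meets_threshold)
```

An AUC of 0.5 corresponds to ranking pairs no better than chance, which is why estimates between 0.6 and 0.8 are read as modest discrimination.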
Kanis and colleagues criticized the use of ROC analysis in this context. First, ROC analysis may lack sensitivity to additional variables. Second, it may be inappropriate to compare AUCs across studies. Third, ROC analysis is unable to determine the use of a clinical tool to identify risk categories for intervention. Although nearly all the studies assessed performance through ROC analysis, we are aware of the limitations in the use of ROC analysis and especially of the danger in comparing AUC between studies. Our conclusions, however, are mostly based on the number of external validation studies on the tool and the studies' methodological quality.
Further, it has been argued that calibration is also important when the accuracy of prediction tools is compared.[21, 60] Despite the importance of calibration, we and the authors of other systematic reviews have found that calibration is frequently not assessed.[19, 21]
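One simple way to look at calibration is to compare the number of observed events with the number expected from the predicted risks. The sketch below is an illustration under assumed data, not the method of any reviewed study; the function name and all values are hypothetical.

```python
def observed_expected_ratio(predicted_risks, outcomes):
    """Observed number of events divided by the sum of predicted risks."""
    expected = sum(predicted_risks)
    observed = sum(outcomes)
    if expected == 0:
        raise ValueError("expected number of events is zero")
    return observed / expected


# Hypothetical predicted probabilities and outcomes (1 = fracture occurred).
predicted = [0.05, 0.10, 0.20, 0.40, 0.25]
outcomes = [0, 0, 0, 1, 1]

ratio_original = observed_expected_ratio(predicted, outcomes)

# Doubling every prediction leaves the ranking of individuals (and therefore
# the AUC) unchanged, but it changes the calibration.
ratio_rescaled = observed_expected_ratio([2 * p for p in predicted], outcomes)
print(ratio_original, ratio_rescaled)
```

A ratio close to 1 suggests that predicted and observed event counts agree overall, whereas a ratio above 1 suggests that the tool underestimates risk. The rescaling step shows why discrimination alone cannot reveal such a discrepancy.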
Strengths and limitations
A major strength of this review is that we performed a comprehensive systematic search of the literature retrieved from several databases. We followed the analytical methods and standards established by the PRISMA group for systematic reviews and we assessed the methodological quality of the included studies. Two reviewers independently reviewed the papers against the inclusion/exclusion criteria and assessed the methodological quality.
In our experience, the use of the QUADAS checklist caused some problems, as reported by others.[26, 43] First, it is challenging to adapt the generic items used to assess the quality of diagnostic accuracy to fit the context in the review when assessing the quality of risk assessment tools. Second, it was difficult to score some of the items, especially items concerning uninterpretable test results and withdrawal, but also in relation to poor standard of reporting. These difficulties will also tend to reduce the strength of some dimensions of methodological quality.
A limitation of our study is that we only included studies published in the English language and that we did not include “gray literature” (i.e., abstracts and unpublished data) or clinical trials databases. A second potential limitation could be that we only included women. However, osteoporosis is most prevalent in women and screening programs are likely to be implemented only in women.
We assessed the performance of the tools in the general population and the conclusions rely on their performance in this setting. We are aware of the possibility that some risk factors could be of particular relevance in other settings, for example corticosteroid use and rheumatoid arthritis in a rheumatology practice, or aromatase inhibitors or androgen deprivation therapy in an oncology practice. Ultimately, the need to include such risk factors depends both on their effect size and on their prevalence in the population. We recommend that such tools be developed and validated in a setting that is representative of the population in which the tool will ultimately be applied; hence, the greatest societal impact will generally be provided by a tool that can be used successfully in the general population.
Risk factors for osteoporosis
Most tools included the same risk factors but the number of risk factors varied greatly. All but one of the 20 tools included age in the algorithm. Age is also strongly associated with fracture risk and has the strongest known association with BMD. The next most frequently included variables were body weight, prior fracture, and maternal history of fractures (Table 3). These three risk factors are also consistently associated with increased fracture risk.[1, 62, 63] Tools to predict fractures generally included more risk factors than did tools to predict low BMD.
Complex versus simple
The simplest tools included one or two risk factors, whereas the most complex tool included 31 risk factors. OST was one of the simplest tools to predict low BMD and Garvan the simplest to predict fractures.
The most complex tool is the updated Qfracture, with 31 clinical risk factors included in the algorithm.
As noted by other authors, the usefulness of a tool not only depends on its diagnostic accuracy but also on its ease of use. Risk factors should be unambiguous and easily determined—ideally through patient self-report. Although the developers of Qfracture expected that the included risk factors would be either known to the patient or could be collected in routine clinical practice, we question the applicability of such a complex tool. Patients will not always be able to reliably report past use of glucocorticoids or estrogen, for example. Tools that would be more amenable for self-report include OST, BWC, NOF, and Garvan.
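To make concrete how little input a simple tool requires, the sketch below computes an OST-style score from age and body weight. The formula, 0.2 × (weight in kg − age in years) truncated to an integer, is the form commonly given in the OST literature and is an assumption here, since it is not stated in this review. The cutoff is a caller-supplied parameter because published cutoffs differ between populations; the value used in the example is purely illustrative.

```python
def ost_score(weight_kg: float, age_years: float) -> int:
    """OST-style score: 0.2 * (weight - age), truncated toward zero.

    Assumed formula (not given in this review); dividing by 5 is equivalent
    to multiplying by 0.2 and avoids floating-point rounding surprises.
    """
    return int((weight_kg - age_years) / 5)


def refer_for_dxa(weight_kg: float, age_years: float, cutoff: int) -> bool:
    """Flag a woman for DXA when her score falls below the chosen cutoff."""
    return ost_score(weight_kg, age_years) < cutoff


# Hypothetical women; the cutoff of 2 is illustrative only.
score_a = ost_score(60, 70)        # (60 - 70) / 5 = -2.0
score_b = ost_score(55, 72)        # (55 - 72) / 5 = -3.4, truncated
flag_a = refer_for_dxa(60, 70, cutoff=2)
flag_c = refer_for_dxa(80, 55, cutoff=2)
print(score_a, score_b, flag_a, flag_c)
```

Both inputs can be self-reported, which is the practical advantage discussed above.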
Comparing the simple versus complex tools
Several of the studies compared tools against each other with respect to their ability to identify individuals at increased risk of osteoporotic fracture or low BMD, especially as to whether the more complex tools perform better in this regard than simpler tools.[13-17, 28, 32, 33, 46, 49, 65-81] Most of these studies reached the conclusion that simpler models or tools perform as well as the more complex ones (typically FRAX) in predicting fractures. Kanis and colleagues recently raised criticisms of numerous studies in the field and concluded that most of the studies had used inappropriate methodologies with overreliance on AUC and comparisons with models derived from the validation population itself. Thus, in the examination and evaluation of the tools included in this systematic review we only included externally validated tools and not “homegrown” models.
We found no difference in the predictive performance (in terms of AUC) of the simple and complex tools. Simpler tools with fewer risk factors would be easier to use in clinical practice by the doctor or the patient herself, and thus may be sufficient to identify women with increased risk of osteoporotic fractures who should be referred for a DXA scan.
It remains to be seen whether the same conclusions would be reached in well-designed population-based studies that randomly allocate participants either to screening with a risk assessment tool or to a control group, and that evaluate the cost-effectiveness of screening for osteoporosis.
We are aware of two ongoing population-based, randomized studies in which patients are selected for DXA using FRAX and with fractures as the primary outcome. The English study (SCOOP) included approximately 12,000 women aged 70 to 85 years. Participants were randomly allocated via primary care to receive either questionnaire-based screening (with BMD measured by DXA in selected participants) or standard care. The Danish study (ROSE) (Holmberg and colleagues, “The Risk-Stratified Osteoporosis Strategy Evaluation Study [ROSE]”; manuscript in preparation) currently includes 35,000 women aged 65 to 80 years randomly allocated to a screening group and a control group. Both groups received a questionnaire about risk factors for osteoporosis, and those in the screening group with an estimated risk of major osteoporotic fracture of 15% or more were provided with a DXA scan.
Ultimately, these studies may provide data on the effectiveness of FRAX in improving fracture outcome and cost-effectiveness in screening. Further, a U.S. study (POROS) is ongoing to determine the clinical utility of a three-step fracture risk screening program in a population of younger women (aged 50–65 years). Approximately 3000 women received a questionnaire and those with one or more risk factors for fractures (from a modification of the FRACTURE index and additional risk factors from FRAX) were randomized with a 4:1 allocation to either the intervention or nonintervention group. As in the ROSE study, results are not yet available.
We suggest that future research should also focus on validating the simpler tools such as OST, ORAI, and Garvan in similar settings to those described in the above paragraphs.
Many different risk assessment tools have been developed but only 20 tools were externally validated and only six tools (OST, ORAI, SCORE, Garvan, FRAX, and Qfracture) were validated in a population-based setting with a proper methodological quality. No tool performed consistently better than others and simple tools with fewer risk factors often did as well or better (i.e., OST, ORAI, and Garvan) than more complex tools with more risk factors (i.e., SCORE, FRAX, and Qfracture). No studies determined the effectiveness of tools in selecting patients for therapy and thus improving fracture outcomes.
High-quality studies of randomized design with population-based samples or cohorts with different case mixes are needed. We suggest that future research also focus on validating the simpler tools (for example, Garvan) for targeting individuals at increased risk of osteoporotic fractures.
All authors state that they have no conflicts of interest.
The funding sources (for KHR), The Region of Southern Denmark (JNR 08/8133 and JNR 11/5761) and Odense University Hospital, had no role in the study. We thank Claire Gudex for linguistic editing and proofreading of the manuscript.
Authors' roles: KHR: conducted the literature search, screened all papers, performed the analyses, and drafted the manuscript. THR: screened the full-text papers. All authors contributed to the writing of the manuscript.