Prediction Models to Estimate the Future Risk of Osteoarthritis in the General Population: A Systematic Review

Objective. To evaluate the performance and applicability of multivariable prediction models for osteoarthritis (OA).


INTRODUCTION
In recent years, risk prediction models have grown increasingly popular and are used to estimate the likelihood of the incidence of health-related outcomes. These models assist clinicians, complementing clinical decision-making and aiding the provision of information to patients. The models also contribute to public health, identifying future health care needs for the wider at-risk population (1). With older people constituting a growing proportion of the global population, disease burden is increasingly associated with noncommunicable diseases, for example cardiovascular disease, cancer, diabetes mellitus, and musculoskeletal disorders (2). Models predicting an individual's future risk of developing these conditions, permitting the modification of risk factors while patients remain free of disease, may contribute to their prevention. Preventing noncommunicable disease is a global priority; in 2015, the UN Sustainable Development Goals program outlined this prevention as a 15-year aim (3). While risk prediction models have been derived, validated, and implemented in clinical medicine and public health screening programs to predict the incidence of cardiovascular disease (4-6), their use in musculoskeletal disorders, such as osteoarthritis (OA), remains uncommon. The current systematic review was motivated by a desire to understand how close such a prospect may be in OA and what remaining limitations may need to be addressed.

The views expressed in this publication are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care. Drs. Thomas and Antcliff's work was supported by the NIHR (Development and Skills Enhancement awards NIHR 300818 and NIHR 301005, respectively).
OA is a chronic, painful condition that poses significant challenges to public health. In recent years, the disability-adjusted life-years associated with OA have risen markedly, estimated by the Global Burden of Disease project to have increased 34% between 1990 and 2015 (2). OA also has significant impacts on health care utilization, including surgical intervention (7,8), and total costs are estimated to represent up to 0.5% of high-income nations' gross domestic product (9). With health care-associated costs of OA predicted to rise (10,11), validated risk prediction models are needed to identify high-risk patients, permitting the communication of risk, stratification of care, and attempts at risk-informed prevention. Furthermore, models to predict disease incidence may provide insight into the classification and diagnosis of early OA, an area of increasing interest given the chronic, progressive nature of the condition and a growing focus on its prognosis, rather than solely its diagnosis (12,13).
Several studies have developed risk prediction models for OA outcomes, but to date, no systematic synthesis of this evidence has been published. Our systematic review identifies and critically synthesizes published studies deriving and validating multivariable risk prediction models for predicting individualized risk of OA incidence within general populations. Our aims were to summarize currently published models and to evaluate their applicability to large-scale use in clinical practice.

MATERIALS AND METHODS
Literature search. We conducted preliminary literature searches before finalizing our search strategy and specifying our research protocol following the Preferred Reporting Items for Systematic Review and Meta-analysis Protocols guidelines (14) (PROSPERO registration number: 4220446; approved November 2020). We searched PubMed, EMBASE, and Web of Science from inception to December 2021. Our searches used the modified Ingui filter, a generic filter for clinical prediction modified by Geersing et al for greater sensitivity, together with condition-specific terms relating to OA (15).

Eligibility criteria. We included any original study of a longitudinal design (including randomized controlled trials, cohort, and [nested] case-control studies) conducted in a general population sample that developed, compared, or validated a multivariable prediction model to predict an individual's risk of future OA incidence, irrespective of the time span for prediction. Articles presenting a clinical prediction score based on a model, as well as those evaluating prediction model impact, were also eligible. Eligible definitions of OA included symptomatic, radiographic, and symptomatic-radiographic. We excluded studies of hospital inpatients and other selective settings, prognostic models of patients with existing symptomatic or radiographic disease, and those using arthroplasty as the sole outcome. Cross-sectional studies, case reports/series, and conference abstracts were excluded. Titles and abstracts of studies were required in English; no language restriction was placed on articles eligible for full-text review.
Screening. Search results from the 3 databases were exported to the reference software Rayyan (16). TA undertook deduplication. Title screening was undertaken by a single reviewer (TA, MJT, DA, or GP), with a sample of decisions checked by a second reviewer. Authors then worked in pairs (TA and MJT, and DA and GP, blinded to each other's decisions) to screen abstracts, with conflicts resolved by a third reviewer (MJT or GP) not involved in the original decision. Upon full-text review, paired authors again worked independently, with a third reviewer resolving conflicts; reasons for exclusion were documented.

SIGNIFICANCE & INNOVATIONS
• Prediction models support earlier intervention in several diseases, but to date, not in osteoarthritis (OA).
• Our systematic review provides a comprehensive and critical synthesis of 31 published multivariable models derived by international research teams to predict the future individual risk of developing OA.
• We found generally good performance and evidence of increasing use of internal and external validation. However, a focus on knee OA, a reliance on a restricted number of cohort data sets, mainly from higher-income countries, and use of data sources that may be challenging to scale up in routine practice, may limit the applicability of many existing prediction models in general populations.
• The emergence of prediction modeling using routine health care data and improvements in analytic methods may help address some, but not all, of these limitations.
Data extraction. Data extraction of eligible studies was performed by paired reviewers (TA and MJT, DA and GP), using a shared Microsoft Excel (17) worksheet incorporating items for extraction as outlined by Cochrane (18) and the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies checklist (19). Data extracted included general study information, followed by model-specific data, for instance relating to study design, sample size, outcome definition, and included predictors. Last, we extracted performance metrics (overall fit, discrimination, and calibration). The data extraction template is available in Supplementary Appendix A, available on the Arthritis Care & Research website at http://onlinelibrary.wiley.com/doi/10.1002/acr.25035. Where studies presented final models for multiple eligible outcomes, information was extracted for each model. In those with more than 1 model per outcome, a final model was identified, based on the authors' own designation or inferred from their description of the model-building process and intended application.
Risk-of-bias assessment. For all included articles, we assessed the risk of bias across 4 predetermined domains (participants, predictors, outcome, and analysis) using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) (20). Risk of bias and applicability were scored as low, high, or unclear, with applicability appraised with respect to large-scale use in diverse general populations and in contexts where imaging may not be routinely available or recommended. Risk-of-bias assessments were conducted by 1 reviewer and checked by a second reviewer.
Narrative synthesis. Given the heterogeneous nature of study designs and model content, we conducted a narrative synthesis of results. Final models were grouped by the outcome joint of interest (index joint: knee, hip, hand, other, any) to reflect prior evidence of joint-specific risk factors and then synthesized by specific outcome definition (e.g., radiographic, symptomatic, symptomatic-radiographic). Performance measures were summarized, with calibration relating to the agreement between predicted and observed risk, and discrimination assessing whether patients with the outcome (at a given threshold) have higher risk prediction scores (21). The area under the receiver operating characteristic curve (AUC), as the most commonly reported standard metric for discrimination, was displayed (with 95% confidence intervals [95% CIs], where reported) in a forest plot for each model across derivation (i.e., apparent performance), internal validation (assessment of performance typically within a subset of the original data set, for instance by bootstrapping or cross-validation), and external validation (a different sample from that used in derivation) phases (22). An AUC of 0.5 suggests that the model demonstrates no discrimination; values of ≥0.7 and ≥0.8 were deemed good and excellent, respectively, accepting that such thresholds are quite arbitrary. Predictors included in final models were tabulated and color-coded by the mode of assessment. To reduce the volume of information presented, we grouped different or multiple measurements that had been used to capture the same construct; a spreadsheet of this tabulation with minimal grouping of predictors was retained as supplementary data. Throughout the synthesis, studies were generally presented in order of year of publication to help discern trends over time. Patients and members of the public were not involved in this systematic review.
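For readers less familiar with these two performance measures, the following is a minimal illustrative sketch in Python of how the AUC (discrimination) and a crude grouped calibration check can be computed from predicted risks and observed outcomes. The data are hypothetical and are not drawn from any study included in this review.

```python
# Illustrative sketch only: computing discrimination (AUC) and a simple
# grouped calibration check. Hypothetical data, not from any included study.

def auc(predicted, observed):
    """Probability that a randomly chosen case has a higher predicted
    risk than a randomly chosen non-case (ties count as 0.5)."""
    cases = [p for p, y in zip(predicted, observed) if y == 1]
    noncases = [p for p, y in zip(predicted, observed) if y == 0]
    wins = 0.0
    for c in cases:
        for n in noncases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(noncases))

def calibration_table(predicted, observed, groups=4):
    """Mean predicted risk vs. observed outcome rate within equal-sized
    risk groups -- a crude analogue of a calibration plot."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    size = len(order) // groups
    table = []
    for g in range(groups):
        idx = order[g * size:(g + 1) * size] if g < groups - 1 else order[g * size:]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        obs_rate = sum(observed[i] for i in idx) / len(idx)
        table.append((round(mean_pred, 3), round(obs_rate, 3)))
    return table

# Hypothetical predicted risks and observed incident OA (1 = developed OA)
pred = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.85]
obs  = [0,    0,    0,    1,    0,    1,    1,    1]

print("AUC:", auc(pred, obs))
print("Calibration (mean predicted vs observed):", calibration_table(pred, obs, groups=2))
```

An AUC of 0.5 corresponds to chance-level ranking of cases above non-cases; perfect separation gives 1.0. In the grouped calibration check, close agreement between the mean predicted risk and the observed rate within each group indicates good calibration.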

RESULTS
The search yielded 10,129 articles. After deduplication, followed by title and then abstract screening, 62 articles were taken forward for full-text screening, of which 20 were eligible for inclusion. A further study was added during data extraction through reference searches, and 5 were added upon rerunning of searches in December 2021. As a result, 26 studies were included in the final analysis (Figure 1).
General characteristics of included studies. We included 26 eligible studies reporting 31 final multivariable prediction models for incident OA, published between 2010 and 2022, using study populations from 15 unique data sources in the US (9 studies), The Netherlands (8 studies), UK (4 studies), Sweden (2 studies), Canada, China, and Norway (Table 1). The median prediction horizon was 8 years (range 2-41 years), the median number of participants/joints with the outcome of interest was 121.5 (range 27-12,803), and the median number of predictors included in final models was 6 (range 3-24). Regression analysis was used in 24 models (commonly logistic regression or generalized estimating equations), while 7 involved machine learning approaches (e.g., [deep] neural networks). Internal validation was undertaken for 19 models, 7 were externally validated, and 2 were both internally and externally validated.
Knee OA. Incident radiographic knee OA. Of 23 models predicting incident knee OA, 13 defined the outcome radiographically by plain radiography, typically a Kellgren/Lawrence (K/L) grade of 2 or more, although 2 models selected the more severe threshold of K/L grade of ≥3 (23,24). The median number of participants/joints with the outcome of interest for these 13 models was 95 (range 27-474). The median AUC following internal and external validation was 0.77 (range 0.69-0.82, 6 models) and 0.76 (range 0.60-0.86, 4 models, 6 populations), respectively (Figure 2). All 13 models included predictors obtainable from clinical assessment. The most common predictors, featuring in >4 final models, were age, sex, body mass index (BMI), and previous knee injury, as well as self-reported pain, stiffness, and function scores from the Western Ontario and McMaster Universities Osteoarthritis Index. Eight models solely used predictors available from clinical assessment (25-30), a further 6 models included predictors sourced from plain radiographs at baseline (23,24,31-34), 3 used magnetic resonance imaging (MRI) (34-36), and 5 incorporated serum or urinary biomarkers (24,35,37-39) (included predictors for all models are shown in Figure 3).
Incident symptomatic knee OA (frequent knee pain). A symptomatic knee OA outcome was used in 5 models, most commonly defined as the onset of frequent knee pain. The number of participants/joints with the outcome of interest for these models ranged from 51 to 2,103. The median AUC following internal validation was 0.71 (range 0.70-0.78, 5 models). None of these models used predictors beyond those obtainable from clinical assessment or plain radiography.

Hip OA. Of 4 models derived to predict hip OA (41-44), all originated in The Netherlands and used the same definition: a composite outcome of K/L grade of ≥2 or total hip replacement (THR). The median number of participants/joints was 994.5. The 2 earliest models were derived in the Rotterdam Study-I cohort (41,42); the latter 2 used the Cohort Hip and Cohort Knee (CHECK) study (43,44). All models featured age, sex, and BMI, together with radiographic parameters. A baseline K/L grade (0/1) was used in 3 models (41-43); the remainder used the presence of joint space narrowing and osteophytes (upon which the K/L grade is calculated) (44). The most recently published models incorporated trabecular bone texture (43) and patented automated hip shape via a machine learning algorithm (44). Discrimination of the latter model in particular was high (the AUC from internal validation was 0.86 [95% CI 0.83-0.90]) with near-perfect calibration (44). External validation was undertaken only by Saberi Hosnijeh et al, finding a reduction in discrimination (Figure 4).
Hand OA. We identified 2 Scandinavian studies (deriving 3 models) predicting the incidence of hand OA (45,46). Johnsen et al (46) developed their own prediction models for hand OA in male and female subjects separately, using diagnostic ICD-9 and ICD-10 codes within the Norwegian National Patient Register. Of note, no improvement in performance was observed with the addition of a genetic risk score in the models for either sex, or with reproductive and hormonal factors in female participants (46). While both studies underwent internal validation, neither was externally validated.
OA (any joint). The study by Black et al, using the Canadian Primary Care Sentinel Surveillance Network, was the only one that sought to predict the incidence of OA irrespective of joint (47). They identified 383,117 eligible patients, of whom 12,803 received a billing or problem-list code for OA within 5 years of cohort entry. Their model consisted of 5 predictors routinely collected within electronic health record (EHR) data: age, sex, BMI, prior leg injury, and osteoporosis diagnosis. Both discrimination (AUC 0.84 [95% CI 0.83-0.85]) and calibration were good following 10-fold cross-validation (Figure 4).
Calibration summary. Of the 12 models for knee OA outcomes that underwent internal validation, 5 included calibration assessment. Earlier models (27,31) appraised calibration using the Hosmer-Lemeshow statistic, and 4 studies presented and appraised calibration plots (27,29,30,38). Five of the 6 models for knee OA undergoing external validation assessed calibration using the Hosmer-Lemeshow statistic; only Fernandes et al (27) presented findings visually.
All models for hip (except the earliest [41]), hand, and any-joint OA were internally validated. All but 1 model presented calibration plots visually, reporting reasonable or good agreement between expected and observed outcomes. Models by Gielis et al (44) and Black et al (47) demonstrated excellent calibration.
Risk-of-bias summary. The most common sources of potential bias included the extensive use of univariable analysis to select predictors for inclusion in final models, and suboptimal handling of missing data, competing risks, and cohort attrition (domain 4). The inclusion of THR (a measure of both incidence and progression) in composite outcome definitions of incident hip OA was also flagged (Figure 5).
We judged model applicability in terms of the ability to be implemented at scale in diverse general (adult) populations. Within those terms, applicability was typically judged to be poor, most often because of the need for predictors obtained by imaging or biologic samples that may not be routinely available, recommended, or affordable for such application (domain 2). Ethnicity and other social stratifiers were seldom considered or included in final models, and we were often unsure whether the data sets used to derive and validate models had drawn from sufficiently diverse populations to be applied at scale in general populations (domain 1).

DISCUSSION
We sought to systematically identify and critically evaluate existing multivariable risk prediction models for OA incidence and to consider their potential application at scale in diverse populations to advance individual risk-informed preventive action. Our review identified 26 studies deriving 31 multivariable risk prediction models. A total of 16 models published since 2018 suggests a growing field, attracting machine learning approaches and novel biomarkers, but one that remains centered around a relatively small number of mature cohort data sets of knee (and to a lesser extent hip) OA incidence in high-income countries.
Importantly, our review identifies a general lack of inclusion of social stratifiers beyond age, sex, and occupation-associated risk. Of note, until a study in 2021 by Chan et al (and subsequently Guan et al in 2022) (40,48), no models included stratifiers relating to ethnicity or markers of deprivation, factors that are associated with disparities in both the incidence and prevalence of OA (49,50). The lack of such predictors, as well as income, education, and geographic location, may also contribute to a lack of applicability to, and usability in, wider populations.
With some exceptions, notably when predicting a future recorded diagnosis of OA across very long prediction horizons, model discrimination after validation ranged from an AUC of 0.70 to 0.85. This range of performance relates to heterogeneous models with prediction horizons from 2 to 12 years and predictors whose collection and processing vary in cost and complexity. In several models undergoing internal or external validation, calibration was either not reported or relied on the Hosmer-Lemeshow test statistic, which is recognized as problematic and no longer recommended (51). Better approaches, however, including the visual display of calibration plots (29,38,42-47) and reporting of the calibration intercept and slope (38,43,47), were used in several more recent studies. Poor calibration of some models was attributed to the inherent unpredictability of incident OA over very long prediction horizons (45), and also to challenges in identifying suitably comparable cohort data sets for external validation (27,42). We identified no recent examples of externally validated models supported by moderate or strong evidence of good calibration. It is important to acknowledge that developing a single prediction model for OA may be challenging, particularly across different target populations, within different health care systems, or across long prediction horizons, from adolescence to disease onset. Predicting individual risk and preventive intervention remain achievable but may require several models for different settings and contexts.
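The calibration intercept and slope mentioned above are conventionally obtained by regressing observed outcomes on the logit of the predicted risks. The following is a minimal sketch in Python of this recalibration fit, using a simple Newton-Raphson routine and synthetic data; it illustrates the general technique, not the method of any included study.

```python
# Illustrative sketch: estimating the calibration intercept and slope by
# fitting observed ~ logistic(a + b * logit(predicted)). Synthetic data only.
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def calibration_intercept_slope(predicted, observed, iters=25):
    """Newton-Raphson fit of a logistic recalibration model.
    Intercept a near 0 and slope b near 1 indicate good calibration;
    b < 1 suggests predictions are too extreme (e.g., overfitting)."""
    x = [logit(p) for p in predicted]
    a, b = 0.0, 1.0  # start at "perfect calibration"
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, observed):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # fitted probability
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w                 # observed information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det  # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Synthetic, well-calibrated predictions: outcomes drawn with the stated risks
random.seed(1)
probs = [random.uniform(0.05, 0.95) for _ in range(2000)]
outcomes = [1 if random.random() < p else 0 for p in probs]
intercept, slope = calibration_intercept_slope(probs, outcomes)
```

With well-calibrated synthetic risks such as these, the fitted intercept and slope lie close to 0 and 1; feeding the same routine systematically overconfident predictions would pull the slope below 1, which is the pattern the recalibration statistics are designed to reveal.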
Amid our critical appraisals of risk of bias and areas for methodologic improvement were several positives: the common use of internal or external validation, greatly facilitated by data sharing and by the foresight to design overlapping data points across different cohort studies; the shift toward more careful evaluation of model calibration; and the common practice of including certain core predictors (age, sex, BMI, previous injury). We would also encourage others to emulate, where possible, the initial approach of Johnsen et al of testing and adapting a previously published model rather than assuming the need to derive another new model (46).
We found little evidence of patient and public involvement and engagement in the included studies. This may contribute to a lack of clarity on potential applications and routes to implementation, and studies may be strengthened by clear rationale statements alongside integration of patient and public involvement and engagement, for instance by following the Guidance for Reporting Involvement of Patients and the Public (52). Our own review can be criticized on this point, limited by being an unfunded project with no means for the required remuneration. Lack of patient and public involvement is an area for future strengthening.
Our prospectively registered review used a replicable search strategy without language restriction in the search phase across 3 electronic databases, which was rerun prior to submission and supplemented by searches of reference lists and conference abstracts. Pairs of reviewers working independently and using recommended checklists and risk-of-bias tools performed study selection, data extraction, and risk-of-bias assessments. Our review has several limitations. First, risk-of-bias tools specifically for prediction modeling studies using machine learning techniques were under development at the time of our review (53). Second, several studies derived multiple models for the same outcome. The designation of a final model relied on 2 reviewers' independent judgment based first on the authors' description but was not necessarily the best-performing or most applicable model reported. Third, while knowing all of the candidate predictors considered in model development would be of interest, this information was often lacking or only partially reported. We opted not to try to synthesize this information. We also refrained from attempting to calculate events per variable (EPV) for each model because of the lack of available information on candidate predictors and because EPV is no longer recommended as a guide to sample size (54,55). Fourth, we did not undertake meta-analysis due to study heterogeneity, nor meta-regression due to an insufficient number of models. Finally, models were developed by their authors for many reasons. Models judged by us to have low applicability to large-scale use in diverse general populations may be highly applicable for other purposes, such as enriched recruitment to clinical trials or use within selected clinical settings.
We are unaware of any other previously published review of multivariable risk prediction models for OA incidence with which to compare our findings, although we note a published protocol of a review in development relating to prognostic models for knee OA (56). Reviews in other fields have found similar concerns over the methodologic quality and applicability of multivariable prediction models for disease incidence (57-60). An excess of model development and a lack of rigorous external validation by independent research teams is a recurrent theme. In the more established field of cardiovascular risk prediction, use of registry and EHR data sets appears more common, and efforts are underway to adapt models for application in low- and middle-income countries (4). Such attempts may signal directions for the future development of individual risk prediction in OA. Challenges concerning the validity and completeness of coding and the availability of information on important predictors within routine EHR data are well recognized. The approach of Black et al (47), however, suggests that mitigating some of these challenges may be possible, producing prediction models with good performance and a prospect of implementation within existing national health systems. However, whether routine EHR data can support accurate risk prediction models remains unclear, specifically for hip OA and hand OA. Aspects of hip morphology appear to add important predictive value but will have limited availability in routine records for general populations. Furthermore, the consistent use of composite outcomes including THR may limit both applicability and accuracy in predicting incident disease, an implication for hip OA model development that may be highlighted following more extensive external validation. Substantial under- and misdiagnosis of hand OA poses a different challenge.
Johnsen et al (46), in their separate prediction models of incident hand OA in males and females, did not find a significant improvement in model performance with the addition of a genetic risk score. Genetic association within OA is a growing field, with ongoing identification of associated variants (61). While the predictive value of these novel variants remains unknown, we believe a model feasible for widespread implementation in clinical practice should use routinely available predictors. Furthermore, previous literature suggests that accurate prediction of outcomes requires associations of a strength rarely observed in studies (62). Consequently, the addition of predictors such as genetic risk scores to core predictors such as age may not significantly change model performance, which may explain the apparent null result in the models of Johnsen et al (46).
Our review excluded several studies that we feel deserve specific mention. We excluded studies that relied solely on joint replacement as the outcome because of the risk of conflating predictors of incidence and progression. However, the separation of incidence from progression in OA can be contested. Approaches to modeling changes in symptom and disease severity, treating cohort enrollment or censoring as points on a spectrum rather than binary events, such as those by Halilaj et al (63) and Widera et al (64), may still contain relevant information. In addition, the linked studies of Losina et al (65) and Michl et al (66) provide evidence that is highly relevant to introducing individual risk models for OA in 1 scalable format: patient self-evaluation using an online OA risk calculator. Of note, this calculator used relatively simple-to-report predictors based on the earlier Nottingham risk prediction model derived by Zhang et al (26). Beyond these studies, there is a lack of evaluation of the impact of risk prediction models for OA used in clinical practice.
In summary, we identified 31 multivariable prediction models for OA from 26 published studies. While interest is growing among researchers, applicability to clinical practice in diverse general populations is often lacking, and evaluation of the impact of implementation remains scarce. We suggest that models may benefit from clearly stating their rationale and from integrating patient and public involvement and engagement, as well as predictors such as ethnicity. Furthermore, progression toward viable risk prediction models for OA that are applicable across a range of settings would be aided by a focus on routinely available predictors and by wider external validation of models in varied populations. Last, growing interest in machine learning techniques, as well as in classifying OA as a progressive spectrum rather than a dichotomous incident outcome, warrants updated research guidelines to better appraise such innovative approaches.