Machine learning prediction models in orthopedic surgery: A systematic review in transparent reporting

Abstract Machine learning (ML) studies are becoming increasingly popular in orthopedics but lack a critical appraisal of their adherence to peer‐reviewed guidelines. The objective of this review was to (1) evaluate the quality and transparency of reporting of ML prediction models in orthopedic surgery based on the transparent reporting of multivariable prediction models for individual prognosis or diagnosis (TRIPOD) statement, and (2) assess risk of bias with the Prediction model Risk Of Bias ASsessment Tool (PROBAST). A systematic review was performed to identify all ML prediction studies published in orthopedic surgery through June 18th, 2020. After screening 7138 studies, 59 studies met the study criteria and were included. Two reviewers independently extracted data, and discrepancies were resolved by discussion with at least two additional reviewers present. Across all studies, the overall median completeness for the TRIPOD checklist was 53% (interquartile range 47%–60%). The overall risk of bias was low in 44% (n = 26), high in 41% (n = 24), and unclear in 15% (n = 9). High overall risk of bias was driven by incomplete reporting of performance measures, inadequate handling of missing data, and use of small datasets with inadequate outcome numbers. Although the number of ML studies in orthopedic surgery is increasing rapidly, over 40% of the existing models are at high risk of bias. Furthermore, over half incompletely reported their methods and/or performance measures. Until these issues are adequately addressed so that patients and providers can trust ML models, a considerable gap remains between the development of ML prediction models and their implementation in orthopedic practice.


| INTRODUCTION
Prediction models for orthopedic surgical outcomes based on machine learning (ML) are rapidly emerging. Such models, if adequately reported, can guide treatment decision making, predict adverse outcomes, and streamline perioperative healthcare management.
However, transparent and complete reporting is required to allow the reader to critically assess the presence of bias, facilitate study replication, and correctly interpret study results. Unfortunately, previous studies have suggested that prediction models demonstrate incomplete, nontransparent reporting of items such as study design, patient selection, variable definitions, and performance measures. 1,2 To our knowledge, no systematic review has assessed the completeness of reporting for the currently available prognostic ML models in orthopedic surgery.
The transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement was published in 2015 to improve the quality of reporting of prediction models. 3,4 It provides a guideline for the essential elements of prediction model studies. The statement is endorsed by over ten leading medical journals and has been cited thousands of times. The prediction model risk of bias assessment tool (PROBAST) was developed by the Cochrane Prognosis group in 2019 to assess risk of bias in prediction models, and has been successfully piloted. 5 Both the PROBAST and TRIPOD had yet to be published at the time several ML prediction models for orthopedic surgical outcomes were developed; nonetheless, we believe they can be used as benchmarks for measuring quality of reporting and bias even if the prediction models were published before their introduction.
In this systematic review, we (1) evaluate the quality and completeness of reporting of prediction model studies based on ML for prognosis of surgical outcomes in orthopedics according to their adherence to the TRIPOD statement, and (2) assess the risk of bias with the PROBAST.

| METHODS

| Eligibility criteria
Studies were included if they evaluated ML models predicting any orthopedic surgical outcome, such as survival, patient-reported outcome measures (PROMs), or complications. Exclusion criteria were (1) non-ML techniques (such as logistic or linear regression analysis), (2) conference abstracts, (3) non-English studies, (4) lack of full text, and (5) nonrelevant study types, such as animal studies, letters to the editor, and case reports. Orthopedic surgery was defined as any operation for patients with musculoskeletal disorders.

| Data extraction
Six reviewers (PTO, OQG, AL, PT, NDK, and BBJ) independently assessed the first 10% of studies. All extracted data were then discussed during a group session with the principal investigator (PI) (JHS) to ensure quality and consistency. Any questions about discrepancies in the extracted data were resolved by the PI. After this quality training, the same six reviewers split into three pairs, and each pair independently assessed its share of the remaining 90% of studies, which were evenly distributed among the pairs. Each pair consisted of a research fellow with a medical doctor degree and a medical student. Disagreements within a pair were resolved during a consensus meeting with at least two other reviewers present. All six reviewers and the PI had previously worked on and/or published ML prediction models of orthopedic surgical outcomes.
For each included study, we extracted the following information: journal, prospective study design (yes/no), use of national or registry database (yes/no), size of total dataset, number of predictors used in the final ML model, predicted outcome, mention of adherence to the TRIPOD guideline in the study (yes/no), access to the ML algorithm (yes/no), TRIPOD items, and PROBAST domains. The TRIPOD items and PROBAST domains are explained in more detail below.
The TRIPOD statement consists of 22 main items, of which two (12 and 17) refer to model updating or external validation studies, leaving 20 main items to be extracted for prognostic prediction modeling studies. 4 These main items were transformed into an adherence assessment form by the statement developers. Of the 20 main items, 11 had no subitems (1, 2, 8, 9, 11, 16, 18, 19, 20, 21, and 22), seven were divided into two subitems (e.g., 3a and 3b; items 3, 4, 6, 7, 13, 14, and 15), and two into three subitems (e.g., 5a, 5b, and 5c; items 5 and 10). Four subitems (10c, 10e, 13c, and 19a) were, together with the two main items (12 and 17), not extracted because they did not refer to developmental studies (e.g., 10c "For validation, describe how the predictions were calculated"; Appendix 2). Hereafter, subitems and main items are referred to under one nomenclature, "items" (e.g., main item 3 consists of two items: 3a and 3b). In total, 29, 30, or 31 potential items could be assessed per study. This total varied between 29 and 31 because some items could be scored "not applicable" (e.g., 14b: "if nothing on univariable analysis (in methods or results) is reported, score not applicable"), and such items were excluded when calculating the completeness of reporting. Also, some items could be scored "referenced" (e.g., item 6a); "referenced" was considered "completed" and included when calculating the completeness of reporting.
Each item may consist of multiple elements, and all elements must be scored "yes" for the item to be scored "completed." To calculate the completeness of reporting of TRIPOD items, the number of completely reported TRIPOD items was divided by the total number of applicable TRIPOD items for that study. If a study reported on multiple prediction models (e.g., prediction models for 90-day and 1-year survival), we extracted data only on the best-performing model.
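The completeness calculation described above (excluding "not applicable" items from the denominator and counting "referenced" as "completed") can be sketched as follows. This is a hypothetical illustration; the item names and scores are invented, not the authors' actual extraction data.

```python
def tripod_completeness(item_scores):
    """Completeness = completed items / applicable items.

    'not applicable' items are excluded from the denominator;
    'referenced' counts as completed, per the scoring rules above.
    """
    applicable = [s for s in item_scores.values() if s != "not applicable"]
    completed = [s for s in applicable if s in ("completed", "referenced")]
    if not applicable:
        return 0.0
    return len(completed) / len(applicable)


# Example: 3 completed, 1 referenced, 1 incomplete, and 1 "not applicable" item
scores = {
    "1": "completed", "2": "incomplete", "3a": "completed",
    "3b": "completed", "6a": "referenced", "14b": "not applicable",
}
print(tripod_completeness(scores))  # 0.8 (4 of 5 applicable items completed)
```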

| RESULTS

| Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis
Among all studies, the overall median completeness for the TRIPOD items was 53% (interquartile range: 47%–60%; see Figure 2 and Appendix 5). Eight items were reported in over 75% of studies and seven items in fewer than 25% (Table 2). The abstract (item 2) and the model-building procedure (item 10b) were the most poorly reported items, each completed in only 3% (2/59) of studies. Source of data (item 4a) was reported in all studies (100%; 59/59).

| Prediction model risk of bias assessment tool
The overall risk of bias was low in 44% (26/59), high in 41% (24/59), and unclear in 15% (9/59) of the studies (Figure 3). The studies rated as high overall risk of bias were mainly rated this way due to failure to report calibration measures. Calibration is an essential element of describing the performance of ML models, and its importance has been discussed extensively in earlier reviews. [22][23][24] The frequent omission of calibration renders assessment of performance virtually impossible and is in line with previous literature on prediction models. 25,26 Finally, the small sample sizes with often small outcome numbers introduce a risk of overfitting. Overfitting refers to including too many prognostic factors relative to the number of cases; this may improve prediction performance within the training dataset but reduces generalizability outside it.

FIGURE 2 Overall adherence per TRIPOD item. *All items consisted of 59 datapoints, except for item 5c (58), item 11 (4), and item 14b (45) due to the "Not applicable" option. TRIPOD, transparent reporting of a multivariable prediction model for individual prognosis or diagnosis.

TABLE 2 Individual TRIPOD items sorted by completeness of reporting over 75% and under 25%

| TRIPOD item (>75%) | TRIPOD description | % (n) | TRIPOD item (<25%) | TRIPOD description | % (n) |
|---|---|---|---|---|---|
| 4a | Describe the study design or source of data (e.g., randomized trial, cohort, or registry data). | 100 (59) | 10b | Specify type of model, all model-building procedures (including any predictor selection), and method for internal validation. | 3 (2) |
| 19b | Give an overall interpretation of the results considering objectives, limitations, results from similar studies and other relevant evidence. | 98 (58) | 2 | Provide a summary of objectives, study design, setting, participants, sample size, predictors, outcome, statistical analysis, results, and conclusions. | 3 (2) |
| 18 | Discuss any limitations of the study (such as nonrepresentative sample, few events per predictor, missing data). | 97 (57) | 15a | Present the full prediction model to allow predictions for individuals (i.e., all regression coefficients, and model intercept or baseline survival at a given time point). | 8 (5) |
| 3b | Specify the objectives, including whether the study describes the development of the model. | 95 (56) | 13a | Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful. | 19 (11) |
| 3a | Explain the medical context and rationale for developing the multivariable prediction model, including references to existing models. | 85 (50) | 14a | Specify the number of participants and outcome events in each analysis. | 20 (12) |
| 5b | Describe eligibility criteria for participants. | 83 (49) | 1 | Identify the study as developing a multivariable prediction model, the target population, and the outcome to be predicted. | 20 (12) |
| 5c^a | Give details of treatments received, if relevant. | 83 (48) | 14b^a | If done, report the unadjusted association between each candidate predictor and outcome. | 24 (11) |
| 8 | Explain how the study size was arrived at. | 76 (45) | | | |

Abbreviation: TRIPOD, transparent reporting of a multivariable prediction model for individual prognosis or diagnosis.
^a All items consisted of 59 datapoints, except for 5c (58) and 14b (45) due to the "Not applicable" option.

For example, item 2 (the abstract) consists of 12 elements, all of which have to be fulfilled in order for item 2 to be marked as "completely reported." Also, authors as well as journal reviewers might have good reasons to exclude certain TRIPOD information. For example, one may not report regression coefficients in item 15 "model specifications" or provide "the potential clinical use of the model" in item 20 if they believe that their prediction model is not fit for clinical use. Nonetheless, we scored these items in the current study as "incomplete." This rigorous method of scoring is in line with the nature of the TRIPOD guideline and is deemed essential for consistent and transparent reporting of prediction models. In addition, most journals impose a maximum word count or prescribe specific formatting requirements. These restrictions could potentially prevent authors from including all 12 elements.
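As an illustration of the calibration assessment discussed above, the sketch below compares predicted risks with observed event rates, both overall (calibration-in-the-large, the observed/expected ratio) and per risk quintile (a binned reliability table). The data are synthetic and generated to be well calibrated by construction; this is not the review's analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: predicted probabilities and outcomes for 1000 "patients".
p = rng.uniform(0.05, 0.95, size=1000)        # model-predicted risks
y = (rng.uniform(size=1000) < p).astype(int)  # outcomes drawn from those risks

# Calibration-in-the-large: observed event rate vs. mean predicted risk.
o_e_ratio = y.mean() / p.mean()               # ~1.0 for a calibrated model

# Binned reliability: mean predicted vs. observed event rate per risk quintile.
# A well-calibrated model shows predicted ≈ observed in every bin.
bins = np.quantile(p, [0, 0.2, 0.4, 0.6, 0.8, 1.0])
idx = np.clip(np.digitize(p, bins[1:-1]), 0, 4)
for b in range(5):
    m = idx == b
    print(f"bin {b}: predicted {p[m].mean():.2f} -> observed {y[m].mean():.2f}")
print(f"O/E ratio: {o_e_ratio:.2f}")
```

A model can discriminate well (high AUC) while still being poorly calibrated, which is why reporting discrimination alone, as many of the reviewed studies did, is insufficient.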
Despite these limitations, this review provides the first comprehensive overview of completeness of transparent reporting for ML prediction models in orthopedics. Illustrating poor reporting of TRIPOD items identifies current hurdles and may improve future transparent reporting.

| CONCLUSION
Prognostic surgical outcome models are rapidly entering the orthopedic field to guide treatment decision making. This review indicates that numerous studies display poor reporting and are at high risk of bias. Future studies aimed at developing prognostic models should explicitly address the concerns raised, such as incomplete reporting of performance measures, inadequate handling of missing data, and not providing the means to make individual predictions. Collaboration for sharing data and expertise is needed not just for developing more reliable prediction models, but also for validating current models.
Methodological guidance, such as the TRIPOD statement, should be followed, as unreliable prediction models can cause more harm than benefit when guiding medical decision making.