Deep learning based detection of osteophytes in radiographs and magnetic resonance imagings of the knee using 2D and 3D morphology

In this study, we investigated the discriminative capacity of knee morphology in automatic detection of osteophytes defined by the Osteoarthritis Research Society International (OARSI) atlas, using X‐ray and magnetic resonance imaging (MRI) data. For the X‐ray analysis, we developed a deep learning (DL) based model to segment the femur and tibia. For MRIs, we utilized previously validated segmentations of the femur, tibia, corresponding cartilage tissues, and menisci. Osteophyte detection was performed using DL models in four compartments: medial femur (FM), lateral femur (FL), medial tibia (TM), and lateral tibia (TL). To analyze the confounding effects of soft tissues, we investigated their morphology in combination with bones, including bones+cartilage, bones+menisci, and all the tissues. From X‐ray‐based 2D morphology, the models yielded balanced accuracy of 0.73, 0.69, 0.74, and 0.74 for FM, FL, TM, and TL, respectively. Using 3D bone morphology from MRI, balanced accuracy was 0.80, 0.77, 0.71, and 0.76, respectively. The performance was higher than in 2D for all the compartments except for TM, with significant improvements observed for femoral compartments. Adding menisci or cartilage morphology consistently improved balanced accuracy in TM, with the greatest improvement seen for small osteophytes. Otherwise, the models performed similarly to bones‐only. Our experiments demonstrated that MRI‐based models show higher detection capability than X‐ray‐based models for identifying knee osteophytes. This study highlighted the feasibility of automated osteophyte detection from X‐ray and MRI data and suggested a further need for development of osteophyte assessment criteria in addition to OARSI, particularly for early osteophytic changes.


| INTRODUCTION
Knee osteoarthritis (OA) is the most common musculoskeletal degenerative disease. [5][6] For instance, the presence of femoral osteophytes (OSTs), particularly in the lateral compartment, was found to be significantly associated with increased pain. 7 Patients with large OSTs have been shown to experience greater joint space narrowing. 8 Moreover, OSTs often progress alongside bone marrow lesions, meniscal extrusion, and cartilage defects. 9 For assessment of OST severity, semi-quantitative scoring systems are commonly used, such as the Osteoarthritis Research Society International (OARSI) 10 atlas for X-rays and the MRI Osteoarthritis Knee Score (MOAKS) for MRIs. 11 It has been shown that MRI has a high sensitivity for detecting bony changes and excellent contrast for imaging soft tissues, 12,13 making it a potentially beneficial modality for OST detection and complete joint assessment. However, its lower spatial resolution and inferior contrast in bone imaging, compared to X-rays, may limit its utility. Nonetheless, the volumetric nature of MRI has shown promise in improving the ability to detect OSTs. 12,14
The OARSI and MOAKS semi-quantitative grading systems follow similar criteria. Podlipská et al. 15 found a moderate to very good agreement between OARSI OST detection (at cutoff grade 1) and MOAKS-based OST scores. Additionally, Kim et al. 16 reported a good agreement between the guidelines for OST scoring in lateral compartments. While both grading guidelines are widely used by clinicians, manual grading of OSTs suffers from rather high inter-rater variability, which affects the objectivity of joint health evaluation. 17
Automated tissue quantification methods may help to reduce reader subjectivity and thereby improve the accuracy and reproducibility of OST assessment. Besides, such methods may enable analysis of OSTs in large cohorts, which otherwise would require a tremendous amount of expert labor. Despite the significant associations between OSTs and the multiple OA-related symptoms mentioned above, only a few studies have focused on quantifying OSTs automatically and, specifically, from MRI. [18][19][20][21][22] Antony et al. 20 employed knee X-rays to automatically predict knee OST severity using a Deep Learning (DL) model.
Similarly, Tiulpin et al. 21 developed a DL-based model to predict OARSI OST scores and Kellgren-Lawrence grade (KLG) in a multilabel classification setting. 21 These prior DL-based models considered various factors, including bone texture. Only a few studies focused on detecting OSTs using the morphology. 19,22 Thomson et al. 19 investigated 2D morphology of the knee bones and the X-ray-based texture, both jointly and in isolation, for analyzing OSTs. They showed that including knee morphology information improves the performance of a texture-based model. Another study employed a Statistical Shape Modelling (SSM) approach to detect OSTs. 23 By design, SSMs can be inefficient for accurate OST detection and analysis with large datasets. This may be attributed to the significant interindividual variability in knee shapes within the population and the relatively small size of OSTs. In contrast, DL techniques may provide greater sensitivity and specificity for OST detection. However, one essential question remains: whether 3D shape features provide additional predictive power, compared to 2D morphology, in detecting OSTs. Furthermore, it is crucial to consider how the shape and location of other tissues confound and indicate the presence of OSTs.
In this study, we investigated the discriminative capacity of knee joint tissue morphology for detecting tibiofemoral OSTs, both from X-ray and MRI data. When we mention 'detection', we are referring to identifying the presence (OARSI grade 1 or higher) or absence of OSTs. The contribution of our study is fourfold. First, we developed a convolutional neural network (CNN) model to segment knee bones from X-ray data. Second, we created a DL-based method for automatic detection of OSTs from knee morphology. Third, we conducted a comprehensive comparison of X-ray and MRI-based morphology in automated OST detection across various KLG and OARSI OST levels. Lastly, we investigated whether soft tissues of the knee joint visualized with MRI confound and add value to OST detection. Our results suggested the need for further development of osteophyte assessment criteria, in addition to OARSI.

| MATERIALS AND METHODS
In this study, we developed a CNN-based framework to detect OSTs from the morphology of knee joint tissues. We first segmented the tissues automatically from knee X-ray and MR images and then trained DL-based models to classify knees as having or not having OSTs based on the OARSI atlas. The overall pipeline is summarized in Figure 1.

| Data
We used publicly available data from the Osteoarthritis Initiative (OAI; https://nda.nih.gov/oai), a large longitudinal cohort focused on the development of biomarkers associated with knee OA onset and progression. In OAI, X-ray images were acquired using a fixed-flexion bilateral Posterior-Anterior weight-bearing protocol with a Synaflexer positioning frame (with pixel spacing ranging from 0.10 to 0.20 mm). 24 MR images were acquired with a 3 T MRI scanner (Siemens MAGNETOM Trio, Erlangen, Germany) and included, among others, a 3D sagittal dual echo in steady-state (DESS) sequence (repetition time of 16.3 ms; echo time of 4.7 ms; slice thickness 0.7 mm; voxel spacing 0.365 mm × 0.365 mm; acquisition matrix 384 × 307). 25 Semi-quantitative gradings using OARSI and MOAKS guidelines are available for subsets of the complete cohort. Specifically, OARSI measurements are available for 5539 knees, representing all OAI participants who underwent knee X-rays with definite radiographic OA in at least one knee. In comparison, MOAKS measurements covered a smaller subset of the cohort, with 600 knees. Because of this difference in sample size, we employed OARSI assessments in our analysis.
We used a data set of 5349 knees (2778 subjects) from the baseline visit of OAI. The complete data selection flowchart is presented in Supp. Regarding OA severity, the distribution of KLGs was as follows: KLG0: 752 (14%), KLG1: 831 (16%), KLG2: 2299 (43%), KLG3: 1190 (22%), and KLG4: 277 (5%). OARSI OST assessments are available for medial femoral (FM), lateral femoral (FL), medial tibial (TM), and lateral tibial (TL) compartments separately, as shown in Figure 2A. 10 The scores were assigned by two readers who evaluated joint condition from radiographs, referencing the OARSI atlas. 26 The scoring was based on OST size, ranging from 0 to 3, corresponding to "no", "small", "medium", and "large" OST sizes. Figure 2 shows different grades on an X-ray (Figure 2B) and an MRI sample (Figure 2C). The distribution of grades in our study is shown in Table 1. For the FM, FL, and TL compartments, OST grade 0 was the most prevalent in the data set.
In these compartments, OST cases (grades greater than zero) accounted for 45%, 39%, and 38% of the knees, respectively, indicating an imbalanced data set. In the case of TM, 61% of cases had OSTs, with the majority being grade 1.
We obtained knee morphology from X-ray images through a deep learning-based bone segmentation pipeline. As for the MRIs, automatic segmentations previously generated and validated by Tack et al. 27 were used. MRI-based masks were derived from 3D DESS MRI data of the OAI through a multi-stage pipeline including 2D and 3D CNNs, as well as statistical shape refinement. The segmentations included separate masks for femur, tibia, corresponding cartilage tissues, and menisci (Figure 1B; pipeline details in Supp. Section S1).

| Data preprocessing
We performed three essential data pre-processing steps in this study. First, the X-ray images were normalized for the development of our X-ray segmentation model. Second, we extracted the region of interest (ROI) from both X-ray and MRI segmentation masks. Third, we divided the ROIs into lateral and medial compartments. The exact preprocessing techniques are described in Supp. Section S2.
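As a rough illustration of the third step, the sketch below splits a 2D ROI mask into medial and lateral halves. The split at the middle column and the `medial_on_left` flag are assumptions made for this example only; the actual procedure is the one described in Supp. Section S2, and which image side is medial depends on knee laterality.

```python
def split_compartments(mask, medial_on_left=True):
    """Split a 2D mask (list of equally long rows) into (medial, lateral) halves.

    Assumption: the ROI is centred on the joint, so the middle column is a
    reasonable dividing line. This is a hypothetical simplification.
    """
    width = len(mask[0])
    mid = width // 2
    left = [row[:mid] for row in mask]
    right = [row[mid:] for row in mask]
    return (left, right) if medial_on_left else (right, left)
```

Each half could then be passed to the model of the corresponding compartment, for example the medial half to the FM and TM models.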

| Knee bone segmentation from X-ray images
We developed a bone segmentation model using deep learning (Figures 1A and 2). Femur and tibia were manually segmented from 200 X-rays using Adobe Photoshop (Adobe Inc.) by a person with

| Osteophyte detection from knee morphology
In this stage, we used the automatic segmentations to detect OSTs. We framed the problem as binary classification and defined the target based on the presence of OSTs. The positive class was defined as 'OARSI OST grade > 0', and the negative class as 'OARSI OST grade = 0'. We employed two CNN architectures, ResNet-18 and ResNet-10, to detect OSTs from 2D and 3D masks, respectively. We held out 20% of the data as the test set and used the rest as the train-validation set. To train our models, we used stratified fivefold cross-validation, where each fold had the same ratio of positive to negative samples. For evaluation of the models, we used the hold-out set of knees, maintaining a distribution of OST severity similar to the complete data set.
Classification performance of the models was assessed using the area under the ROC curve (AUC), average precision (AP), balanced accuracy (BA), and sensitivity.
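The target definition and data split described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names, the seeds, and the round-robin fold assignment are assumptions. The hold-out is stratified on the OARSI grade so that severity is preserved, and the folds are stratified on the binary label.

```python
import random
from collections import defaultdict


def binarize(grades):
    """Positive class: OARSI OST grade > 0; negative class: grade = 0."""
    return [1 if g > 0 else 0 for g in grades]


def stratified_holdout(strata, test_fraction=0.2, seed=0):
    """Return (train_val_indices, test_indices), sampled per stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, s in enumerate(strata):
        by_stratum[s].append(idx)
    train_val, test = [], []
    for s in sorted(by_stratum):
        idxs = by_stratum[s][:]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train_val.extend(idxs[n_test:])
    return sorted(train_val), sorted(test)


def stratified_folds(indices, labels, n_folds=5, seed=0):
    """Deal the indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(n_folds)]
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds
```

In each cross-validation round, one fold would serve as the validation set and the remaining four as the training set.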
To detect OSTs in the medial compartments (FM and TM), the medial part of the bone mask was fed to the network; for the lateral compartments (FL and TL), the lateral part was used. Subsequently, the models were developed to detect OSTs from 3D knee morphology obtained from the MRI data. Soft tissue changes such as cartilage damage and meniscal extrusion are known to be associated with the presence of OSTs. 9,29 However, the impact of these confounding factors on the detection of OSTs is not well understood.
Therefore, we investigated various combinations of tissues as the model input to clarify the contribution of each tissue type in OST detection. The combinations were defined as follows: bone (BON), bone+cartilage (BCAR), bone+menisci (BMEN), and all aforementioned tissues (All). Implementation-wise, the input images comprised three channels, each representing a specific tissue type. For the BON model, only the first channel was used to indicate femur and tibia voxels with a value of 1, while the other two channels were zeroed. For the BMEN model, we set the menisci mask in the second channel. For the BCAR model, the cartilage mask occupied the second channel. Finally, for the 'All' model, bone, cartilage, and menisci voxels were included in the first, the second, and the third channels, respectively. A set of 16 models was trained, one for each of the four compartments and each of the four listed tissue combinations.
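The channel layout described above can be sketched as follows. For brevity, each tissue mask is represented as a flat list of 0/1 voxel values standing in for a 3D volume; the `COMBINATIONS` table and the `build_input` helper are names introduced for this example only.

```python
# Tissue assigned to each of the three input channels; None means zero-filled.
COMBINATIONS = {
    "BON": ("bone", None, None),
    "BCAR": ("bone", "cartilage", None),
    "BMEN": ("bone", "menisci", None),
    "All": ("bone", "cartilage", "menisci"),
}
COMPARTMENTS = ("FM", "FL", "TM", "TL")


def build_input(masks, combination):
    """Return three channels for the given tissue combination.

    masks: dict mapping tissue name to a flat list of 0/1 voxel values.
    """
    n_voxels = len(masks["bone"])
    channels = []
    for tissue in COMBINATIONS[combination]:
        if tissue is None:
            channels.append([0] * n_voxels)
        else:
            channels.append(list(masks[tissue]))
    return channels


# One model per compartment and tissue combination.
MODEL_KEYS = [(c, k) for c in COMPARTMENTS for k in COMBINATIONS]
```

Keeping a fixed three-channel shape for every combination means the same network architecture can be reused across all 16 settings.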

| Training and evaluation
We employed three different models for three distinct tasks: a U-Net-based model 28 for X-ray segmentation, 2D ResNet-18 for X-ray-based mask analysis, and 3D ResNet-10 for MRI-based mask analysis. In both classification settings, a stratified split was done to allocate 80% and 20% of the data to the train-validation and testing sets, respectively. The split was based on the class distribution of each corresponding compartment. Therefore, the test sets differed between compartments. However, for a given compartment, they were identical for the X-ray and MRI models. This approach ensured that both the train and test sets had the same OST distribution (Supp. Figure S4). During the model selection phase, we calculated the out-of-fold ROC AUC value on the validation sets to choose the best-performing model. At the test phase, predictions of the fold-wise models were averaged.
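The two steps mentioned last, scoring by ROC AUC and averaging the fold-wise predictions at test time, can be illustrated with a small standard-library sketch. The function names are hypothetical, and the pairwise AUC computation is chosen for readability rather than speed.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Requires at least one sample of each class.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_fold_predictions(fold_probs):
    """Average per-knee probabilities over the fold-wise models.

    fold_probs: one list of test-set probabilities per fold model.
    """
    n_models = len(fold_probs)
    return [sum(per_knee) / n_models for per_knee in zip(*fold_probs)]
```

With five fold-wise models, `fold_probs` would hold five lists of equal length, and the averaged list would be the ensemble prediction for the hold-out knees.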
We assessed model performance using AP, ROC AUC, and BA. Additionally, we conducted a paired permutation test to determine any statistically significant differences between the X-ray and MRI-based models concerning ROC AUC, AP, and BA. We also calculated the sensitivity for each OST grade based on extended confusion matrices, providing insights into model performance across different levels of OST severity.
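A minimal sketch of these evaluation steps is given below, covering balanced accuracy, grade-wise sensitivity, and a paired permutation test. The permutation scheme (randomly swapping the two models' predictions for each knee), the number of permutations, and the add-one p-value correction are assumptions for illustration; the source does not specify these details.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity. Both classes must be present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def sensitivity_by_grade(grades, y_pred):
    """Fraction of knees detected as positive within each grade > 0."""
    result = {}
    for g in sorted(set(grades)):
        if g == 0:
            continue
        preds = [p for gr, p in zip(grades, y_pred) if gr == g]
        result[g] = sum(preds) / len(preds)
    return result


def paired_permutation_test(y_true, pred_a, pred_b, metric, n_perm=2000, seed=0):
    """Two-sided p-value for the difference in `metric` between two models."""
    rng = random.Random(seed)
    observed = abs(metric(y_true, pred_a) - metric(y_true, pred_b))
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # swap the pair under the null hypothesis
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(metric(y_true, perm_a) - metric(y_true, perm_b)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

The same test could be applied to AP or ROC AUC by passing a different `metric` function and using predicted probabilities instead of binary predictions.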

| Bone segmentation
An X-ray image example overlaid with predicted segmentation masks is illustrated in Figure 1A. Our segmentation model resulted in ASDs of 0.01 ± 0.02 mm for femur and 0.02 ± 0.08 mm for tibia. Considering the pixel spacing with a range between 0.20 and 0.40, the results indicate that our model was able to segment the bone boundaries with very high accuracy. In terms of DSC, the model showed scores of 98.88 ± 0.01 for femur and 98.59 ± 0.01 for tibia. These values indicate a very high degree of similarity between the predicted and ground truth masks (Supp. Figure S5). The higher mean and standard deviation values of ASD in tibia segmentation suggest that this anatomical region was more challenging for the model than the femur. These results could be partially attributed to the presence of the fibula, which overlaps with the tibia in posteroanterior X-rays and may cause confusion during the segmentation. This is supported by visual analysis of the automatic segmentation results (successful and failed segmentations), shown in Supp. Figure S6. Other observed factors that contributed to reduced segmentation quality in certain samples were lower image resolution and overall lower image contrast.
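To make the relation between ASD and pixel spacing concrete, the following sketch computes a symmetric average surface distance between two 2D binary masks and scales it by the pixel spacing. The 4-neighbour boundary definition, the symmetric averaging, and the brute-force nearest-point search are assumptions chosen for clarity; the source does not state which ASD variant was used.

```python
import math


def boundary_pixels(mask):
    """Foreground pixels with at least one background or out-of-image 4-neighbour."""
    rows, cols = len(mask), len(mask[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = rr < 0 or rr >= rows or cc < 0 or cc >= cols
                if outside or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def average_surface_distance(pred, truth, spacing_mm=1.0):
    """Symmetric ASD in millimetres. Both masks must be non-empty."""
    bp, bt = boundary_pixels(pred), boundary_pixels(truth)

    def directed(src, dst):
        return [min(math.hypot(r1 - r2, c1 - c2) for r2, c2 in dst) for r1, c1 in src]

    distances = directed(bp, bt) + directed(bt, bp)
    return spacing_mm * sum(distances) / len(distances)
```

Under this definition, an ASD well below the pixel spacing means that the predicted and reference boundaries coincide at most boundary pixels.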

| Comparison of 2D and 3D bone morphology in osteophyte detection
The X-ray and MRI-based models were evaluated on the corresponding test sets. Both models demonstrated moderate to high performance in OST detection, with BA values ranging from 0.69 to 0.80. In detection of femoral OSTs, X-ray-based models achieved an average AP of 0.81, AUC of 0.80, and BA of 0.73 for the medial part, and AP of 0.68, AUC of 0.75, and BA of 0.69 for the lateral part. The MRI-based models achieved significantly higher performance for femoral compartments: average AP of 0.86, AUC of 0.87, and BA of 0.80 for the medial part, and AP of 0.79, AUC of 0.83, and BA of 0.77 for the lateral part. The results suggest that the greater curvature of the femur may result in an increased likelihood of overlap between the normal bone margins of the femur and its OSTs in 2D projection.
For detection of OSTs in TM and TL, the 2D and 3D models yielded high performance, with BA values ranging from 0.71 to 0.76. The models performed similarly, and the metrics were in the following ranges: AP of 0.86-0.88, AUC of 0.78-0.80, and BA of 0.71-0.74 for the medial part, and AP of 0.76-0.79, AUC of 0.82-0.84, and BA of 0.74-0.76 for the lateral part. The results suggest that X-ray- and MRI-based models can be used interchangeably for detecting OSTs in the tibia. The positioning during the acquisition, in which tibial plateaus are aligned with the beam, may contribute to higher readability of OSTs in these compartments.
We obtained a statistically significant performance difference in terms of AUC between 3D femoral models and 2D femoral models (p-value = 0.0005 for both FL and FM). Similarly, the p-values for AP were 0.001 and 0.0005, and for BA were 0.0005 and 0.0005, for FM and FL, respectively. However, there were no statistically significant differences in the AUCs between the X-ray and MRI-based models for the TL and TM regions, with p-values of 0.13 and 0.27, respectively. Similarly, the p-values for AP were 0.05 and 0.20, and for BA were 0.13 and 0.11, for TL and TM, respectively. The full comparisons of the models are summarized in Table 2. Overall, both the 2D and the 3D models for femoral OST detection exhibited slightly higher performance in the medial part than in the lateral part. In the tibial region, the predictive power of the 3D model was greater in the lateral than in the medial part, while that of the 2D model was similar in both. For larger OSTs, all models across all compartments demonstrated consistently high sensitivity values ranging from very good to excellent (0.84-1.00). This indicates that the addition of menisci or cartilage features may not significantly enhance detection beyond bone morphology alone.

| DISCUSSION
In this study, we investigated the discriminative capacity of knee morphology to detect OSTs from X-ray and MRI data. Knee X-ray images were segmented using the developed DL model.
Note: The numbers represent the mean (standard deviation) values of average precision (AP), area under the ROC curve (AUC), and balanced accuracy (BA) metrics. Significantly higher scores (p < 0.001) are highlighted with an asterisk (*). Additionally presented are the sensitivity values for OARSI OST grades 1-3 (Sns1-3, respectively). The higher scores are highlighted in boldface.

Figure 1 .
The participants had an average age of 62.3 years (range: 45.0-79.0) with a standard deviation of 9.0, and an average BMI of 29.5 (range: 18.2-48.7) with a standard deviation of 4.8. The male/female ratio was ∼0.70 (1145 males and 1633 females).

FIGURE 1 Schematic view of our workflow. (A) Detection of osteophytes (OSTs) from X-ray-based 2D bone morphology. (A-1) Preprocessing for bone segmentation. The ROIs were localized and aligned based on landmarks produced by BoneFinder software. (A-2) Bone segmentation using deep learning. We developed a modified U-Net model to perform the segmentation automatically. (A-3) Pre-processing for OST detection. We cropped the ROI around the joint area and subsequently divided the masks into the lateral and medial parts. (A-4) Detection of OSTs. We used a set of 2D ResNet-18 CNN models, developed for each compartment independently. (B) Detection of OSTs from MRI-based 3D morphology. (B-1) We used the validated automatic segmentations produced by Tack et al. from sagittal DESS MRI data. (B-2) Pre-processing for OST detection. We cropped the ROI around the joint area and split the masks into the lateral and medial parts consistently with the X-ray data. (B-3) Detection of OSTs. We used a set of 3D ResNet-10 CNN models, one per compartment.

3 years of experience in musculoskeletal image analysis, emphasizing the identification of osteophytic regions. Sample selection maintained a distribution of OST grades similar to that of the original data. The annotated samples were split into training and evaluation sets. We employed a U-Net-based model 28 with modifications, featuring seven levels in both the encoder and decoder. A fivefold cross-validation training strategy was utilized, optimizing the model to minimize the loss function, defined as a combination of generalized Dice loss and Boundary loss to enhance boundary accuracy. The best-performing models were selected based on minimum validation loss. Subsequently, we automatically segmented the remaining knees in the complete data set, averaging class-wise softmax predictions from five segmentation models. The predicted class for each pixel was determined through thresholding at 0.5. The segmentation performance was assessed using Average Surface Distance (ASD) and Dice Similarity Coefficient (DSC).
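The ensembling and overlap evaluation described above can be sketched as follows for a single class. Probability maps and masks are represented as flat lists for brevity; the function names and the inclusive comparison at the 0.5 threshold are assumptions made for this example.

```python
def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel probabilities over models, then threshold.

    prob_maps: one flat list of class probabilities per segmentation model.
    """
    n_models = len(prob_maps)
    mean_probs = [sum(values) / n_models for values in zip(*prob_maps)]
    return [1 if m >= threshold else 0 for m in mean_probs]


def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient of two binary masks, in the range [0, 1]."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * intersection / total
```

Multiplying the returned value by 100 gives a percentage-scale score comparable to the DSC values reported in this paper.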

FIGURE 2 Examples of different OARSI osteophyte (OST) grades. (A) OARSI OSTs are assessed within four anatomical compartments of the knee joint: Medial Femur (FM), Lateral Femur (FL), Medial Tibia (TM), and Lateral Tibia (TL). OSTs with severity from grade 0 to grade 3 are visualized as seen in X-ray (B) and coronal MR image reprojection (C). White arrows point at distinct OST features. The severity of OSTs can be visually appreciated by their size and shape.
TABLE 1 Compartment-wise distribution of OARSI osteophyte (OST) grades in the data set.
, as compared to 0.53 in the BON model. The results highlight the importance of menisci morphology in the detection of smaller OSTs at TM. Bone morphology alone showed poor sensitivity values (ranging from 0.53 to 0.56) in detecting small-sized FL and FM OSTs, and a moderate value of 0.65 for TL OST grade 1. Incorporating all soft tissue features only marginally improved the models' predictive power, by up to 0.03.
Subsequently, we analyzed the added value of integrating soft tissue morphology in detection of OSTs from MRI-based masks. All models demonstrated high performance in OST detection, with BA values ranging from 0.71 to 0.81. The complete quantitative comparison of the models is provided in Table 3.
TABLE 2 Comparison of X-ray and MRI-based models for osteophyte (OST) detection.