The Effect of Image Resampling on the Performance of Radiomics‐Based Artificial Intelligence in Multicenter Prostate MRI

Single center MRI radiomics models are sensitive to data heterogeneity, limiting the diagnostic capabilities of current prostate cancer (PCa) radiomics models.

However, its performance depends on the level of expertise of the radiologist [3,4]. Despite extensive experience, radiologists have been reported to have low positive predictive values of 35% and 49% for Prostate Imaging-Reporting and Data System (PI-RADS) scores ≥3 and ≥4, respectively, with considerable variation across centers [5]. Multiple machine learning solutions have been proposed to improve the MRI-based detection of crPCa lesions [6]. Among these, radiomics has demonstrated a high diagnostic performance [7,8]. However, single-center MRI radiomics models are sensitive to data heterogeneity [10,11]. This limitation hampers the diagnostic capabilities of radiomics PCa models, considering the importance of large multicenter datasets for creating high-performing and generalizable radiomics models [10,12]. Data heterogeneity leads to diagnostically irrelevant differences in radiomics features that reduce final model performance [10,13,14]. Pixel spacing and slice thickness have been reported to be the most important sources of feature variation in multicenter computed tomography radiomics [15], and MRI radiomics have been shown to be similarly affected by variations in pixel size and slice thickness [16].
We hypothesized that differences in acquisition parameters affecting voxel dimensions play a significant role in radiomics feature variability in MRI of the prostate and, consequently, affect diagnostic performance. One previous study recommended using isotropic voxel or pixel spacings to create rotationally invariant texture features [9]. However, no standardized method was provided for this isotropic recommendation [9]. Additionally, we hypothesized that adopting a normalized voxel spacing for T2-weighted (T2W) images and for diffusion-weighted images (DWI) separately will improve multicenter radiomics model performance. Therefore, the purpose of this study is to investigate the effect and settings of image resampling normalization on radiomics artificial intelligence (AI) in a multicenter prostate MRI setting.

Patient Data
The institutional review board of the two tertiary academic centers approved this study and waived the need for informed consent owing to its retrospective nature. The patient data consisted of 930 patients who were scanned in nine different medical centers: two tertiary care academic centers and seven non-academic centers. MRI scans were performed between 2014 and 2020 on 1.5T or 3T systems (Philips Healthcare, Best, The Netherlands: Ingenia 3T, Achieva 1.5T, and Intera 1.5T; or Siemens Healthineers, Erlangen, Germany: Skyra 3T, Prisma 3T, Aera 1.5T, Avanto 1.5T, and Espree 1.5T). MRI protocols followed the settings recommended in the PI-RADS guidelines [17]. The reference standard was defined according to the International Society of Urological Pathology (ISUP) and obtained through biopsy. Lesion labels were categorized as either crPCa (ISUP grade ≥ 2) or nonrelevant entities. All lesion labels were determined based on ISUP grades obtained with MRI-TRUS fusion biopsy, TRUS biopsy, cognitive fusion biopsy, or prostatectomy. A radiologist (D.Y.) with 8 years of experience in prostate MRI assigned PI-RADS scores to each lesion and created lesion segmentations using ITK-Snap version 3.8.0 [17,18]. Patient characteristics including age, prostate-specific antigen (PSA), and PSA density were extracted from patient files. Detailed descriptions of institutions, vendors, scanners, acquisition protocols, lesion grading, and pathology results are provided in a previous study with a different purpose and research question that used the same 930 patients [19].
This study included patients suspected of harboring crPCa who underwent prostate MRI and subsequent biopsy or prostatectomy. First, patients and their lesions were excluded if tissue sampling was performed more than 6 months after the MRI examination. Second, patients were excluded if they had incomplete biparametric MRI (bpMRI) studies that did not adhere to the PI-RADS version 1 guidelines owing to the absence of required sequences, i.e., T2W acquisition (sagittal, coronal, transverse) and DWI acquisition (b-values: 50, 400, 800, 1400 s/mm², with a mono-exponentially calculated apparent diffusion coefficient [ADC] map). Considering the limited role of dynamic contrast-enhanced imaging in current prostate MRI guidelines, it was excluded from this study. For patients who underwent multiparametric MRI, only the aforementioned T2W and DWI sequences were extracted.

Resampling Method and Image Resolution
Voxel spacings without resampling were collected for the T2W and DWI sequences of all potentially eligible lesions (see Supplementary Material 1). The dataset was considered clinically representative (multicenter and multiscanner), and the collected voxel spacings were used to select the resampling voxel spacing. The three most frequent voxel spacings for T2W and DWI were selected to minimize the degree of interpolation, thereby reducing resampling artifacts [20]. The selected isotropic voxel spacings were 0.35, 0.5, and 0.8 mm for T2W images and 1.37, 2, and 2.5 mm for DWI images. Resampling was performed with four commonly used image interpolation methods provided by SimpleITK and available in Pyradiomics [21,22]: nearest neighbor, linear, Bspline, and windowed-sinc interpolation [25]. Each algorithm used a different approach to represent the original data raster with a new pixel size in 2D or voxel size in 3D. The inclusion of the first three interpolation methods was based on their extensive usage for image interpolation, while the addition of windowed-sinc interpolation was motivated by its aim of minimizing aliasing artifacts [26]. Of the windowed-sinc interpolation options, the Blackman window was selected because it was most frequently recommended [27].
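The spacing selection can be illustrated with a minimal sketch, assuming the collected in-plane spacings are available as a plain list. The values below are hypothetical and are not taken from Supplementary Material 1, and the function names are illustrative only. The sketch keeps the three most frequent spacings and computes the grid size that an isotropic resample would need to preserve the physical extent of a volume.

```python
from collections import Counter


def most_frequent_spacings(spacings_mm, k=3, ndigits=2):
    """Return the k most frequent spacings (rounded to ndigits), sorted ascending."""
    counts = Counter(round(s, ndigits) for s in spacings_mm)
    top = [spacing for spacing, _ in counts.most_common(k)]
    return sorted(top)


def resampled_size(size, spacing_mm, new_spacing_mm):
    """Grid size that keeps the physical extent at a new isotropic spacing."""
    return tuple(
        max(1, round(n * s / new_spacing_mm)) for n, s in zip(size, spacing_mm)
    )


# Hypothetical in-plane spacings (mm) for a handful of T2W acquisitions.
observed = [0.5, 0.5, 0.35, 0.8, 0.5, 0.35, 0.8, 0.8, 0.5, 0.27, 0.6]
candidates = most_frequent_spacings(observed)

# Hypothetical 320 x 320 x 24 volume at 0.5 x 0.5 x 3.0 mm, resampled to 0.8 mm.
new_size = resampled_size((320, 320, 24), (0.5, 0.5, 3.0), 0.8)
```

In this toy case the through-plane axis grows from 24 to 90 samples, which shows how isotropic resampling to an in-plane spacing can require far more interpolation along the slice direction than within the slice.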

Radiomics Extraction and Model Development
Radiomics features were extracted from a deep learning masked (DLM) auto-fixed volume of interest (VOI), which was used to automatically determine a suitable VOI around a given lesion coordinate [19]. Previous studies have shown that DLM is faster and has better diagnostic performance than manual segmentation, and that it ensures a reproducible radiomics segmentation by selecting only voxels inside the prostate [19]. The coordinates for auto-fixed VOI placement in each lesion were determined as the lowest ADC voxel within the corresponding lesion segmentation. The following settings were varied in data pre-processing: 1) six spacing options (T2W: 0.35, 0.5, 0.8 mm; DWI: 1.37, 2, 2.5 mm), 2) four interpolation algorithms (nearest neighbor, linear, Bspline, and Blackman windowed-sinc), and 3) two dimensions (2D and 3D). Radiomics datasets were created separately for T2W and DWI. The T2W datasets contained radiomics features extracted from axial, sagittal, and coronal T2W. The DWI datasets contained features extracted from b50, b400, b800, b1400, and ADC maps. The same VOI was used for both 2D and 3D feature extraction. In the 2D approach, feature extraction was performed on each individual slice, while the 3D approach involved a single extraction across the entire volume.
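The VOI seed placement described above (lowest ADC voxel inside the lesion segmentation) can be sketched as follows. This is an illustration under simplifying assumptions, not the study's implementation: the ADC map and mask are small hypothetical nested lists indexed as [z][y][x], and `lowest_adc_voxel` is an illustrative name.

```python
def lowest_adc_voxel(adc, mask):
    """Return the (z, y, x) index of the lowest ADC value inside the mask.

    adc and mask are nested lists indexed as [z][y][x]; mask entries are 0 or 1.
    """
    best_index, best_value = None, float("inf")
    for z, (adc_slice, mask_slice) in enumerate(zip(adc, mask)):
        for y, (adc_row, mask_row) in enumerate(zip(adc_slice, mask_slice)):
            for x, (value, inside) in enumerate(zip(adc_row, mask_row)):
                if inside and value < best_value:
                    best_index, best_value = (z, y, x), value
    if best_index is None:
        raise ValueError("segmentation mask is empty")
    return best_index


# Hypothetical 2 x 2 x 3 ADC map and lesion mask (values are made up).
adc = [
    [[1200, 900, 1100], [400, 950, 1000]],
    [[1300, 700, 1250], [1150, 800, 1050]],
]
mask = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 0]],
]
seed_voxel = lowest_adc_voxel(adc, mask)
```

The global minimum (400) lies outside the mask and is ignored, so the seed is the lowest value among the segmented voxels only.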
All radiomics models were developed with the same gradient boosting model, hyperparameter optimization, and model selection approach, so that only the effects of resampling spacing, interpolation, dimension, and MRI sequence were studied. The lesions were randomly divided into three sets: 70% for training, 10% for validation, and 20% for testing. The training dataset was used to optimize the feature selection and model hyperparameters, while the validation dataset was used for model selection by comparing the validation performance between models. The selected models with the highest validation performance were used to predict on the hold-out test data to determine the final diagnostic performance. Radiomics feature extraction was performed using pyradiomics v3.0.1. Detailed information on the radiomics settings and model development can be found in Supplementary Material 2.
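A minimal sketch of a 70/10/20 random split is given below. It assumes a plain, unstratified shuffle of lesion identifiers with a fixed seed; whether the study stratified the split or grouped lesions by patient is not stated here, so this does not attempt to reproduce the reported subset sizes. The function name and the 100 placeholder IDs are hypothetical.

```python
import random


def split_lesions(lesion_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle lesion IDs and cut them into train/validation/test lists."""
    ids = list(lesion_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the hold-out test set
    return train, val, test


# Hypothetical lesion identifiers 0..99.
train, val, test = split_lesions(range(100))
```

Fixing the seed keeps the same partition across all resampling configurations, so that differences between models are not caused by different splits.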

Optimization of the Resampling Strategy
Since T2W and DWI were acquired at different original spacings, separate resampling strategies were employed for each sequence, with the aim of optimizing their combined diagnostic performance. Therefore, a two-stage approach was adopted. In the first step, a uniparametric preselection was performed by selecting the four best-performing resampling strategies for each of 2D T2W, 3D T2W, 2D DWI, and 3D DWI. In the second step, the preselected T2W and DWI strategies were combined and jointly evaluated in a full bpMRI radiomics model. This approach avoided the time-consuming and computationally expensive search over the combination of 48 × 48 model variations (6 spacing options × 4 interpolation methods × 2 dimensions).
The uniparametric experiment evaluated all combinations per sequence and interpolation method. The validation scores were then ranked, and the highest scoring resampling size option was retained for each interpolation, dimensionality, and sequence. This resulted in four 2D T2W models, four 2D DWI models, four 3D T2W models, and four 3D DWI models.
The multiparametric experiment assessed the validation score of both bpMRI sequences combined. By exploring all possible combinations of the eight 2D options, a total of 16 2D bpMRI models were generated. Similarly, the combinations of the eight 3D options selected in the uniparametric experiment resulted in 16 3D bpMRI models. Each of the trained 2D and 3D bpMRI models was uniquely identified as "Combo2D-[ID]" and "Combo3D-[ID]," respectively. The optimal 2D and 3D bpMRI resampling configurations were selected based on the validation score and used to determine the final test performance. Two baseline models (2D and 3D) were developed on bpMRI datasets without resampling for comparison.
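The two-stage search can be sketched for a single dimensionality as follows. This is an illustration under stated assumptions: the validation scores are synthetic placeholders rather than study results, and the names `INTERPOLATORS`, `preselect`, and `combine` are hypothetical. Stage 1 keeps the best spacing per interpolator for each sequence; stage 2 pairs every retained T2W option with every retained DWI option.

```python
from itertools import product

INTERPOLATORS = ("nearest", "linear", "bspline", "blackman_sinc")


def preselect(scores):
    """Stage 1: keep the best-scoring spacing for each interpolator.

    scores maps (interpolator, spacing_mm) -> validation score.
    """
    best = {}
    for (interp, spacing), score in scores.items():
        if interp not in best or score > best[interp][1]:
            best[interp] = (spacing, score)
    return [(interp, best[interp][0]) for interp in INTERPOLATORS if interp in best]


def combine(t2w_options, dwi_options):
    """Stage 2: pair every preselected T2W option with every DWI option."""
    return list(product(t2w_options, dwi_options))


# Synthetic validation scores for one dimensionality; not results from the study.
t2w_scores = {(i, s): 0.60 + 0.01 * k + 0.02 * j
              for k, i in enumerate(INTERPOLATORS)
              for j, s in enumerate((0.35, 0.5, 0.8))}
dwi_scores = {(i, s): 0.70 + 0.01 * k - 0.02 * j
              for k, i in enumerate(INTERPOLATORS)
              for j, s in enumerate((1.37, 2.0, 2.5))}

t2w_options = preselect(t2w_scores)
dwi_options = preselect(dwi_scores)
combos = combine(t2w_options, dwi_options)
```

With four retained options per sequence, the second stage yields 4 × 4 = 16 bpMRI candidates per dimensionality, matching the 16 2D and 16 3D models described above.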

Statistical Analysis
Test performances (area under the curve [AUC], sensitivity, and specificity, with 95% confidence intervals [CI] obtained from 2000 bootstrap resamples) were compared for the best 2D bpMRI model and the best 3D bpMRI model. The same metrics were analyzed for the 2D and 3D baseline models.
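A bootstrapped confidence interval for the AUC can be sketched as below. The percentile method and the pairwise AUC estimator are assumptions made for illustration, since the exact bootstrap variant is not specified here; the labels and scores are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # resample drew a single class; AUC is undefined
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical test labels (1 = crPCa) and model scores.
labels = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.7, 0.2, 0.9, 0.55, 0.6, 0.45, 0.5]
point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores)
```

The same resampling loop can be reused for sensitivity and specificity by swapping the metric function.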
Optimal sensitivity and specificity thresholds were obtained using Youden's index [28]. DeLong tests were used to compare the test AUCs of the 2D and 3D T2W, DWI, and combined resampling models with those of the respective baseline models without resampling [29].
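Youden's index selects the threshold that maximizes J = sensitivity + specificity - 1. The sketch below scans every observed score as a candidate threshold; the data are hypothetical and `youden_threshold` is an illustrative name.

```python
def youden_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) maximizing J = sens + spec - 1.

    A lesion is called positive when its score is >= threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, threshold, sens, spec)
    return best[1], best[2], best[3]


# Hypothetical labels (1 = crPCa) and model scores.
labels = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.7, 0.2, 0.9, 0.48, 0.6, 0.45, 0.5]
threshold, sensitivity, specificity = youden_threshold(labels, scores)
```

To keep the reported test metrics unbiased, such a threshold would normally be fixed on training or validation data before being applied to the test set; which subset was used in this study is not stated here.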

Data Selection
The final dataset consisted of 737 lesions (the Standards for Reporting of Diagnostic Accuracy Studies diagram can be found in Fig. 1). The dataset split led to a training subset of 500 patients, a validation subset of 89 patients, and a separate test subset of 148 patients. The distribution of PI-RADS scores among the selected lesions was as follows: PI-RADS 0/1/2 (crPCa unlikely) with 131 patients, PI-RADS 3 (crPCa equivocal) with 152 patients, PI-RADS 4 (crPCa likely) with 318 patients, and PI-RADS 5 (crPCa highly likely) with 127 patients. The ISUP grade distribution for the selection was: ISUP 1 with 461 patients, ISUP 2 with 166 patients, ISUP 3 with 59 patients, ISUP 4 with 29 patients, and ISUP 5 with 22 patients. The resulting binary lesion labels were 461 nonrelevant entities and 276 crPCas. No significant differences were found in the distribution of crPCa versus nonrelevant entities between the training/validation and test sets (χ² = 0.77, P = 0.38). Patient age in the selected group ranged from 46 to 88 years with a median of 70 years. PSA levels ranged from 0.24 to 147 μg/L with a median of 8.9 μg/L. PSA density ranged from 0.01 to 3.14 μg/L² with a median of 0.194 μg/L².
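The class-balance comparison between subsets can be illustrated with a Pearson chi-square test on a 2 × 2 table. The sketch below is an assumption-laden illustration: the contingency counts are hypothetical because the per-subset label counts are not listed here, and no continuity correction is applied. The last line only checks that a statistic of 0.77 with one degree of freedom corresponds to a P-value of about 0.38.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows are (train+validation, test),
# columns are (nonrelevant, crPCa).
example_stat = chi_square_2x2([[60, 40], [25, 25]])
example_p = p_value_df1(example_stat)

# P-value implied by the reported statistic of 0.77 with one degree of freedom.
reported_p = p_value_df1(0.77)
```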

Discussion
This study explored image resampling in MRI radiomics for the detection of crPCa in a multicenter setting. We found a promising diagnostic performance for 2D and 3D radiomics models using optimized image resampling strategies, which outperformed the baseline models without resampling. No significant performance difference was found between optimized 2D and 3D radiomics, indicating that 3D radiomics calculation may have limited benefit over 2D radiomics for crPCa diagnosis.
The observed performance improvement from applying image resampling can be attributed to the sensitivity of various radiomics feature classes to heterogeneity in voxel spacing [9,13,15,16]. Feature classes capturing texture patterns, such as gray-level co-occurrence matrix or wavelet features, may be particularly sensitive to voxel spacing variations, as they take into account spatial relationships between voxels. Similarly, gray-level run-length features may be influenced, since images acquired at larger voxel spacings may result in shorter run lengths and vice versa.
Furthermore, though their scores deviated slightly, the overall performance of 2D and 3D radiomics models was similar. This finding was unexpected, since the 3D approach contains more image information and was expected to provide greater performance. One explanation might be the isotropic resampling approach used to make texture features rotationally invariant [9]. The slice thickness range mostly did not overlap with the in-plane spacing range. Since the resampling process was isotropic and the spacing options were chosen based on the in-plane spacing, the slice direction might have required a larger degree of interpolation.
Another related observation is the consistently lower diagnostic performance of T2W models compared to DWI models in both 2D and 3D. The larger mismatch between in-plane resolution and slice thickness in T2W compared to DWI might play a role. For T2W, the original in-plane resolution range (0.1-0.8 mm) did not overlap with the slice thickness range (2-5 mm), which made it challenging to select an isotropic spacing close to both in-plane resolution and slice thickness. The higher degree of interpolation applied to the T2W images compared to DWI may have affected performance. However, a similarly lower performance was observed for the 2D T2W model, which does not use z-axis information, suggesting that T2W may contribute less than DWI to the diagnostic performance of radiomics for crPCa. Another potential factor explaining the higher DWI performance is that the ADC maps included in the DWI dataset provide normalized and quantitative information, which may result in more consistent feature extraction despite differences in image acquisition. A previous study by Khalvati et al reported a similarly lower crPCa diagnostic performance using T2W radiomics compared to ADC and DWI [30]. Interestingly, they further found that a combined bpMRI radiomics model did not improve performance compared to a uniparametric DWI model. It may therefore be interesting to explore the possibility of excluding T2W altogether from radiomics crPCa diagnosis to reduce time and costs. Nevertheless, further investigation is warranted, because the investigation of feature selection preference showed that resampled 3D T2W radiomics features were included in the final model when available. This implies that 3D T2W features still carry some importance for the diagnosis of crPCa in the training setting.
The findings in this study are supported by previous research. Park et al showed that MRI radiomics feature robustness increased after pixel size resampling and interpolation when compared to the original MRI radiomics features [16]. However, their study was performed on a smaller cervical cancer dataset of 254 patients. Orlhac et al validated a practical realignment approach to compensate for the technical variability of MRI radiomics features [31]. Using a small multi-vendor PCa dataset (N = 39), they showed that ComBat harmonization, a method originally introduced to adjust batch effects in genomics, can effectively remove inter-center technical inconsistencies in radiomic feature values [32]. However, the limited number of patients included raises questions about the generalizability of the results. Ligero et al also observed that ComBat compensation showed the highest improvement for radiomics-based classification, but they also used a relatively small dataset (N = 43) and focused on CT of the liver instead of MRI in PCa [15].
While both studies observed that ComBat compensation has a positive effect on performance by reducing multicenter feature variability, it is important to note that the technique is not without challenges. ComBat requires extensive data labeling, which increases time investment [33]. Moreover, completely new data needs to be added to the existing pool in order to be properly harmonized [34]. Additionally, ComBat does not always seem to work in multicenter radiomics, as observed by Bourbonne et al, who did not find significant improvement when including ComBat compensation [35].

Limitations
First, the retrospective nature of this study might have introduced some bias. However, the comparative design of the study mitigates the impact of this bias on the overall conclusions. Second, the current recommendations for interpolation optimization starting points have been derived specifically from bpMRI data for crPCa diagnosis. These starting points might be less effective when applied to other multicenter radiomics AI datasets. Third, the dataset used in this study was relatively small, which may affect the generalizability of the findings.

Conclusion
Image resampling has a significant effect on the performance of multicenter radiomics PCa AI. Experimentation is recommended to find the optimal resampling settings. A recommended starting point for 2D is isotropic resampling with T2W at 0.5 mm using Bspline interpolation and DWI at 2 mm using nearest neighbor interpolation. For 3D, the suggested starting point is isotropic resampling with T2W at 0.8 mm using linear interpolation and DWI at 2.5 mm using nearest neighbor interpolation.

FIGURE 1: Standards for Reporting of Diagnostic Accuracy Studies flow diagram for lesion selection included in the study.