Automatic Multiparametric Magnetic Resonance Imaging-Based Prostate Lesions Assessment with Unsupervised Domain Adaptation

Multiparametric magnetic resonance imaging (mpMRI) has emerged as a valuable diagnostic tool in prostate lesion assessment. However, training convolutional neural networks (CNNs) inevitably involves magnetic resonance (MR) images from multiple cohorts. There always exists variation in scanning protocol among cohorts, inducing significant changes in data distribution between source and target domains. This challenge has greatly limited clinical adoption on a large scale. Herein, a coarse mask‐guided deep domain adaptation network (CMD2A‐Net) is proposed to develop a fully automated framework for prostate lesion detection and classification (PLDC). No category or mask label is required from the target domain. A coarse segmentation module is trained to cover the possible lesion‐related regions, so that attention maps can be generated to dedicate the local feature extraction of lesions within those regions. Experiments are performed on 512 mpMRI sets from datasets of PROSTATEx (330 sets) and two cohorts, A (74 sets) and B (108 sets). Using ensemble learning, CMD2A‐Net accomplishes an AUC of 0.921 in cohort A and 0.913 in cohort B, demonstrating its transferability from a large‐scale public dataset PROSTATEx to small‐scale target domains. Results from an ablation study also support its effectiveness in classification between benign and malignant lesions, compared to the state‐of‐the‐art models. An interactive preprint version of the article can be found here: https://doi.org/10.22541/au.166081031.11420810/v1.


Introduction
Prostate cancer (PCa) is the second most prevalent cancer among males.[1] The number of diagnoses is estimated to increase by approximately 1.7 million worldwide by 2030.[2] Accurate prostate lesion assessment, particularly for classifying clinically significant PCa (csPCa; Gleason score [GS] ≥7)[3] from indolent non-csPCa, can greatly facilitate tailored treatments.[4] The broad range of PCa's behavioral pathology makes assessment challenging.[5] Current clinical assessment relies on prostate-specific antigen (PSA) blood testing, which, if positive, requires a transrectal ultrasound (TRUS) biopsy. However, PSA in conjunction with blind TRUS biopsy has a high false-negative rate (approximately 20%), resulting in unnecessary biopsies.[6] It is also highly prone to causing underdetection of csPCa or overdetection of non-csPCa.[7]

Multiparametric magnetic resonance imaging (mpMRI) has become a gold standard for PCa diagnosis, even prior to biopsy.[8] It typically involves T2-weighted (T2) imaging, a high diffusion-weighted imaging (hDWI) sequence, and its derivative apparent-diffusion coefficient (ADC) maps.[4,9] Although magnetic resonance imaging (MRI) acquisition and interpretation have been standardized with the guidance of the Prostate Imaging Reporting and Data System version 2.1 (PI-RADS v2.1),[10] image interpretation is still time-consuming for readers,[5] and significant inter-reader variation inevitably remains.[11] To this end, numerous learning-based methods have been proposed to facilitate efficient, accurate, and reliable prostate lesion assessment. In 2017, an international contest, the PROSTATEx Challenge,[12] was organized. Twenty-one teams proposed models with areas under the receiver operating characteristic (ROC) curve (AUC) ranging from 0.80 to 0.87.[13] Unlike traditional methods relying on inputs with handcrafted features,[14] all of them employed CNNs[15] to detect complex semantic features automatically, demonstrating significant advantages over traditional methods in PLDC.
To enhance network training, prostate MR images have to be precropped manually so as to retain the prostate region, which originally occupies only a small portion of the entire image set. A few recent studies (e.g., refs. [2,16]) proposed automated PLDC frameworks to reduce the effort of repeated manual prostate segmentation. CNNs were also utilized to segment the target region, identifying the prostate profile. Despite notable progress, these studies still assumed that the training and testing datasets share the same data distribution across the source and target domains. This is an overly ideal assumption,[17] because in normal practice prostate MR images from a single cohort are scarce or are typically not publicly available.[14] Most likely, it is necessary to collect and aggregate images from multiple cohorts to maintain sufficient samples for robust model training. Inevitably, these multisite images exhibit apparent discrepancies in scanning protocols, in-plane resolutions, fields of view (FoV), etc.[17,18] These inherent intersite discrepancies cause "domain shift" when models are trained in the source domain but applied in the target domain. This can significantly degrade the overall model performance, biasing the PLDC results.
Several paradigms have been proposed to resolve domain shift. An intuitive solution is to directly mix heterogeneous images from multiple cohorts to make the training data adequate. However, this approach does not explicitly improve the model's prediction capability; on the contrary, the model can be limited by overfitting when distribution heterogeneity is significant.[18,19] Another common practice is pretraining the model in the source domain and then fine-tuning it in the target domain. This generally requires sufficient labeled data from the target domain to manually tune massive network parameters, which can still be a labor-intensive process. Domain adaptation (DA) has emerged as a more promising method, allowing effective knowledge transfer[17,20] from the label-rich source domain to the target domain. Recently, unsupervised DA (UDA) methods have drawn increased attention, as they do not require target labels for training.[21] These can be generally categorized as image translation and feature alignment approaches. In the former, the models align image appearance[17,22] by translating images from one domain to another using generative models, such as generative adversarial networks (GANs).[23] Difficulties mainly come from whole-slide image translation and from image synthesis when image similarity is insufficient. Additionally, these models usually focus on low-level feature extraction and thus suffer from inconspicuous lesion texture and characteristics.[24] In contrast, feature alignment-based models could be more effective in resolving domain shift by extracting domain-invariant features, either by minimizing the correlation distance between domains[25] or by assimilating feature distributions through adversarial learning.[26] Yet, very few of them are dedicated to prostate lesion detection and/or classification, particularly using mpMRI. Therefore, an effective UDA model for fully automated mpMRI-based PLDC is highly desirable for use prior to any invasive biopsy.
In this work, we develop CMD2A-Net for both coarse prostate lesion detection and lesion malignancy classification. We also extend the proposed network to an open-source system. This executable end-to-end system takes mpMRI sequences as input and outputs coarse lesion contours as well as lesion malignancy. The system can also be downloaded online. Our contributions are summarized below:
1) Development of a deep-learning-based system for fully automated prostate lesion assessment. Our end-to-end system is dedicated to PLDC on multicohort mpMRI without the need for prior manual processing of mpMRI sequences.
2) Design of a UDA model (i.e., CMD2A-Net) capable of leveraging cross-site representation transfer to realize accurate PLDC without requiring target labels. Weakly supervised coarse lesion segmentation modules are incorporated to extract informative lesion features, thus facilitating feature alignment between domains.
3) Experimental evaluation of CMD2A-Net on one public dataset (i.e., PROSTATEx[12]) and three local cohort datasets, including lesion assessments with various mpMRI sequence inputs, comparisons with state-of-the-art models, and an ablation study. The capability of transferring knowledge from PROSTATEx to our small-scale local cohort datasets is demonstrated against the state-of-the-art models.

Related Work
CNNs have proved effective and have been widely applied to mpMRI-based PCa classification with promising performance. Wang et al.[13a] explored optimal combinations of mpMRI sequences as input for the CNN, and their model achieved an AUC of 0.95, which was reported to outperform all models in the PROSTATEx Challenge. Beyond PCa classification alone, Kiraly et al.[27] developed a model with an encoder-decoder architecture to detect prostate lesions and simultaneously classify lesion malignancy.[22a,28] End-to-end PLDC frameworks have also been investigated, with the aim of avoiding manual prostate segmentation. Yang et al.[2] incorporated a CNN for automatic segmentation ahead of PLDC. However, the insufficient prostate image features extracted by the shallow network (i.e., five layers) could deteriorate the overall segmentation performance. Later, Wang et al.[29] proposed a deeper prostate segmentation model capable of detecting more complex features. Apart from improving the segmentation performance, fusing spatial features using 3D CNNs is another means to enhance the accuracy of PCa classification. Mehta et al.[30] employed a patient-level 3D model for binary classification using volumetric mpMRI, achieving AUCs of 0.79 and 0.86 on their local cohort dataset and PROSTATEx, respectively. However, only single-cohort datasets were used to evaluate the model. Domain shift would occur when it is directly applied to an unseen cohort.[17,18] The very few studies (e.g., Mehta et al.[30]) that use mpMRI sequences from multiple cohorts simply combine the heterogeneous images, which yields sufficient samples for model training but ignores data source heterogeneity. This approach is prone to severe domain shift, with predictions biased toward particular cohorts.
Very recently, many studies have investigated DA approaches to alleviate intersite distributional variability, among which UDA methods demonstrated their advantages in exploiting unlabeled target samples.[20] Such UDA methods can be categorized into two groups: 1) image translation and 2) feature alignment approaches. The former performs image appearance alignment.[17,22] The resultant models translate images across domains using GAN-based networks.[23] However, texture similarity between the synthesized target image and the source image is crucial for the PLDC problem.[22c] Lesions could also be missed during the translation process due to varying transferability among image regions, thus worsening the DA process.[31] Moreover, the GAN models would distort the appearance of the nonlesion region, further causing unreliable lesion assessment results.[24] With feature alignment approaches, domain-invariant features are extracted to reduce domain shift.[26] A common way is to minimize the distribution distance between domains (e.g., in terms of second-order correlation[25]) using a Siamese network architecture. Adversarial learning[26a] can also align features by making the cross-domain features indistinguishable to a domain classifier. For instance, Wang et al.[14] developed a GAN-based method to learn domain-invariant features from mammographic images acquired for breast cancer screening.[26b,28] Previous works[24,26b] revealed that not all image regions can facilitate knowledge transfer across domains. Roughly aligning the features in the whole image set would introduce irrelevant knowledge, resulting in ineffective DA. It is hypothesized that the background regions on mpMRI sequences, such as regions outside the prostate gland, would not significantly improve DA in our PLDC problem.[13b]

Results and Discussion

Datasets
Five datasets were utilized in this study, i.e., Initiative for Collaborative Computer Vision Benchmarking (I2CVB),[34] PROSTATEx (P-x), and three datasets from Hong Kong hospital local cohorts, LC-A, LC-B, and LC-C. Note that LC-A and LC-B were acquired from the same MR imaging center. Table 1 shows the characteristics of these five datasets. I2CVB is available online (https://i2cvb.github.io/) and has been widely investigated for prostate zone segmentation.[8] It contains 646 T2 images acquired from 36 patients. Fifteen patients were scanned by 3.0-T Siemens scanners and 21 patients by 1.5-T General Electric scanners. Given the segmentation labels on the prostate, central gland, peripheral zone, and lesion, only image slices covering the prostate were selected as our samples. A Mask R-CNN model was employed for prostate segmentation using this dataset. P-x (https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=23691656), LC-A, LC-B, and LC-C are mpMRI-based datasets marked with point labels. The four datasets share the same set of category labels (i.e., csPCa and non-csPCa). P-x, LC-A, and LC-B were utilized to evaluate the PLDC performance of CMD2A-Net, including 330 cases from P-x, 74 cases from LC-A, and 108 cases from LC-B. To avoid the overfitting that its small size would cause, LC-C (29 cases) was used only for cross-site heterogeneity analysis.
The mpMRI samples from multiple domains exhibit apparent interdomain heterogeneity,[35] caused by differences in MRI scanners, diffusion b-values, in-plane resolutions, FoV, and subject cohorts/patient populations. As shown in Figure 1, the MRI[36] examples from P-x and LC-A demonstrate visible discrepancies in lesion morphology, prostate gland appearance, and image intensity distribution. These inherent multidomain discrepancies are inevitable and cause "domain shift"[37] that significantly degrades overall model performance in PLDC.

Analysis of Cross-Site Heterogeneity
We first evaluated prostate segmentation performance using the mean intersection over union (IoU), in order to ensure that the prostate regions can be predicted accurately. The IoU measures the overlap between the predicted prostate contour and the ground-truth mask label, and was computed on the test split of I2CVB. The mean IoU of the prostate region, central gland, and peripheral zone is 0.843, 0.781, and 0.516, respectively. These results are comparable with the work of Alkadi et al.,[8] which attained IoUs of 0.673 and 0.599 for the central gland and peripheral zone, respectively. This implies that the training set, which contains MR images from 36 patients, is already sufficient for accurate prostate segmentation. Additionally, the segmentation results are promising on images obtained from either a 1.5-T or a 3.0-T MRI machine, indicating that the segmentation performance is not sensitive to the scanner type (see details in Figure S1, Supporting Information).

Then, we analyzed cross-site heterogeneity on our multicohort datasets (P-x, LC-A, LC-B, and LC-C). We aimed to verify whether prior MR image intensity normalization (e.g., Liu et al.[38]) is effective in reducing domain shift when domain knowledge is not considered. The Coarse Mask-guided Network (i.e., CM-Net, in Figure 5) was utilized for cross-site heterogeneity analysis. Here, training a model on an individual dataset is defined as a "separate learning approach," while training a model using a combined dataset from multiple cohorts is defined as a "joint learning approach." As shown in Table 2, we trained the CM-Net using the individual and combined datasets from P-x, LC-A, and LC-B. The three separate models were individually trained in these three domains. They were set as the baselines for comparisons with the joint models. During the testing phase, each separate model was tested on the four datasets. LC-C acted only as the hold-out testing set for domain shift analysis, as its small size (only 29 cases) would cause overfitting in training and biased prediction in testing. Note that, owing to the limited sample size of the local cohorts (74 and 108 cases in LC-A and LC-B, respectively), the separate models of LC-A and LC-B were pretrained on the large-scale dataset P-x (330 cases) and then fine-tuned on the corresponding domain. Such a transfer learning strategy would reduce overfitting caused by data scarcity. A common preprocessing method, scaled, was employed to normalize the image intensities within [0,1].
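The mean IoU used above for the prostate segmentation check can be sketched in a few lines. This is a minimal illustration with NumPy, not the evaluation code of the study: the function names `iou` and `mean_iou`, the per-slice averaging, and the convention that two empty masks score 1.0 are all assumptions made for the example.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Both masks are empty: treated as perfect agreement (an assumption).
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

def mean_iou(pred_masks, true_masks):
    """Average the per-slice IoU over paired predicted and reference masks."""
    return float(np.mean([iou(p, t) for p, t in zip(pred_masks, true_masks)]))

# Toy 4x4 slice: the prediction covers 4 of the 6 reference pixels.
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True
true = np.zeros((4, 4), dtype=bool)
true[1:3, 1:4] = True
print(iou(pred, true))
```

In practice the same function would be applied separately to the prostate, central gland, and peripheral zone masks to obtain one mean IoU per structure.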
The results of the separate models from P-x, LC-A, and LC-B are shown in Table 2. For the three sequences (i.e., T2, ADC, and hDWI), the AUCs of the three separate models are relatively high when tested within their respective domains, but these AUCs drop sharply when the models are directly tested in the unseen domains. Such results show an appreciable cross-domain discrepancy (i.e., domain shift) among the four datasets. Note that, in terms of the T2 sequence, the separate models of LC-A and LC-B accomplish their highest testing AUCs (0.66 and 0.67) in the unseen domain, LC-C, which are just marginally higher than those (0.61) within their corresponding domains. A potential reason for the biased predictions is the small number of testing samples (i.e., 29) in LC-C. The joint models in the table do not bring remarkable improvements in any sequence compared with the separate models; instead, they may even lead to performance degradation due to cross-site heterogeneity.
With severe discrepancies among our datasets, we intended to validate whether rigorous MR image preprocessing methods can contribute to the classification performance of the joint models. Similar to scaled, whitening is another common preprocessing method, which normalizes the pixel values to zero mean and unit variance. We took the combined dataset, P-x and LC-A, as a representative for evaluation. In Table 3, scaled, whitening, and their combinations with bias field correction (BFC) or noise filtering (NF), six preprocessing methods in total, were adopted as in ref. [38]. The joint models using scaled and whitening acted as the two baselines for comparisons with the rigorous MR image preprocessing methods (i.e., BFC and NF). Figure 2 depicts image preprocessing examples of three methods (i.e., whitening, whitening + BFC, and whitening + NF). The left and right halves of each sample represent before and after preprocessing, respectively. Before preprocessing, we can observe noticeable intensity distribution discrepancies among the samples. The samples from LC-A are characterized by larger numbers of low-intensity grayscale pixels as compared with the images of P-x. Subsequently, jet color maps were employed to visualize the intensity distribution between domains after preprocessing. All the color maps shared the same intensity color scale. Similar intensity distributions among the samples can be found after preprocessing, demonstrating the effectiveness of the methods in image distribution harmonization.
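The two baseline normalizations described above can be sketched as follows. This is only an illustration under stated assumptions: the text specifies that scaled maps intensities into [0,1] and that whitening yields zero mean and unit variance, whereas computing the statistics per image (rather than per volume or per dataset), the function names, and the small `eps` guard against division by zero are choices made for this example. BFC and NF are not shown.

```python
import numpy as np

def scaled(image, eps=1e-8):
    """Min-max normalize intensities into [0, 1] (per image; an assumption)."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + eps)

def whitening(image, eps=1e-8):
    """Shift to zero mean and rescale to unit variance (per image; an assumption)."""
    image = np.asarray(image, dtype=np.float64)
    return (image - image.mean()) / (image.std() + eps)

# Toy "slice" with arbitrary intensity units.
img = np.array([[0.0, 50.0], [100.0, 200.0]])
print(scaled(img))
print(whitening(img))
```

Both transforms harmonize the intensity range only; they leave differences in resolution, FoV, and b-values untouched, which is consistent with the limited gains reported for the joint models.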
In Table 3, for the T2 sequence, BFC with either scaled or whitening outperforms the baselines. BFC with whitening also achieves the best AUCs of 0.91 and 0.80 on P-x and LC-A, respectively. However, these findings are not consistent with the results in ADC and hDWI. In terms of ADC, the models preprocessed with BFC or NF underperform the baselines. Instead, the baseline models receive the highest AUCs, where scaled alone and whitening alone accomplish 0.73 and 0.72 on P-x and LC-A, respectively. For hDWI, both BFC and NF demonstrate limited improvement over the baselines. On P-x, the AUC increases marginally from 0.73 (scaled only) to 0.80 (scaled with NF); on LC-A, only an AUC of 0.65 is achieved using scaled with BFC. The above results of the three sequences show that these preprocessing approaches could improve CM-Net's classification performance when combining our two datasets. However, none of the methods is capable of boosting the joint models' generalization considerably, as compared with the separate models of P-x and LC-A (in Table 2). This indicates that the preprocessing methods are probably insufficient to solve domain shift fundamentally. A possible reason is that the severe discrepancies may also arise from the intersite differences listed in Table 1, rather than only from the intensity distribution of the heterogeneous mpMRI sequences (see details in Figure 1).

Table note: The bold/shaded data highlight the baseline models' performance in the three sequences (T2, ADC, hDWI).

Figure 2. Image preprocessing examples (from P-x and LC-A) in the quantitative analysis of intersite heterogeneity. Among the six methods in Table 3, whitening, whitening + BFC, and whitening + NF act as representatives. The coarse lesion region is contoured (in red) on the randomly selected precropped T2 images. Prior to preprocessing (left half), the heterogeneity of the intensity distribution is clearly visible in the original samples, while the distributions are harmonized after preprocessing (right half). All the jet color maps share the same scale.

Cross-Domain Malignancy Classification and Lesion Detection
We emphasized the importance of knowledge transfer from a large-scale public dataset to a small-scale target domain. The malignancy estimation performance of CMD2A-Net (the architecture is shown in Figure 5) was evaluated. The dataset P-x was used only as a source domain. Either LC-A or LC-B was also set as the source domain for knowledge transfer between local cohorts. The scaled method was employed for image preprocessing. In general, the available types of MR sequences may vary among healthcare institutions. Thus, we employed ensemble learning to handle multiple sequences, allowing the use of single or multiple sequences in our framework. Three common metrics were adopted for classification performance evaluation, i.e., AUC, sensitivity (SEN), and specificity (SPE). Table 4 lists the classification results (i.e., csPCa or non-csPCa). Seven sequence combinations were involved for comparisons. The former and the latter domains in the table denote the source and target domains, respectively. We define such pairs of domains/cohorts as DA settings.

First, we compared CMD2A-Net with the separate and joint models (in Table 2) in terms of AUC. Take the first DA setting (P-x → LC-A) as an example. In the T2 sequence, CMD2A-Net achieves an AUC of 0.87 in the target domain (i.e., LC-A), outperforming both the separate model (AUC: 0.61) and the joint model (AUC: 0.67). Consistent findings can be observed in ADC and hDWI. In the other three DA settings, CMD2A-Net also demonstrates its advantage in resolving domain shift between two of our datasets. This validates our hypothesis that incorporating prostate lesion information prior to the DA process can facilitate PCa classification.
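The three metrics and the idea of ensembling per-sequence predictions can be illustrated with a small sketch. The text does not state how the sequence-specific outputs are combined, so simple averaging of csPCa probabilities (soft voting) is assumed here; the function names, the 0.5 decision threshold, and the toy probabilities are likewise hypothetical and are not results from the study.

```python
import numpy as np

def ensemble_probs(per_sequence_probs):
    """Average csPCa probabilities across sequences (soft voting; an assumption)."""
    return np.mean(np.asarray(per_sequence_probs, dtype=float), axis=0)

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """SEN = TP / (TP + FN), SPE = TN / (TN + FP) at a fixed threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_prob, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    return float(tp / (tp + fn)), float(tn / (tn + fp))

def auc(y_true, y_prob):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    pos, neg = y_prob[y_true], y_prob[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size))

# Hypothetical per-sequence probabilities for six lesions (1 = csPCa).
labels = [1, 1, 0, 0, 1, 0]
t2 = [0.9, 0.6, 0.4, 0.2, 0.7, 0.5]
adc = [0.7, 0.5, 0.5, 0.3, 0.6, 0.2]
hdwi = [0.8, 0.6, 0.3, 0.4, 0.5, 0.2]

ens = ensemble_probs([t2, adc, hdwi])
print(auc(labels, ens), sensitivity_specificity(labels, ens))
```

Because the ensemble takes any non-empty list of sequences, the same function covers the single-sequence case and each of the multi-sequence combinations.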
Second, we analyzed our model's PCa classification performance using a single sequence, i.e., T2, ADC, or hDWI. In most source-target DA settings, T2 is the most effective, while ADC receives the lowest AUC. The hDWI sequence shows unstable performance across the four DA settings. For example, it accomplishes the best performance (with respect to AUC, SEN, and SPE) in "P-x → LC-B," but underperforms T2 and ADC in "LC-A → LC-B." This could be caused by heterogeneous b-values among the domains. As shown in Table 1, b-values of 50, 400, and 800 s mm⁻² were employed in P-x, while 0 and 1400 s mm⁻² were used in LC-A, and 1000 and 1400 s mm⁻² were used in LC-B. Thus, the significant discrepancies in the acquisition parameters likely explain the inconsistent performance of hDWI. Note that there were no widely accepted guidelines regarding b-value until the release of PI-RADS in 2019, which recommended a minimum value of 1200 s mm⁻².
We also investigated the effect of ensemble learning using multiple sequences, which could provide a reference for choosing appropriate sequences for PLDC. In each DA setting, the models using multiple sequences are always more effective than those relying on a single sequence. Moreover, although ADC or hDWI always leads to the worst classification results, ensembling T2 with one or both of them clearly enhances the model's performance. This finding is consistent with the clinical practice of using mpMRI for PCa diagnosis, in which ADC and hDWI are usually regarded as secondary references by radiologists. It should be noted that the all-sequence-ensembled models (i.e., ensembles of T2, ADC, and hDWI) show strong predictive performance in most DA settings. Although an ensemble of the three sequences does not yield the best performance in the second DA setting (i.e., P-x → LC-A), the model still achieves a remarkable AUC of 0.91, which is only about 0.01 lower than the highest AUC (0.92). These results suggest that using more sequences helps multicohort MRI harmonization, thus boosting the final classification performance. Moreover, with the same target domain (i.e., either LC-A or LC-B), CMD2A-Net transferred from P-x attains a higher AUC than when transferred from a local cohort domain in each sequence combination. This implies that more source samples could enhance the model's cross-domain knowledge transferability, thus improving its generalization in the target domain. The superior performance also demonstrates CMD2A-Net's capability of transferring knowledge from a public dataset to our local cohort domains.

Two DA settings (i.e., P-x to LC-A, and P-x to LC-B) were selected as representatives for lesion detection evaluation. Results of the all-sequence-ensembled method were selected as a representative for analysis. In the correctly classified examples, the coarse lesion contours encircle the lesion ground-truth point in all sequences (as shown in Figure 3a). However, in the misclassified examples, the coarse lesion position could not be precisely detected in most sequences, as shown in the third row. In the example of LC-A, the lesion on the T2 image was correctly detected, but the lesion contours on the ADC and hDWI maps were falsely identified. A possible reason is that the coarse lesion masks applied as the training ground truth could not depict the actual lesion contours accurately. Therefore, accurate detection on ADC and hDWI also plays a role in enhancing the ensembled classification, although lesion detection generally relies heavily on T2 images. In the future, robust weak label processing methods (e.g., the deep extreme level set evolution method[39]) will be employed. For the example from LC-B, undersegmentation of the prostate region can be found on the T2 image, which could lead to failed lesion detection. As the prostate regions on ADC and hDWI were transformed using T2, under- or oversegmentation of the prostate gland on T2 would deteriorate the lesion detection in the other two sequences. Despite the inaccurate lesion detection on ADC and hDWI, it should be noted that the models with multisequence input still outperform the models using T2 alone in lesion classification, owing to the reuse of prostate features from ADC and hDWI.

Comparisons with the State-of-the-Art Methods
We compared our model with three state-of-the-art models using AUC, i.e., ResNet50,[40] DANN,[41] and Deep CORAL.[25] The dataset P-x was used as the source domain. Our local cohort datasets, LC-A and LC-B, acted as the target domains. The individual (i.e., T2, ADC, and hDWI) and the ensembled (i.e., T2 + ADC + hDWI) sequences were involved. The other ensembled sequences, T2 + ADC, T2 + hDWI, and ADC + hDWI, were not involved here due to their inferior performance as discussed in Section 4.2. Detailed comparison results are summarized in Table 5.
Figure 3. The all-sequence-ensembled (T2, ADC, hDWI) approach is employed. In the lesion detection results (first and second rows), the ground-truth lesions are pointed in green. The predicted coarse lesion regions are colored in yellow. A promising prediction of the lesion region, i.e., one containing the ground truth in all sequences, can yield higher classification correctness, as in (a). Moreover, undersegmented prostate regions marked with yellow boxes/contours (i.e., the example of LC-B in the third row) would also worsen the classification outcome.

ResNet50 is a common classification model. It was pretrained in the source domain, and then fine-tuned and tested in the target domain; therefore, no DA was utilized. In Table 5, ResNet50 underperforms all other methods in all sequences. A possible reason may be the weak cross-domain knowledge transferability of the fine-tuning strategy. This shows the advantage of domain adaptation methods over fine-tuning. Another model, DANN, is based on adversarial learning and has been widely employed in lesion assessment. It can extract low-level features from the entire image. Deep CORAL was also included, which leverages domain knowledge transfer by aligning the second-order statistics. Similar to DANN, it also adopts a common encoder for feature extraction from the input of a whole image slice. Comparatively, our model fuses both lesion features and prostate features for effective DA, rather than extracting the prostate features alone. We "strengthened" the point labels to be coarse mask labels, such that features, particularly lesion features, can be robustly aligned for DA using the mask labels. In Table 5, CMD2A-Net outperforms the two UDA models in all the sequences in terms of AUC, indicating the effectiveness of our model in cross-domain feature harmonization and its advantage in prostate lesion classification. It is worth noting that all four models accomplish their highest AUCs using the ensembled sequence. A consistent conclusion can be found in Section 4.2, showing the benefits of the all-sequence-ensembled method again.
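To make "aligning the second-order statistics" concrete, the sketch below computes the CORAL loss as commonly stated for Deep CORAL: the squared Frobenius distance between the source and target feature covariance matrices, scaled by 1/(4d²) for d-dimensional features. It is a NumPy illustration on random features, not the comparison code used in this study; the function name and the toy data are assumptions.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """CORAL loss between two feature batches of shape (n_samples, d).

    Computes ||C_s - C_t||_F^2 / (4 d^2), where C_s and C_t are the
    (unbiased) feature covariance matrices of the two batches.
    """
    source_feats = np.asarray(source_feats, dtype=np.float64)
    target_feats = np.asarray(target_feats, dtype=np.float64)
    d = source_feats.shape[1]
    cov_s = np.cov(source_feats, rowvar=False)  # columns are feature dimensions
    cov_t = np.cov(target_feats, rowvar=False)
    return float(np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d))

# Toy features: the "target" batch is the source batch with 3x larger spread.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))
target = 3.0 * source
print(coral_loss(source, target))
```

Note that the loss depends only on covariances: adding a constant offset to one domain leaves it unchanged, so in a network it is used alongside a classification loss rather than on its own.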

Visualization of Sample Distribution and Ablation Study
Apart from AUC, we also visualized the sample distributions of the source and target domains to show intuitively how domain shift is handled. P-x and LC-A were adopted to visualize the data distribution before and after DA. The t-SNE algorithm[42] was employed to visualize the data distributions of all sequences, i.e., T2, ADC, and hDWI. Fifty mpMRI cases from each dataset were randomly chosen. As shown in Figure 4a-c, obvious clustering can be observed before DA in each sequence, indicating severe domain shift between the two domains. After CMD2A-Net training (i.e., DA), domain-invariant features were extracted by the well-trained model. After DA, samples from the two cohorts are evenly distributed for each sequence, indicating that CMD2A-Net achieves feature alignment on the heterogeneous mpMRI sequences.
To carry out the ablation study, we selected two key components, i.e., the coarse segmentation module and the domain transfer module, to analyze their contribution to lesion malignancy classification using T2 images. We compared our CMD2A-Net with its two variants using AUC, i.e., 1) CMD2A-Net excluding the domain transfer module (i.e., CM-Net, shown in the black dashed box in Figure 5) and 2) CMD2A-Net excluding the coarse segmentation modules (D2A-Net). As the CM-Net does not contain DA modules, it was trained in the source domain, and then fine-tuned and tested in the target domain. The datasets P-x and LC-A were selected as the source and target domain, respectively. D2A-Net obtains a lower AUC (0.65) compared with CMD2A-Net (0.87). This suggests that the coarse segmentation module is essential for domain-invariant feature extraction between domains. This also supports our hypothesis that the coarse lesion maps would enhance the malignancy classification accuracy. CM-Net obtains an AUC of 0.67, also less than CMD2A-Net. This indicates that the domain transfer module can substantially mitigate domain shift, thus enhancing CMD2A-Net's PCa classification performance.

The sensitivity to the loss parameters was also analyzed. CMD2A-Net was trained using P-x (source domain) and LC-A (target domain). Hyperparameters α and β (i.e., weighting parameters of the total loss) in Equation (6) would substantially influence the model's generalization ability. The two hyperparameters cannot be learned and were preset prior to model training. They were used to balance the contributions of the three network modules, such that joint optimization on all modules can be realized, thus facilitating the training process to reach equilibrium. Therefore, we manually tuned the hyperparameters in {0, 0.1, 0.05, 0.1, 0.5, 1.0, 5.0} to analyze the modules' contribution to lesion malignancy classification. As shown in Figure 4d, a group of hyperparameter settings yields an optimal model. We can observe that our model demonstrates superior classification performance with α within [0.1, 1.0] and β within [0.05, 1.0]. It should be noted that our model receives the lowest AUC when either α or β is set to 0, showing that the coarse segmentation module and the domain transfer module both enhance cross-domain knowledge transferability, thus improving lesion classification accuracy.

Conclusion
In this article, we address the issue of performance heterogeneity in target domains arising from real-world usage across multiple sites/cohorts. We present a fully automated framework for mpMRI-based prostate lesion assessment. The framework involves a Mask R-CNN network to pre-crop the prostate region, and a novel UDA network (i.e., CMD2A-Net) for coarse lesion detection and malignancy classification (i.e., csPCa or non-csPCa). By introducing weakly supervised coarse segmentation modules, CMD2A-Net can incorporate both the prior lesion features and prostate features into the domain knowledge transfer process, yielding robust feature alignment between heterogeneous datasets. No labeling is necessary for the target domain. CMD2A-Net serves as a general UDA model primarily designed for PLDC, which could also be applied in other lesion assessment tasks (e.g., liver tumors). Its PLDC performance has been evaluated on the datasets P-x, LC-A, and LC-B. The models with multisequence input accomplish higher AUC than any model using a single sequence only. The all-sequence-ensembled (T2, ADC, and hDWI) model demonstrates the best PCa classification performance w.r.t. AUC, SEN, and SPE. Additionally, when P-x acts as the source domain, the model ensembled with all three sequences accomplishes an AUC of 0.921 in LC-A and 0.913 in LC-B, demonstrating its transferability from a large-scale public dataset P-x (330 cases) to our small-scale local cohorts (LC-A with 74 cases and LC-B with 108 cases). Experimental results also show that our model accomplishes higher AUC in PCa malignancy classification, compared to the state-of-the-art models, Resnet50, DANN, and Deep Coral. Other experimental results, including an ablation study and visualization of data distribution, further support the effectiveness of CMD2A-Net in domain adaptation. It is worth noting that our open-sourced system can be downloaded from GitHub (https://github.com/jdai019/domain-adaptation-lesion-assessment.git). It is capable of streamlining the PLDC in an end-to-end manner without requiring manual prostate segmentation and annotation. To our knowledge, this is the first PLDC executable system available online for open usage that is deep-learning-based and trained with multicohort mpMRI sequences. In our future work, we will address a few limitations. Currently, lesions are distributed in different prostate zones (e.g., transition zone and peripheral zone). We will incorporate the prostate zones as input parameters to our model, in order to attain a higher AUC for prostate lesion assessment. Deep learning will also be used to facilitate effective feature extraction for the prostate zones.

Experimental Section
Weakly Supervised Coarse Lesion Detection: We employed Mask R-CNN [43] to crop the prostate region accurately for PLDC. When a sample is fed in, the prostate region and the remaining areas are separated as foreground (also known as the image mask) and background, respectively. In general, the T2 sequence is necessary for the model input, while ADC and hDWI are optional. A circumscribed rectangle (as the bounding box) of the detected prostate contour marks out the regions of interest (ROIs). The prostate mask on T2 images can also be applied to the other accompanying input images (e.g., ADC, hDWI) through coordinate transformation, to obtain their corresponding ROIs. In comparison with Yang et al. [2], who used a five-layer shallow segmentor, we chose a deeper feature extractor, i.e., Resnet50, [40] such that more complex features can be learned for more accurate prostate detection. Besides, a multiscale deep spatial feature extraction module, i.e., Feature Pyramid Networks, [44] was utilized to deal with FoV differences and varying prostate cross-sectional size.
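The ROI extraction described above can be illustrated with a minimal sketch. It assumes that the prostate mask is a binary NumPy array and that the "coordinate transformation" to ADC or hDWI is a plain rescaling between image grids sharing the same field of view; the function names `bounding_box`, `map_box`, and `crop` are hypothetical and are not taken from the released code.

```python
import numpy as np


def bounding_box(mask):
    """Circumscribed rectangle of a binary mask as (r0, r1, c0, c1), inclusive."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        raise ValueError("mask is empty")
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return int(r0), int(r1), int(c0), int(c1)


def map_box(box, src_shape, dst_shape):
    """Rescale a box from the T2 grid to another sequence's grid.

    Assumption: both images cover the same field of view, so only the
    pixel grid differs. Real data may need a full registration step.
    """
    r0, r1, c0, c1 = box
    scale_r = dst_shape[0] / src_shape[0]
    scale_c = dst_shape[1] / src_shape[1]
    return (
        int(np.floor(r0 * scale_r)),
        int(np.ceil((r1 + 1) * scale_r)) - 1,
        int(np.floor(c0 * scale_c)),
        int(np.ceil((c1 + 1) * scale_c)) - 1,
    )


def crop(image, box):
    """Cut the ROI (inclusive box) out of an image."""
    r0, r1, c0, c1 = box
    return image[r0:r1 + 1, c0:c1 + 1]
```

In this sketch the box is computed once on the T2 mask and then reused for the other sequences, which mirrors the dependence of the ADC and hDWI ROIs on the T2 segmentation.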
Since fine delineation of the lesion region (e.g., pixel-level labeling) is time-consuming and demanding even for professionals, prostate lesions are commonly marked with a typical weak label, [45] i.e., a point label. In general, weak labels are widespread in real-world applications. Recent work has explored various forms of weakly supervised labeling to alleviate the annotation effort, including point annotations, scribbles, and bounding boxes. [46] To accomplish prostate lesion detection, pixel-level labels are required prior to model training. However, the point label is insufficient to represent the prostate lesion area for training, as the lesion area not marked or pinpointed with such a point label would probably be miscategorized as healthy tissue.
We attempted to "strengthen" the existing point-level labels to coarse lesion areas by aggregating their neighboring pixels into a region through preprocessing. Such preprocessed areas would be comparable to "strong" labels (i.e., manually labeled pixel-level contours), providing promising cues for lesion detection model training. Recently, Kiraly et al. [27] expanded the single marked pixel to a small-diameter circle using Gaussian kernels, but such a processing method focuses on lesion localization rather than contour approximation. Here, we applied a more sophisticated weak label processing method, i.e., distance regularized level set evolution, [47] to automatically generate a pixel-level weak label, i.e., a coarse mask label (in Figure 5). This level set method is an edge-based active contour approach. The labels can be produced in three steps: 1) initialize a level set function to represent the lesion contour originating from a manually marked point; 2) expand the lesion contour outward and update the level set function; and 3) terminate the expansion and finalize the function once the predefined number of iteration steps is exceeded. As shown in Figure 1, several examples of the coarse mask labels were automatically generated using such a level set method. The coarse mask contours were annotated (in red) on the cropped prostate regions (second and fourth rows). Therefore, the weak labels were "strengthened" from points to coarse lesion areas through preprocessing. The weak supervision would significantly reduce the time needed for accurate pixel-level annotation by experts, so as to enable coarse lesion detection and enhance malignancy classification.
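The three-step structure above (initialize at the marked point, expand outward, stop after a fixed number of iterations) can be illustrated with a deliberately simplified stand-in. The sketch below is not distance regularized level set evolution: it swaps in an intensity-constrained region growing, and the function name `coarse_mask_from_point` and the parameters `tol` and `max_iters` are hypothetical.

```python
import numpy as np


def coarse_mask_from_point(image, seed, tol=0.15, max_iters=20):
    """Grow a coarse region from a point label (simplified stand-in, NOT DRLSE).

    image: 2D array of intensities; seed: (row, col) of the point label.
    tol and max_iters are illustrative parameters, not values from the paper.
    """
    img = np.asarray(image, dtype=float)
    # Step 1: initialise the region at the manually marked point.
    region = np.zeros(img.shape, dtype=bool)
    region[seed] = True
    # Pixels allowed to join: intensity close to the seed intensity.
    similar = np.abs(img - img[seed]) <= tol
    for _ in range(max_iters):
        # Step 2: expand the region outward by one pixel (4-neighbourhood).
        grown = region.copy()
        grown[1:, :] |= region[:-1, :]
        grown[:-1, :] |= region[1:, :]
        grown[:, 1:] |= region[:, :-1]
        grown[:, :-1] |= region[:, 1:]
        grown &= similar
        if np.array_equal(grown, region):
            break  # nothing left to add
        region = grown
    # Step 3: expansion ends once max_iters is reached (or growth stalls).
    return region
```

A true level set formulation evolves a continuous function under edge-based forces and a regularization term, so its contours are smoother than those of this pixel-wise growth; the sketch only conveys the seed, expand, terminate loop.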
Figure 5 illustrates the network architecture of the proposed CMD2A-Net. The coarse segmentation module outputs the coarse lesion contour and also enables local feature extraction on lesion regions. Provided with more lesion features, the domain transfer module is introduced to facilitate feature alignment. A classifier module is incorporated for malignancy prediction. CMD2A-Net is trained on the three sequences (i.e., T2, ADC, and hDWI) individually. Based on the model output (i.e., lesion malignancy probability) of the three sequences, we can obtain the final malignancy predictions using ensemble learning. CMD2A-Net has two parallel branches with respect to (w.r.t.) the source and target domains, where two encoders extract features of prostate MR images separately in the two domains. The segmentors from the two domains share the same weights. The source segmentor is optimized by a supervised loss function (i.e., coarse lesion segmentation loss). Samples and coarse mask labels from the source domain are required for training. The segmentation loss $L_{Seg}$ is computed from the mask label $S$ and the predicted lesion map $M$, whose pixel element values are denoted by $s_{i,j}$ and $m_{i,j}$, respectively. Indices $i$ and $j$ denote the $i$th column and $j$th row of the image matrix of dimension $w \times h$. A constant $\varepsilon$ (set to $10^{-5}$) is applied to avoid the zero-denominator case, as well as to guarantee numerical stability.

Attention-Based Malignancy Estimation: In recent studies of prostate lesion classification (e.g., Guan et al. [28]), lesion identification was suggested to be highly associated with disease-related regions in MR images. Instead of treating all pixels in the entire MR slice equally, an attention mechanism can be introduced to specifically extract lesion features. With these insights, we hypothesized that incorporating the prior knowledge of lesion regions into the DA process could enhance the model's classification performance. As illustrated in Figure 5, the two branches follow the same pipeline to generate attention feature maps. In each branch, the attention map can be produced using the prostate region and the coarse lesion mask, enabling our model to focus on the lesion region and also extract more lesion representations. The prostate region and the coarse lesion mask are denoted as $P$ and $M$, respectively. Note that the subscripts "s" and "t" of variables (e.g., $M_s$ and $M_t$) in Figure 5 represent the source and target domains, respectively. The attention maps of the source and target domains, $A_s$ and $A_t$, respectively, are calculated from $P$ and $M$ of the corresponding domain using the element-wise product, denoted by $\circ$, and the sigmoid function, denoted by $\sigma$, which is adopted as the nonlinear activation to generate attention maps. Such a simple but effective function can constrain each element of the feature maps in [0, 1], thus weighting the importance of regions. As a result, guided by coarse mask labels, the lesion areas would be assigned higher weights than the noninformative background (i.e., healthy tissue) in the feature maps.
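The segmentation loss and the attention map described above can be illustrated numerically. The sketch below rests on two assumptions that the text does not confirm: the segmentation loss is taken to be a soft Dice-style loss (suggested by the role of ε in the denominator), and the attention map is taken to be σ(P ∘ M). The function names are hypothetical.

```python
import numpy as np

EPS = 1e-5  # value of epsilon quoted in the text


def dice_style_loss(mask_label, predicted_map, eps=EPS):
    """Soft Dice-style loss between S and M (assumed form, not confirmed)."""
    s = np.asarray(mask_label, dtype=float)
    m = np.asarray(predicted_map, dtype=float)
    overlap = np.sum(s * m)
    # eps keeps the ratio defined even when both maps are empty.
    return 1.0 - (2.0 * overlap + eps) / (np.sum(s) + np.sum(m) + eps)


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def attention_map(prostate_region, lesion_mask):
    """Assumed form A = sigmoid(P ∘ M), with ∘ the element-wise product."""
    p = np.asarray(prostate_region, dtype=float)
    m = np.asarray(lesion_mask, dtype=float)
    return sigmoid(p * m)
```

Under this assumed form, pixels outside the coarse mask receive a weight of 0.5 and pixels inside it receive a larger weight whenever the underlying intensity is positive, which matches the stated intent of weighting lesion areas above the background.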
To achieve accurate lesion classification, features from the lesion attention maps can be extracted by an encoder, such that high-level lesion features can be captured for the classifier module. Thus, in each branch, an encoder is incorporated after the segmentor to extract each domain's specific features. Besides, we proposed to fuse the lesion features and the prostate features to boost the classification accuracy. Skip connection and concatenation operations are introduced to reuse prostate features from the segmentors.
We designed a domain transfer module (in Figure 5) that requires no target labels in the training process. The semantic features from both the prostate region and the attention map are fused, such that deep coral features from fully connected (FC) layers can be captured for feature affinity. Deep Coral loss [25] is employed to minimize cross-domain feature distribution discrepancy, owing to its generality, transferability, and ease of implementation. It is defined as the difference of second-order covariances between domains. Our domain transfer loss $L_{Coral}$ is defined as

$$L_{Coral} = \sum_{i=1}^{l} \lambda_i \frac{1}{4 d_i^2} \left\| C_s^i - C_t^i \right\|_F^2$$

where $l$ indicates the number of FC layers. Constants $\lambda_i$, $i = 1, 2, \ldots, l$, are the weights that balance the contribution of FC layers, which are set to 1 here. The squared matrix Frobenius norm is denoted as $\| \cdot \|_F^2$. The dimension of the $i$th FC layer is indicated by $d_i$. The feature covariance matrices of the source and target domains, $C_s^i$ and $C_t^i$, respectively, can be calculated by

$$C_s^i = \frac{1}{n_s - 1} \left( (D_s^i)^\top D_s^i - \frac{1}{n_s} (\mathbf{1}^\top D_s^i)^\top (\mathbf{1}^\top D_s^i) \right), \qquad C_t^i = \frac{1}{n_t - 1} \left( (D_t^i)^\top D_t^i - \frac{1}{n_t} (\mathbf{1}^\top D_t^i)^\top (\mathbf{1}^\top D_t^i) \right)$$

where $n_s$ and $n_t$ denote the number of images in the corresponding domain, $D_s^i$ and $D_t^i$ indicate the feature matrices of the corresponding FC layer, and $\mathbf{1}$ is a column vector with all elements equal to 1.
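As a numerical illustration of the domain transfer loss, the sketch below follows the standard Deep Coral formulation: a squared Frobenius distance between source and target feature covariances, scaled by 1/(4d²) and summed over FC layers with weights λ. It is written in NumPy rather than Keras, and the function names `covariance` and `coral_loss` are hypothetical.

```python
import numpy as np


def covariance(features):
    """Feature covariance of an (n, d) matrix: n images, d FC-layer units."""
    d_mat = np.asarray(features, dtype=float)
    n = d_mat.shape[0]
    ones = np.ones((n, 1))           # column vector with all elements equal to 1
    col_sum = ones.T @ d_mat         # shape (1, d)
    return (d_mat.T @ d_mat - (col_sum.T @ col_sum) / n) / (n - 1)


def coral_loss(source_layers, target_layers, weights=None):
    """Sum over FC layers of lambda_i / (4 d_i^2) * ||C_s - C_t||_F^2."""
    if weights is None:
        weights = [1.0] * len(source_layers)  # lambda_i = 1, as in the text
    total = 0.0
    for lam, d_s, d_t in zip(weights, source_layers, target_layers):
        d = np.asarray(d_s).shape[1]
        diff = covariance(d_s) - covariance(d_t)
        total += lam * np.sum(diff ** 2) / (4.0 * d * d)
    return float(total)
```

Because the loss uses only batch-level covariances of unlabeled features, it can be evaluated on target images without any category or mask label.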
To accomplish malignancy prediction using mpMRI, an ensemble learning approach is employed to fuse the predictions of the three separate models (w.r.t. T2, ADC, and hDWI). We trained the classifier module, as in Figure 5, using labeled source data. The FC layers in the source domain are employed, not only for cross-domain feature affinity, but also for malignancy classification. The cross-entropy loss is utilized to optimize the classifier module. Our classification loss $L_{Cls}$ is the cross-entropy between $\hat{y}_i^s$ and $r$, which denote the ground truth and the malignancy prediction w.r.t. each source sample, respectively.
The ultimate purpose of CMD2A-Net is to accomplish accurate PLDC. To this end, we simultaneously trained the coarse segmentation module, domain transfer module, and classifier module. Note that minimizing the segmentation loss alone would cause overfitting to the source domain, and optimizing only the domain transfer loss would lead to generalization degradation in the target domain. Therefore, joint optimization on the total loss could facilitate the training process to reach equilibrium, such that the domain-invariant features could be extracted to achieve accurate classification. The total loss $L_{Total}$ is defined as

$$L_{Total} = L_{Cls} + \alpha L_{Seg} + \beta L_{Coral} \tag{6}$$

where $\alpha$ and $\beta$ are weighting hyperparameters of the total loss. Both of them were set to 0.5 in our experiments.
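The joint objective can be sketched in a few lines. The example takes α as the weight of the segmentation loss and β as the weight of the domain transfer loss, and it assumes a binary cross-entropy averaged over source samples for the classification term; both choices are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy (assumed form of the classification loss)."""
    total = 0.0
    for y, r in zip(labels, predictions):
        r = min(max(r, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(r) + (1 - y) * math.log(1.0 - r)
    return total / len(labels)


def total_loss(l_cls, l_seg, l_coral, alpha=0.5, beta=0.5):
    """Weighted sum of the three module losses.

    Assumption: alpha scales the segmentation loss and beta the domain
    transfer loss; the defaults of 0.5 are the values quoted in the text.
    """
    return l_cls + alpha * l_seg + beta * l_coral
```

Setting `alpha=0` or `beta=0` removes the corresponding module from the objective, which corresponds to the two degenerate settings examined in the loss sensitivity analysis.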
To leverage the benefits of multiple sequences, we utilized a weighted average ensemble learning-based method. The outputs of the three separate models were incorporated, thus contributing to the final ensemble prediction $r_{ens}$ as follows

$$r_{ens} = \frac{r_T + \omega_A r_A + \omega_B r_B}{1 + \omega_A + \omega_B}$$

where $r_T$, $r_A$, and $r_B$ are the malignancy probability predictions of T2, ADC, and hDWI, for which the weights are 1, $\omega_A$, and $\omega_B$, respectively. Binary variables $\omega_A, \omega_B \in \{0, 1\}$ are assigned based on the availability of ADC and hDWI. For example, if the samples include ADC but not hDWI, $\omega_A = 1$ and $\omega_B = 0$.

Implementation Details: Our models (i.e., the Mask R-CNN model, CM-Net, and CMD2A-Net) were trained using a GeForce GTX 1080 Ti GPU (Nvidia, California, USA) with the Keras API. [48] For the Mask R-CNN model training, data augmentation with random rotation was applied on the 646 T2 image slices from I2CVB. All the slices were split into training, validation, and testing sets in the ratio of 7:2:1. The input shape of Mask R-CNN was set to 512 × 512 pixels. The Adam optimizer was applied with a learning rate of $10^{-3}$. The batch size was set to 4 and the model was trained for 200 epochs. During the training process, the model with the highest dice coefficient score on the validation set was retained. For CM-Net and CMD2A-Net training, the prostate regions from P-x, LC-A, and LC-B were scaled to 224 × 224 pixels. Random rotation of {±3°, ±6°, ±9°, ±12°, ±15°} was applied for data augmentation. The Adam optimizer was chosen, and its learning rate was set to $10^{-5}$. The batch size was set to 2.
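The weighted-average ensemble rule described earlier in this section can be sketched as follows. The example assumes the usual normalized form (weighted sum of the predictions divided by the sum of the weights) and represents a missing sequence by `None`; the function name `ensemble_prediction` is hypothetical.

```python
def ensemble_prediction(r_t2, r_adc=None, r_hdwi=None):
    """Weighted average of per-sequence malignancy probabilities.

    T2 always has weight 1. ADC and hDWI get weight 1 when a prediction is
    available and 0 otherwise (pass None for a missing sequence).
    """
    w_a = 0 if r_adc is None else 1
    w_b = 0 if r_hdwi is None else 1
    weighted_sum = r_t2
    if w_a:
        weighted_sum += r_adc
    if w_b:
        weighted_sum += r_hdwi
    return weighted_sum / (1 + w_a + w_b)
```

For instance, a case with T2 and ADC predictions of 0.8 and 0.4 but no hDWI would be scored as (0.8 + 0.4) / 2 under this form.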
In the training process of CM-Net, due to the limited sample size, all the slices were split into training and testing sets in the ratio of 4:1 using the hold-out method. The segmentation loss was optimized first to accelerate model convergence, and CM-Net with the pretrained coarse segmentation module was further trained. In terms of CMD2A-Net, we initialized both of its branches first using the weights of the pretrained CM-Net, in order to facilitate its convergence. To be specific, we trained both the coarse segmentation module and classifier of CM-Net first, with the combined samples from both domains. Then, we optimized the total loss of CMD2A-Net with labeled source samples and unlabeled target samples. By cotraining all the modules, the model with the highest accuracy was saved for malignancy evaluation in the target domain.
We also made our executable code and files available online via GitHub, so as to allow extension or application of this work by others. This open-sourced deep-learning-based model acts as an end-to-end system, taking prostate mpMRI sequences (i.e., T2, ADC, and hDWI) as input and producing prediction results (i.e., prostate segmentation, coarse lesion detection, and malignancy estimation) as output. The system supports multiformat inputs, including DICOM, jpeg, png, and jpg files. It is emphasized that no manual prostate segmentation or annotation is required.

Figure 1. Non-csPCa and csPCa mpMRI examples (from P-x and LC-A) of a) T2, b) ADC, and c) hDWI. The prostate gland is contoured with rectangles (in green) on the original slices (first and third rows). The coarse lesion is contoured in red within the cropped prostate regions (second and fourth rows) using the level set method, showing the morphological discrepancy between the benign and malignant lesions. Apparent intersite heterogeneity (e.g., FoV, image intensity distribution) of the samples demonstrates domain shift between P-x and LC-A.

Figure 3 shows coarse lesion detection results of the accurately classified and misclassified examples. Two DA settings (i.e., P-x to LC-A, and P-x to LC-B) were selected as representatives for lesion detection evaluation. Results of the all-sequence-ensembled method were selected as a representative for analysis. In the correctly classified examples, coarse lesion contours could encircle the lesion ground-truth point in all sequences (as shown in Figure 3a). However, in the misclassified examples, the coarse lesion position could not be precisely detected in most sequences, as shown in the third row. In the example of LC-A, the lesion on the T2 image was correctly detected, but the lesion contours on ADC and hDWI maps were falsely identified. A possible reason is that the coarse lesion masks applied as the training ground truth could not depict the actual lesion contours accurately. Therefore, we can observe that accurate detection on ADC and hDWI also plays a role in enhancing the ensembled classification, although lesion detection generally relies heavily on T2 images. In the future, robust weak label processing methods (e.g., the deep extreme level set evolution method [39]) will be employed. For the example from LC-B, undersegmentation of the prostate region can be found on the T2 image, which could lead to failed lesion detection. As the prostate regions on ADC and hDWI were transformed using T2, under/oversegmentation of the prostate gland on T2 would deteriorate the lesion detection in the other two sequences. Despite the inaccurate lesion detection on ADC and hDWI, it should be noted that the models with multisequence input

Figure 3. Coarse lesion detection results of a) accurately classified and b) misclassified examples in the target domains, LC-A and LC-B, relative to the source, P-x. The all-sequence-ensembled (T2, ADC, hDWI) approach is employed. In the lesion detection results (first and second rows), the lesions (ground truth) are marked in green. The predicted coarse lesion regions are colored in yellow. A promising prediction of the lesion region, i.e., one containing the ground truth in all sequences, can yield higher classification correctness, as in (a). Moreover, undersegmented prostate regions marked with yellow boxes/contours (i.e., the example of LC-B in the third row) would also worsen the classification outcome.

Figure 4. Sample distribution before and after DA for sequences a) T2, b) ADC, and c) hDWI using t-SNE. A similar change in distributions can be observed in all the sequences. Before DA, the samples of the source (red dots) and target (blue triangles) are dispersed in separate clusters, indicating severe domain shift. The mixed and even distribution after DA demonstrates the effectiveness of CMD2A-Net in feature alignment. d) Impact of the hyperparameters (i.e., α and β) in the loss sensitivity analysis.

Figure 5. Overview of the proposed CMD2A-Net using T2, ADC, and hDWI image inputs. The network for each image sequence features two parallel branches with respect to the source and target domains. The source branch contains three main modules: 1) a coarse segmentation module for coarse lesion detection and feature alignment enhancement; 2) a domain transfer module for knowledge transfer between domains; and 3) a classifier module for malignancy classification.

Table 1. Characteristics of the five MRI datasets for prostate segmentation and PLDC.

Table 2. Comparisons of AUC using separate and joint learning approaches.

Table 3. Comparisons of AUC using six image preprocessing methods.
The bold/shaded data indicate the maximum value in each column.

Table 4. Malignancy classification results in the target domains in four combinations of source-target domains.
The bold/shaded data indicate the maximum value in each column.

Table 5. AUC comparisons on malignancy classification (i.e., csPCa or non-csPCa) with the three existing models.