Improved unsupervised physics-informed deep learning for intravoxel-incoherent motion modeling and evaluation in pancreatic cancer patients

${\bf Purpose}$: Earlier work showed that IVIM-NET$_{orig}$, an unsupervised physics-informed deep neural network, was faster and more accurate than other state-of-the-art intravoxel-incoherent motion (IVIM) fitting approaches to DWI. This study presents: IVIM-NET$_{optim}$, overcoming IVIM-NET$_{orig}$'s shortcomings. ${\bf Method}$: In simulations (SNR=20), the accuracy, independence and consistency of IVIM-NET were evaluated for combinations of hyperparameters (fit S0, constraints, network architecture, # hidden layers, dropout, batch normalization, learning rate), by calculating the NRMSE, Spearman's $\rho$, and the coefficient of variation (CV$_{NET}$), respectively. The best performing network, IVIM-NET$_{optim}$ was compared to least squares (LS) and a Bayesian approach at different SNRs. IVIM-NET$_{optim}$'s performance was evaluated in 23 pancreatic ductal adenocarcinoma (PDAC) patients. 14 of the patients received no treatment between 2 repeated scan sessions and 9 received chemoradiotherapy between sessions. Intersession within-subject standard deviations (wSD) and treatment-induced changes were assessed. ${\bf Results}$: In simulations, IVIM-NET$_{optim}$ outperformed IVIM-NET$_{orig}$ in accuracy (NRMSE(D)=0.14 vs 0.17; NMRSE(f)=0.26 vs 0.31; NMRSE(D*)=0.46 vs 0.49), independence ($\rho$(D*,f)=0.32 vs 0.95) and consistency (CV$_{NET}$ (D)=0.028 vs 0.185; CV$_{NET}$ (f)=0.025 vs 0.078; CV$_{NET}$ (D*)=0.075 vs 0.144). IVIM-NET$_{optim}$ showed superior performance to the LS and Bayesian approaches at SNRs<50. In vivo, IVIM-NET$_{optim}$ showed less noisy and more detailed parameter maps with lower wSD for D and f than the alternatives. In the treated cohort, IVIM-NET$_{optim}$ detected the most individual patients with significant parameter changes compared to day-to-day variations. ${\bf Conclusion}$: IVIM-NET$_{optim}$ is recommended for accurate IVIM fitting to DWI data.


Introduction
The Intravoxel Incoherent Motion (IVIM) model (1) for diffusion-weighted imaging (DWI) shows great potential for estimating predictive and prognostic cancer imaging biomarkers (2)(3)(4)(5). In the IVIM model, DWI signal is described by a bi-exponential decay, of which one component is attributed to conventional molecular diffusion and the other to the incoherent bulk motion of water molecules, typically credited to capillary blood flow. Hence, IVIM provides simultaneously information on diffusion (D [mm 2 /s]; diffusion coefficient), capillary microcirculation (D * [mm 2 /s]; pseudo-diffusion coefficient) and the perfusion fraction (f [%]) without the use of a contrast agent (6)(7)(8). However, despite IVIM's great potential (2)(3)(4)(5), it is rarely used clinically. Two major hurdles preventing routine clinical use of IVIM are its poor image quality and the long time to fit the data (9)(10)(11). Tackling these shortcomings will help towards wider use of IVIM (12).
Recently, a promising alternative for IVIM fitting was introduced: estimating IVIM parameters with deep neural networks (DNNs). Initially, Bertleff et al. (17) introduced a supervised DNN for IVIM parameter estimation, in which the network was trained on simulated data for which the underlying parameters were known. However, the strong assumption of simulated training and test data being identically distributed could limit the network's performance in vivo, where noise behaves less ordered.
We solved this shortcoming in earlier work (11), where we used unsupervised physics-informed deep neural networks (PI-DNNs) (18,19). PI-DNNs formulate a physics-informed-loss-function that finds learned parameters through an iterative process. In this case, the PI-DNN used consistency between the predicted signal from the IVIM model and the measured signal as a loss term in the DNN. This resulted in an unsupervised PI-DNN capable of training directly on patient data with no ground truth: IVIM-NET orig . We demonstrated in both simulations and patient analysis that IVIM-NET orig is superior to the conventional LS approach and even performs (marginally) better than the Bayesian approach.
Furthermore, IVIM-NET orig 's fitting times were substantially lower (4 × 10 −6 seconds per voxel (11)) than the LS and Bayesian approaches. However, that proof of principle IVIM-NET study did not explore many hyperparameters and focused on volunteer data.
In this work, we hypothesize that IVIM-NET orig can be further improved by exploring the architecture of the network, its training features and other hyperparameters. To show this, we characterized the performance of IVIM-NET for different hyperparameter settings by assessing the accuracy, independence and consistency of the estimated IVIM parameters in simulated IVIM data. Finally, we compared the performance of our optimized IVIM-NET to LS and a Bayesian approach to IVIM fitting in patients with pancreatic ductal adenocarcinoma (PDAC) receiving neoadjuvant chemoradiotherapy (CRT), both in terms of test-retest reproducibility and sensitivity to treatment effects.

IVIM-NET
The code for all of our network implementations with some simple introductory examples in simulations and volunteer data are available on Github: https://github.com/oliverchampion/IVIMNET. We initially implemented the original PI-DNN (IVIM-NET orig ) (11) in Python 3.8 using PyTorch 0.4.1 (20). The input layer consisted of neurons that took the normalized DWI signal S(b)/S(b=0) as input, where S(b) is the measured signal at diffusion weighting b (b value). The input layer was followed by three fully connected hidden layers. Each hidden layer had a number of neurons equal to the number of measurements (b values and the number of repeated measures) and each neuron, in turn, contained an exponential linear unit activation function (21). The output layer of the network consisted of the three IVIM parameters (D, f, D * ). To enforce the output layer to predict these IVIM parameters, two steps were taken. First, the absolute activation function was taken of the neuron's output (X) to constrain the predicted parameters, e.g. to compute D: Second, a physics-based loss function was introduced that computed the mean squared error between the measured input signal, S(b), and the predicted IVIM signal S net (b), which was obtained by inserting the predicted output parameters into the normalized IVIM model. Hence: here B is the total number of image acquisitions (b values and repeated scans).
Next, we evaluated whether seven novel hyperparameters (Table 1; Figure 1) of IVIM-NET improved fitting results. First, instead of normalizing the input and predicted IVIM signals by S0, we added S0 as an additional output parameter, to allow the system to correct for noise in S(b=0). Second, to restrict parameter values to physiologically plausible ranges, scaled sigmoid activation functions instead of absolute activation functions were used to constrain the predicted parameters of the output layer (Table   1), e.g. to compute D: where D min and D max are the fit boundaries. Bound intervals were D: -0.4 × 10 -3 -5 × 10 -3 mm 2 /s, f: -5-70%, D * : 5 × 10 -3 -300 × 10 -3 mm 2 /s, and S0: 0.7-1.3. The range was chosen broader than expected in vivo to compensate for the decreasing gradients at the asymptotes of the sigmoid function. Third, we varied the number of hidden layers between 1 and 9. Fourth, we used dropout regularisation (22) which randomly removes a set percentage of network-weights each iteration during training. Fifth, we used batch normalization (23) which normalizes the input by re-centering and re-scaling, and, consequently, preserves the representation ability of the network. Sixth, to reduce unwanted correlation between estimated parameter values, we implemented an alternative network architecture in which parameter values were predicted, in parallel, by independent sub-networks (Table 1; Figure 1). Furthermore, we evaluated different learning rates (LR) of the Adam optimizer (24), ranging from 1 × 10 -5 to 3 × 10 -2 , and with constant β=(0.9, 0.999).
In traditional deep learning, training and evaluation are done on separate datasets, but as this is an unsupervised DNN approach, training was done on the same data as evaluation. So, for simulations, these were simulated data, and in vivo, these were in vivo data. The data were split into two datasets, one containing 90% of the data used for training; and one containing 10% of the data used for validation.
The validation set was used only to determine early stopping criteria. Early stopping occurred when the loss function did not improve over 10 consecutive training epochs. Given the large amount of training data and the limited number of network parameters, instead of feeding all the data at every epoch, we validated the network every 500 mini-batches. So, effectively the network saw 500 × 128 IVIM curves in between validations. Validation was done on the entire validation set. as both positive and negative deviations from zero are equally undesirable. Some networks always returned the same value for D * , independent of the input data (Supporting Information Figure S1). For such cases, ρ is technically undefined. As these cases are undesirable ρ was set to 1.

Simulations: characterization and optimization
As training a DNN is a stochastic process (random initialization, random dropout, random minibatches), training on the same dataset usually results in different final network-weights, and consequently, different predictions on the same data. To assess the consistency of estimated parameter values, each network variant was trained 50 times on identical data, where each repeat had a new random initialization, dropout and mini-batch selection. The normalized coefficient of variation of the network over the repeated simulations (CV NET ) was taken as a measure of the consistency of each estimated parameter: where x̄t rue is the mean value of the simulated IVIM parameter, i=1,…,n indicated the summation over the different decay curves with n the number of simulated curves, j=1,…,m indicates the summation over the repeated neural network trainings with m the number of repeated trainings, x i,j is the j th repeated prediction of the i th simulated decay curve, x̄i is the mean over the repeated m predictions of the ith simulated signal curve. As the LS and Bayesian approaches predict the same values for repeated fits to the same dataset, CV NET was zero for them.
As a result of the repeated training, we obtained 50 values for the NRMSEs and ρ's. Therefore, the median and interquartile range were reported.
As a baseline for comparison, we evaluated the IVIM parameters (D, f, D * ) in IVIM-NET orig , the LS and Bayesian approaches. We used the Levenbergh-Marquant non-linear algorithm for the LS fit (26,27). For the Bayesian approach, we used the algorithm from previous work (11). For both the LS and Bayesian approaches S0 was included as a fit parameter. The Bayesian approach used a data-driven lognormal prior for D and D * , and a beta distribution for f and S0. The prior distributions were determined empirically by fitting these distributions to the results from the LS approach on the same dataset. The maximum a posteriori probability was used as an estimate of the IVIM parameters. The LS and Bayesian approaches were performed with fit boundaries of D: 0 × 10 -3 -5 × 10 -3 mm 2 /s, f: 0-70%, D * : 5 × 10 -3 -300 × 10 -3 mm 2 /s, and S0: 0.7-1.3.
After baseline characterization, IVIM-NET was optimized by testing various combinations of the hyperparameters (Table 1; Figure 1). Previous studies reported reliable SNR values of IVIM in the abdomen between 10 and 40 (28)(29)(30)(31). So, to simulate reliable abdominal IVIM signals, an SNR of 20 was chosen for hyperparameter evaluation. We trained the network on the simulated signals using every combination of the following options: fitS0 parameters, absolute or sigmoid constraints, parallel network, dropout and batch normalization -while fixing the number of hidden layers to 3 (used in IVIM-NET orig , Table 1) and the LR to 1 × 10 -4 . In an exploratory phase, we found that reducing the LR from 1 × 10 -3 (IVIM-NET orig ) to 1 × 10 -4 was essential for obtaining networks with improvements in accuracy, independence and consistency. Based on these results, the best parameter combination was selected manually considering the combination of 1) NRMSE, 2) ρ and 3) CV NET .
With the best options for the fitS0 parameters, constraints, parallel network, dropout and batch normalization, we tested the performance of the network as a function of the LR and number of hidden layers ( Table 1). From those results, we finally selected the best performing optimized network: IVIM-NET optim .

Verification in patients with PDAC
To test how IVIM-NET optim performed in vivo two IVIM datasets of patients with PDAC were used: one to assess test-retest reproducibility, and one to see whether we can detect treatment effects. Both studies were approved by our local medical ethics committee and all patients gave written informed consent.
Both datasets (NCT01995240; NCT01989000) were published earlier (9,33,34). The first dataset consists of 14 patients with locally advanced or metastatic PDAC who underwent IVIM in two separate imaging sessions (average 4.5 days apart, range: 1-8 days) with no treatment in-between. The second dataset consisted of 9 PDAC patients with (borderline) resectable PDAC who received CRT as part of the PREOPANC study (35) where patients were scanned before and after CRT. DWI images were co-registered to a reference volume consisting of a mean DWI image over all b values using deformable image registration in Elastix (36). A radiologist (10 years' experience in abdominal radiology) and researcher (4 years' experience in contouring pancreatic cancer) drew a region of interest (ROI) in the tumor in consensus. IVIM parameter maps of D, f and D * were derived using the LS approach, Bayesian approach and IVIM-NET optim . Background voxels were removed automatically before fitting by removing voxels with S(b=0) < 0.5 × median(S(b=0)). Fitting was done without averaging over the diffusion directions. IVIM-NET optim was trained on all combined patient data. All computations were carried out on a single core of a conventional desktop computer (CPU: Intel Core i7-8700 CPU at 3.20 GHz). The average fitting time of each algorithm was recorded.
Further analysis was performed with the median parameter values from within the ROIs. To evaluate test-retest repeatability, intersession within-subject standard deviation (wSD) (37) was calculated for each IVIM parameter using the data from the patients with repeated baseline scans. Bland-Altman plots were plotted for patients from both cohorts. We calculated the 95% confidence intervals (95CI) from the patients with repeated scans at baseline (assuming zero offsets). In the cohort receiving treatment, we used a paired t-test to test whether parameters had significantly changed due to treatment within the cohort. Furthermore, patients from the treatment cohort were added to the Bland-Altman plots and individual patients who had changes exceeding the 95CI were considered to have significant changes in tumor microstructure (38).

Simulations: characterization and optimization
The original network, IVIM-NET orig , showed substantially lower NRMSE for all estimated parameters than the LS and Bayesian approaches. However, IVIM-NET orig had strong correlations between D * and f (high ρ(D * ,f); Table 2 and Figure 2D) and had considerable CV NET .
The NRMSE, ρ and CV NET for all hyperparameter combinations are shown in the Supporting Information Figures S2-S7. From this data, we selected IVIM-NET optim (Table 1) as optimal for IVIM-NET. IVIM-NET optim resolved the high dependency between D * and f found in IVIM-NET orig (Table 2; Figure 2D, E) and substantially reduced the NRMSE and CV NET . Figure 3 ilustrates the effect of changing a single option away from IVIM-NET orig (left side) or away from the selected optimum (right side). It is clear that the reduced ρ(D * ,f) cannot be attributed to a single parameter, but was a result of the combination of the proposed changes. For IVIM-NET orig , enabling the additional fit parameter S0, using sigmoid constraints and using the parallel network architecture all improved the NRMSE of all estimated IVIM parameters. Yet, the ρ(D * ,f) remains high for single deviations away from IVIM-NET orig and the network's consistency remains relatively poor (Figure 3). Single changes away from IVIM-NET optim can lead to marginally better NRMSE, lower ρ or lower CV NET (Figure 3), but only at a cost to the other two attributes. Supporting Information Figure S5-S7 show that the number of hidden layers did not have a substantial effect on the network's performance, although generally, an increase in the number of hidden layers resulted in a higher ρ, whereas a decrease resulted in higher NRMSE and CV NET . For the LR (Supporting Information Figure S5-S7), the order of magnitude was important as too high/low learning rates caused higher NRMSEs and less consistency.
Regardless of which b-value distribution was used, the abovementioned results hold and IVIM-NET optim outperformed IVIM-NET orig (Supporting Information Figure S8

Verification in patients with PDAC
The An example of parameter maps computed with the LS approach, Bayesian approach and IVIM-NET optim of a PDAC patient before receiving CRT is presented in Figure 5. The parameter maps of IVIM-NET optim were less noisy and more detailed than those computed by the LS and Bayesian approaches.
In the test-retest cohort, IVIM-NET optim showed the lowest wSD for D and f (Table 3), while the Bayesian approach had the lowest wSD for D * . When averaging IVIM parameters for the repeated patient scans, IVIM-NET optim computed a higher D, lower f and higher D * than the LS and Bayesian approaches ( Table 4). The repeated scans are visualized as black x's in the Bland-Altman plots, together with their 95% CIs in Figure 6.
When taking into account the CRT patients as a whole, IVIM-NET optim found a significant (P < 0.05) increase in mean f after treatment, whereas the LS approach found a significant increase in D after treatment. Although non-significant, IVIM-NET optim 's P-value for D was 0.07.

Discussion
This study is the first to show the potential clinical benefit of DNNs for IVIM fitting to DWI data in a patient cohort. We successfully developed and trained IVIM-NET optim , an unsupervised PI-DNN IVIM fitting approach to DWI that predicts accurate, independent and consistent IVIM parameters in Together with the fact that IVIM-NET optim detected an almost significant positive trend in D for the whole cohort of patients receiving CRT, these findings strongly suggest IVIM-NET optim as a good alternative for IVIM fitting in PDAC patients. Findings from other studies support this increase in D (39) and f (40) during CRT in PDAC patients. In general, PDACs tend to have lower diffusion due to the impeded water movement of compressing cells (41). Furthermore, PDACs are typically hypoperfused, due to significant tumor sclerosis creating elevated interstitial pressure, which compresses tumor feeding vessels (40,42). Effective treatment leads to necrosis, which in turn leads to lower cell densities and reduced interstitial pressure, and consequently increased diffusion (43,44) perfusion (45). Not all patients demonstrated a significant change for both diffusion and perfusion parameters induced by treatment. Therefore, using IVIM to discriminate between individual treatment effects may be feasible in the future. As the treatment of these patients was part of induction therapy and patients received surgery directly after, overall survival cannot be attributed purely to the CRT effect. Hence, given the limited number of patients and the diluted treatment effect, we did not compare overall survival between patients that showed potential treatment effects and others.
Our previous work (34) showed that the LS approach to IVIM fitting was sensitive to individual treatment effects. However, high wSD was found which limited the study to detect individual treatment effects. Furthermore, this work (34) used denoised DWI b-images that substantially degraded image sharpness and tumor boundaries were harder to detect (e.g. compare figures from this work to example figures from (9)). Conversely, our present study demonstrates that DNNs can estimate parameter maps directly from the noisy data resulting in sharp high-quality IVIM parameter maps.
Although IVIM-NET showed consistently better results both in simulations and in vivo, IVIM-NET predicts different IVIM parameters in repeated training. This causes a new sort of variability that, until now, was not an issue in fitting parameter maps. There may be methods to mitigate this variability.
First, when probing treatment response, we would advise using one network such that this additional effect is not different pre and post-treatment. Second, to reduce the variation, one could consider taking the median prediction from 10 repeated trainings instead. We did so in an exploratory study where we formed 5 groups of 10 networks and showed that the median of 10 networks was substantially more consistent, with CV NET values of 7.6 × 10 -3 , 6.3 × 10 -3 and 18.7 × 10 -3 for D, f and D * , respectively.
Having a set of networks will also allow the user to estimate the variation on the predicted parameter.
Finally, although we see this additional uncertainty, we would like to stress that it is secondary to the overall error of the LS approach, which is apparent from the fact that in the simulations, all 50 instances of IVIM-NET optim had lower NRMSE than the LS approach.
IVIM-NET optim was comparable or outperformed IVIM-NET orig at SNRs 8-100 and superior to the LS and Bayesian approach for SNRs 8-50 ( Figure 4). However, at extremely high SNR (SNR = 100; Figure   4), the LS approach outperformed IVIM-NET. The Levenberg-Marquardt algorithm for the LS function is an iterative function that finds a minimum of the squared difference. For a relatively smooth loss landscape and high SNR signal, the LS algorithm is designed to find the correct parameter estimates. However, at low SNR, the LS approach has trouble finding the correct parameters. This occurs either because the loss landscape is no longer smooth and hence it gets stuck in a local minimum, or, what we believe is more common, the noise has changed the signal such that the global optimum no longer is nearby the ground truth parameters. On the other hand, a DNN consists of a complex system that needs to encompass estimating the IVIM parameters for all voxels. It turns out that having been trained on all voxels enables better estimates for individual voxels at low SNR. We expect that DNNs focus on more consistent minima with parameter values that are more frequently observed. This might be similar to data-driven Bayesian fitting approaches (15,46). Conversely, our DNN seems to reach a maximum accuracy at high SNR. Potentially more complex DNNs that are optimized with simulations done at high SNR can comprise the subtle signal changes of the IVIM parameters at these SNRs. However, typical SNR values for IVIM data are < 50. Therefore, our findings suggest that using our DNN instead of the LS and Bayesian approaches for IVIM fitting to DWI-data would be beneficial in a clinical setting.
The choice of the hyperparameters for IVIM-NET optim was based on an optimal combination of accuracy, independence and consistency across all IVIM parameters. However, other hyperparameter options may be more appropriate when characterizing an individual IVIM parameter (e.g. when an observer is only interested in D and IVIM is barely used to correct for perfusion). Figures S1-S6 of the Supporting Information can help interested readers select the best network for their purposes.
The high dependency between D * and f that appears in IVIM-NET orig could not be attributed to a single cause. Initially, we expected that this dependency originated in the fully connected shared hidden layers of the original network. Although ρ decreased when adding the 'parallel network architecture' to IVIM-NET orig , ρ remained substantial (Figure 3). These dependencies between estimated parameters are not per se specific to DNNs. For instance, similar dependencies between D * and D or f were found in a different data-driven Bayesian fitting approach (9). For IVIM-NET optim , these dependencies were small at clinical SNR values and similar to those of the LS approach.
Although simulation studies in parameter estimation are extremely valuable as the underlying parameter values are known, they also come with limitations. One limitation is that the noise characterization of real data can be diverse and hard to model. For instance, DWI artifacts caused by motion are not considered in simulations and may affect the results of fitting the IVIM model (47). We are aware that Rician noise could be added to generate more realistic noise. However, to compare models, we believe that adding Rician noise and hence additional systematic errors, only complicates the comparison, as errors from the fit methods could cancel out errors induced by the Rician noise. We do not expect that adding Rician noise would have had a significant effect on our results, as we do not expect any of the fit approaches to deal differently with Rician noise. Hence, we employed Gaussian noise for a fairer and easier comparison between algorithms. Nonetheless, our in vivo results show that in the presence of real noise, IVIM-NET optim is superior to the alternatives. Another limitation is the underlying assumption that the IVIM model is complete and hence that data are perfectly bi-exponential. In reality, the IVIM model is a simplification and real data will be more complex.

Conclusion
We substantially improved the accuracy, independence and consistency of both diffusion and perfusion parameters from IVIM-NET by changing the network architecture and tuning hyperparameters. Our new IVIM-NET optim is considerably faster, and computes less noisy and more detailed parameter maps with substantially better test-retest repeatability for D and f than alternative state-of-the-art fitting methods. Furthermore, IVIM-NET optim was able to detect more individual patients with significant changes in the IVIM parameters throughout CRT. These results strongly suggest using IVIM-NET optim for detection of treatment response in individual patients. To stimulate wider clinical implementation of IVIM, we have shared IVIM-NET online and encourage our peers to test it.      (Table 1). In this example, the input signal, consisting of the measured DWI signal, is feedforwarded either through (A) a parallel network design where each parameter is predicted by a separate fully connected set of hidden layers or (B) the original single fully-connected network design. The blue circles indicate an example of randomly selected neurons for dropout. In this example, the output layer consists of four neurons with either absolute (Eq. 1) or sigmoid activation functions (Eq. 4) whose values correspond to the IVIM parameters. Subsequently, the network predicts the IVIM signal (Eq. 3) which is used to compute the loss function (Eq. 2). With the loss function, the network trains the PI-DNN to give good estimates of the IVIM parameters.      Figure S1: Plots of the estimated IVIM parameters where no Spearman rank correlation coefficient (ρ) can be determined and is set to a ρ of 1.