Understanding diffusion‐weighted MRI analysis: Repeatability and performance of diffusion models in a benign breast lesion cohort

Diffusion‐weighted MRI (DWI) is an important tool for oncology research, with great clinical potential for the classification and monitoring of breast lesions. The utility of parameters derived from DWI, however, is influenced by specific analysis choices. The purpose of this study was to critically evaluate repeatability and curve‐fitting performance of common DWI signal representations, for a prospective cohort of patients with benign breast lesions. Twenty informed, consented patients with confirmed benign breast lesions underwent repeated DWI (3 T) using: sagittal single‐shot spin‐echo echo planar imaging, bipolar encoding, TR/TE: 11,600/86 ms, FOV: 180 x 180 mm, matrix: 90 x 90, slices: 60 x 2.5 mm, iPAT: GRAPPA 2, fat suppression, and 13 b‐values: 0–700 s/mm2. A phase‐reversed scan (b = 0 s/mm2) was acquired for distortion correction. Voxel‐wise repeat‐measures coefficients of variation (CoVs) were derived for monoexponential (apparent diffusion coefficient [ADC]), biexponential (intravoxel incoherent motion: f, D, D*) and stretched exponential (α, DDC) across the parameter histograms for lesion regions of interest (ROIs). Goodness‐of‐fit for each representation was assessed by Bayesian information criterion. The volume of interest (VOI) definition was repeatable (CoV 13.9%). Within lesions, and across both visits and the cohort, there was no dominant best‐fit model, with all representations giving the best fit for a fraction of the voxels. Diffusivity measures from the signal representations (ADC, D, DDC) all showed good repeatability (CoV < 10%), whereas parameters associated with pseudodiffusion (f, D*) performed poorly (CoV > 50%). The stretching exponent α was repeatable (CoV < 12%). This pattern of repeatability was consistent over the central part of the parameter percentiles. Assumptions often made in diffusion studies about analysis choices will influence the detectability of changes, potentially obscuring useful information. 
No single signal representation prevails within or across lesions, or across repeated visits; parameter robustness is therefore a critical consideration. Our results suggest that stretched exponential representation is more repeatable than biexponential, with pseudodiffusion parameters unlikely to provide clinically useful biomarkers.


KEYWORDS
breast cancer, breast neoplasms, diffusion magnetic resonance imaging, biomarkers, magnetic resonance imaging, repeatability

| INTRODUCTION
Diffusion-weighted imaging (DWI) is an established component of magnetic resonance imaging (MRI) protocols in oncology. [1][2][3] Deriving meaningful diffusion parameters from signal curves, however, requires an appreciation of the limitations of the diffusion models being used, and of their relationship to tissue microstructure. For clinical evaluation of breast cancer, DWI studies have shown an ability to characterise lesions 4,5 and have also displayed sensitivity for measuring and predicting response to treatment. [6][7][8][9] Commonly, DWI is quantified using the apparent diffusion coefficient (ADC), which is used as an indication of tissue cellularity. It is important to remember that ADC is an empirical parameter based on an assumption of Gaussian (random) diffusion, and that this assumption is an oversimplification in many tissues.
More complex representations of DWI signal attempt to quantify non-Gaussian behaviour either by including perfusion components in the signal (the intravoxel incoherent motion, or IVIM model), 10 or by allowing for a distribution of diffusion coefficients within each voxel (the stretched exponential representation). 11 These representations, summarised in Table 1, have the potential to offer more sensitive imaging biomarkers for treatment response, 12,13 specific to changes in vascularity, compartmentalised and distributed diffusion processes, and contributions from the extracellular matrix and collagen.
Using more advanced DWI makes acquisition and analysis more challenging, requiring the acquisition of a larger number of b-values, and currently additional time for analysis performed outside of the normal clinical data workflow. The representations themselves also have limitations that are often overlooked. Assumptions in models may be violated in certain tissues, and DWI signal is known to be sensitive to acquisition parameters, such as diffusion time (Δ) and echo time (TE), which are often not standardised. 14,15 It is thus important that the cost of performing extended DWI protocols is critically examined in the context of practical utility. While the sensitivity and specificity of DWI-derived biomarkers are critical for clinical decision-making, the repeatability of these parameters is similarly crucial for translation into clinical use. 16 Unfortunately, incorporating repeatability data into clinical studies is challenging and places an extra demand on patients. 17 This is especially true for patients with malignant lesions that may be changing faster and where there is less time for additional MR examinations before treatment.
In this study, we report the findings from fitting the monoexponential (ADC), biexponential (IVIM), and stretched exponential signal representations to diffusion signal acquired with multiple b-values in a cohort of patients with benign breast lesions, and use repeated measurements to assess repeatability. In this setting, the benign breast lesions present a useful and accessible model system, which may be considered simpler and better behaved than malignant lesions, with which to highlight the complexity of diffusion analysis in a wider context.

| Patient cohort and MR protocol
This prospective study was approved by the Regional committee for Medical and Health Research Ethics (REC) of central Norway (identifier 2011/568). Twenty-one informed and consenting patients (median age 26, range 19-50 years) with benign breast lesions (fibroadenomas, confirmed by histopathological assessment with core needle biopsy, or clinical follow-up) were recruited from September 2016 to October 2017.
Alongside a standard clinical protocol, including anatomical (T2-weighted) and dynamic contrast-enhanced imaging (DCE-MRI, with one baseline and seven postcontrast images at a time resolution of 1 min), the clinical MRI examination ('visit 1') included a multiple b-value DWI protocol, which is described below. The diffusion protocol was repeated exactly during an additional examination ('visit 2') after a median of 7 days (range 2-18 days), with no intervening treatment. The hospital where this study was performed does not schedule examinations with regard to the menstrual cycle, and so these data were not recorded. All imaging was performed on a clinical 3-T scanner (Skyra, Siemens Healthineers) equipped with a 16-channel breast coil. The repeatability thus explicitly includes contributions from underlying physiology, patient set-up and scanner performance, to most accurately reflect clinical use. DWI, acquired prior to any contrast administration, used a spin-echo echo planar imaging (SE-EPI) sequence in unilateral sagittal orientation, with parameters including: bipolar diffusion encoding, TR/TE: 11,600/86 ms, FOV: 180 x 180 mm, matrix: 90 x 90, slice thickness: 2.5 mm, slices: 60, iPAT: GRAPPA 2, fat suppression: spectral attenuated inversion recovery (SPAIR), and 13 b-values: 0, 10, 20, 30, 40, 50, 70, 90, 120, 150, 200, 400 and 700 s.mm−2, thereby placing an emphasis on fine sampling of the lowest b-values and retaining higher signal at the maximum diffusion-weighting. Diffusion times were Δ = 40.8 ms and δ = 19.5 ms. An additional phase-reversed, geometry-matched scan without diffusion-weighting (b = 0 s.mm−2) was acquired to allow for correction of geometric distortion arising from susceptibility boundaries. 18 Individual images with orthogonal diffusion gradient directions were acquired, and were not trace-weighted before analysis. The total scan time for the DWI sequence was 9 min.

| Diffusion analysis
Diffusion images were corrected for distortion using the method reported by Teruel et al. 18 Volumes of interest (VOIs) were drawn on a diffusion-weighted image for all slices of the tumour in both examinations, avoiding cystic regions and using DCE and T2 images as a reference.
VOIs were drawn by a scientist with 12 years' experience of cancer imaging (NPJ) and validated by an experienced breast radiologist (AØ). The repeatability of the delineated volume was assessed by a log-normalised repeat-measures coefficient of variation (CoV) suitable for constrained positive values. 17,19 Multiple representations of the DWI signal decay (see Table 1 for details of signal representations) were fitted to the DWI data on a voxel-wise basis, and also to the mean VOI signal in each tumour. The fitted representations included an exponential decay (giving ADC), a biexponential decay (as per the IVIM model, giving the volume fraction f of the pseudodiffusion compartment with coefficient D* in addition to the true diffusion coefficient D), and a stretched exponential decay (giving the stretching exponent α and distributed diffusion coefficient DDC). For these representations, all b-value data were used in the fitting. All signal fitting was performed with a Levenberg-Marquardt algorithm in MATLAB (Mathworks, Natick, MA, USA); for IVIM, initial parameter estimates were provided by a segmented approach; these results are also reported. 4 No parameter constraints were used, to avoid unintentional bias from voxels returning limiting values, and to allow for critical assessment of the information content of the acquired data.
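The fitting procedure described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the study; it assumes NumPy and SciPy are available. The function names, the high-b threshold of 200 s.mm−2 for the segmented step, and the starting value for D* are hypothetical choices made for the example, not values taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

# b-values of the acquisition protocol (s/mm^2)
B_VALUES = np.array([0, 10, 20, 30, 40, 50, 70, 90, 120, 150, 200, 400, 700],
                    dtype=float)


def mono(b, s0, adc):
    """Monoexponential decay (ADC)."""
    return s0 * np.exp(-b * adc)


def ivim(b, s0, f, d, d_star):
    """Biexponential decay (IVIM written with separate D and D*)."""
    return s0 * ((1.0 - f) * np.exp(-b * d) + f * np.exp(-b * d_star))


def stretched(b, s0, ddc, alpha):
    """Stretched exponential decay (DDC, alpha)."""
    return s0 * np.exp(-((b * ddc) ** alpha))


def segmented_ivim(b, signal, b_threshold=200.0):
    """Segmented estimate: a log-linear fit at high b gives D, and the
    extrapolated intercept gives f. b_threshold is an assumed cut-off."""
    high = b >= b_threshold
    slope, intercept = np.polyfit(b[high], np.log(signal[high]), 1)
    s0 = signal[b == 0].mean()
    f = 1.0 - np.exp(intercept) / s0
    return s0, f, -slope


def fit_mono(b, signal):
    """Unconstrained Levenberg-Marquardt fit of the monoexponential."""
    p0 = [signal[b == 0].mean(), 1e-3]
    popt, _ = curve_fit(mono, b, signal, p0=p0, method="lm")
    return popt


def fit_stretched(b, signal):
    """Unconstrained fit; a step to negative DDC would give NaN, so real
    data may need extra handling that is not shown here."""
    p0 = [signal[b == 0].mean(), 1e-3, 0.9]
    popt, _ = curve_fit(stretched, b, signal, p0=p0, method="lm", maxfev=5000)
    return popt


def fit_ivim(b, signal):
    """Unconstrained fit seeded by the segmented estimate."""
    s0, f, d = segmented_ivim(b, signal)
    p0 = [s0, f, d, 10.0 * d]  # D* start of 10 x D is an arbitrary assumption
    popt, _ = curve_fit(ivim, b, signal, p0=p0, method="lm", maxfev=5000)
    return popt
```

Calling `fit_ivim(B_VALUES, signal)` on a single voxel's signal returns the estimates in the order (S0, f, D, D*). Because no bounds are applied, noisy voxels can return non-physical values, which is the behaviour the unconstrained approach is intended to expose.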
TABLE 1 Signal representations used in this work, with discussion of the individual parameters and of the interpretation and applicability of each representation. Note that while the apparent diffusion coefficient (ADC) and intravoxel incoherent motion (IVIM) representations have an underlying physiological meaning and can thus be called models, the stretched exponential is a purely mathematical representation.

Monoexponential (ADC): Diffusion is isotropic and Gaussian in nature, giving monoexponential signal decay with b-value. ADC is the diffusion coefficient in an unimpeded single-compartment system that would lead to the monoexponential DWI signal decay that best matches the observed signal. ADC is sensitive to impedance of water, arising from tissue structure (cellularity, tortuosity, extracellular matrix, etc.). ADC is nonspecific and contains influence from all of these factors.

Biexponential (IVIM): Diffusion is isotropic. Two compartments: (1) a true (Gaussian) diffusion compartment with volume fraction (1-f) and empirical diffusion coefficient D; and (2) a pseudodiffusion compartment of volume fraction f with coefficient D*. Perfusion in microcapillaries is equal in all directions (incoherent) and thus can be modelled as a pseudodiffusion process (i.e. monoexponential). D is perfusion-insensitive, and thus is the true tissue diffusion. f reflects (but is not a direct measure of) tissue perfusion. D* reflects nondiffusion processes, including microcapillary flow.

Stretched exponential (α, DDC): Nonmonoexponential signal decay can be modelled by a summation of a distribution of curves with a continuum of diffusion coefficients. There is no underlying physiological interpretation for either parameter.

Note: for all representations of diffusion-weighted MRI (DWI) signal, S0 is the signal intensity in the absence of diffusion weighting (i.e. b = 0 s.mm−2), but contains a term allowing for T2 decay over the imaging echo time TE, which is assumed constant across all DW images. b is the applied diffusion weighting.
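For reference, the three representations described in Table 1 are commonly written as below. These are the standard literature forms, given here as an assumption about notation rather than as a reproduction of the table's own equations. In particular, some authors write the pseudodiffusion exponent as D + D* rather than D*.

```latex
\begin{align}
  \text{Monoexponential:}       \quad S(b) &= S_0 \exp\left(-b \cdot \mathrm{ADC}\right) \\
  \text{Biexponential (IVIM):}  \quad S(b) &= S_0 \left[ (1-f)\exp\left(-bD\right) + f\exp\left(-bD^{*}\right) \right] \\
  \text{Stretched exponential:} \quad S(b) &= S_0 \exp\left[-\left(b \cdot \mathrm{DDC}\right)^{\alpha}\right]
\end{align}
```

In these forms the biexponential reduces to the monoexponential when f = 0, and the stretched exponential reduces to it when α = 1. The three representations are therefore nested around the same simplest case.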

| Statistical analyses
Following fitting, the optimal representation was determined by Bayesian information criterion (BIC), essentially comparing 'goodness-of-fit' using signal residuals but allowing comparison across representations that contain different numbers of variables (to avoid favouring overfitting).
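A minimal sketch of this comparison is given below. It assumes the common least-squares form of the criterion, BIC = n ln(RSS/n) + k ln(n), with n the number of b-values and k the number of fitted parameters. The exact expression used in the study is not stated, so this form and the function names are assumptions made for illustration.

```python
import math


def bic_from_residuals(residuals, n_params):
    """BIC for a least-squares fit, assuming Gaussian residuals.

    Note: a perfect fit (RSS = 0) would raise a math domain error.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + n_params * math.log(n)


def preferred_representation(residuals_by_model, n_params_by_model):
    """Return the name with the lowest BIC, plus all BIC scores."""
    scores = {
        name: bic_from_residuals(res, n_params_by_model[name])
        for name, res in residuals_by_model.items()
    }
    return min(scores, key=scores.get), scores
```

With equal residuals, the representation with the fewest parameters is preferred (here the monoexponential, with S0 and ADC). A more complex representation is selected only when the reduction in residuals outweighs the k ln(n) penalty.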
Repeat-measures CoV and 95% limits of agreement (LoA) were calculated for all DWI parameters for the values at every percentile across the VOI histogram for voxel-wise results, and for the mean and median VOI values. 17,20 Median diffusion parameters were compared using a Student's t-test, with a significance threshold of 0.05. All statistical analyses were performed in MATLAB using standard functions.
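One common way to compute a repeat-measures CoV for paired, strictly positive measurements is to work on the log scale, as sketched below. The study's exact formulation follows its cited references and is not reproduced here; the expressions and the function name are assumptions for illustration. The limits of agreement shown also assume no systematic bias between visits.

```python
import math


def repeat_measures_cov(visit1, visit2):
    """Log-scale within-subject CoV (%) and 95% limits of agreement (%).

    Inputs are paired, strictly positive values (one pair per lesion).
    """
    diffs = [math.log(a) - math.log(b) for a, b in zip(visit1, visit2)]
    # Within-subject variance for two repeats: mean(d^2) / 2
    var_w = sum(d * d for d in diffs) / (2 * len(diffs))
    cov = 100.0 * math.sqrt(math.exp(var_w) - 1.0)
    # 95% limits for the relative difference between two measurements
    half_width = 1.96 * math.sqrt(2.0 * var_w)
    loa = (100.0 * (math.exp(-half_width) - 1.0),
           100.0 * (math.exp(half_width) - 1.0))
    return cov, loa
```

Applied per parameter to the lesion medians, or to the values at each histogram percentile, this returns a percentage CoV together with asymmetric percentage limits. The asymmetry follows from back-transforming from the log scale.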

| Cohort characteristics and repeatability of lesion volumes
One patient withdrew before completing both scans, and data from one patient were excluded owing to incorrect slice positioning; eight patients had additional lesions within the imaging FOV that were analysed independently, and five lesions were excluded owing to being small or not clearly delineable on DWI, giving a total of 26 lesions with repeated datasets. The volumes of the benign lesions ranged from 0.29 to 25.39 cm3, with a CoV of 13.88%, indicating good VOI repeatability; the Bland-Altman plot is given in Figure 1, along with illustrative lesion delineations in contribution. For summarised DWI results see Table 2, with full details per lesion in Table S1; note that the size of this cohort precludes any useful inference from the values themselves, but that a range of values for each parameter is clearly seen across and within lesions, indicating both inter- and intratumour heterogeneity.

| Repeatability of DWI biomarkers
CoVs for the medians of all DWI parameters for the cohort are given in Table 2; for the measures ADC, D and DDC, the returned CoV is low (<10%). For the pseudodiffusion parameters (f and D*), CoV is far higher (>50%), whereas the CoV of the stretching exponent α is also low. Using

FIGURE 3 (A) Example images showing preferred signal representation choice (lowest Bayesian information criterion value) for each voxel for illustrative tumour slices in repeated visits (the same lesions and slices as Figure 2). There is no overall concordance of the best performing diffusion-weighted MRI (DWI) model within or across slices, or between visits. All cohort lesions contain voxels that are best described by each representation. (B) Real data and fitted curve for apparent diffusion coefficient (ADC), intravoxel incoherent motion (IVIM), and stretched representations for one voxel giving the lowest Bayesian information criterion (BIC) for each (left-to-right); the difference between residuals is not necessarily large.

| DISCUSSION
The results from this study illustrate the complexities of interpreting clinical DWI, and the danger of neglecting repeatability assessments. The high CoVs of pseudodiffusion parameters from the IVIM model indicate that only large differences or changes in these values can be treated as reliably detectable, which will limit their utility for classification and assessment of treatment response. The median IVIM D and ADC were shown to be the most repeatable parameters, as was the central portion of the VOI histogram. It is worth noting that the stretched exponential parameters DDC and α are more repeatable than the IVIM parameters for this b-value range, likely because both parameters are fitted using the full data range. The stretched exponential representation is not a true biophysical model, because α does not have a specific interpretation describing tissue structure, but nevertheless may provide useful biomarkers that are sensitive to microstructural changes. The exponent α acknowledges and attempts to capture tissue complexity by providing an empirical assessment of the apparent range of diffusion coefficient distributions present in a voxel, and it is true that tissue microstructure is on a smaller scale than the typical DWI voxel size. Also, it should not be forgotten that ADC itself is an empirical parameter, as are IVIM parameters, and that an explicit biophysical interpretation is not a requirement for good repeatability or reliable detection of clinically relevant changes. 21 Alongside the variable repeatability of the diffusion parameters, these data show that there is no objectively best-performing signal representation for benign breast lesions. This is true for voxels within individual lesions, for lesions across the cohort, and even for the same lesion between repeated scans, illustrating the difficulty of attempting to compare across scans at a voxel level. 
This is an important result, especially because these lesions are expected to be well behaved (i.e. slow growing), and more homogeneous than a cohort of malignant lesions. Analysis Both have the effect of obscuring the degree of information contained in the data, allowing erroneous inferences (and ultimately conclusions) from the examination. Such assumptions are often built into study design before any data are acquired, and can easily be lost if not attended to carefully. We ourselves recognise that this study implicitly assumes that results should derive from the use of a single representation across all voxels and scans, which, while standard practice and an intuitive choice, is not a given and should be a topic for deeper consideration. 22 The consensus in the extracranial MR imaging community that DWI is able to provide biomarkers more sophisticated than ADC 23 is not yet matched with a similar agreement on how best to perform such analysis. Optimal b-value selection depends on both the nature of the tissue being studied and the intended analysis, [24][25][26][27] and fitting of DWI signal curves can be performed using many different algorithms. 17,28-32 These discussions essentially arise because of the inherent difficulty of obtaining high-quality data at low b-values. Pulsatile motion in larger vessels, as well as cardiac and respiratory motion, decreases data quality and is difficult to fully remove using flow-compensated and triggered techniques. 33,34 Beyond simple summary statistics, histogram and spatial analyses within lesions further increase the demand for high-quality data at a voxel level, or the need for utilising prior assumptions of DWI parameter behaviour. 28,29,35 Some research articles report that the IVIM model provides clinically useful biomarkers [36][37][38][39][40][41] in breast, but a closer reading of some studies shows them supporting the use of D only, and not the pseudodiffusion parameters f and D*. 
[36][37][38]40 Results in these studies might therefore better be described as using a perfusion-insensitive ADC (and not IVIM), which can be acquired using only two b-values (e.g. b = 200 and 800 s.mm −2 ), and is in line with our result that the lowest CoV was obtained for the segmented-IVIM D. In such cases, it may be argued that the imaging time is better spent acquiring multiple averages at fewer points. By contrast, other studies specifically cite pseudodiffusion parameters as providing significant utility, 12,41,42 leaving the issue of the value of the full IVIM model unclear. Methods that split the difference between complexity and brevity, such as the three b-value minimal acquisition 25 that does not attempt to fit D*, and relative enhanced diffusivity, 43 may represent a pragmatic partial solution, at the expense of precluding retrospective alternate analyses.
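The two-point estimate mentioned above follows directly from the monoexponential form. The sketch below uses the b-values from the text's example (200 and 800 s.mm−2) as defaults; the function name is hypothetical. Because both b-values are assumed to lie above the range where pseudodiffusion contributes, the result approximates a perfusion-insensitive diffusion coefficient.

```python
import math


def two_point_adc(s_low, s_high, b_low=200.0, b_high=800.0):
    """Perfusion-insensitive ADC from two diffusion-weighted signals.

    Derived from S(b) = S0 * exp(-b * ADC) evaluated at two b-values,
    so S0 cancels and only the signal ratio is needed.
    """
    return math.log(s_low / s_high) / (b_high - b_low)
```

For example, signals of exp(-0.2) and exp(-0.8) (in arbitrary units) at b = 200 and 800 s.mm−2 correspond to 0.6/600 = 1.0 × 10−3 mm2/s.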
FIGURE 6 Example comparison (from patient 8) of diffusion-weighted MRI (DWI) parameter value agreement across the histogram percentiles; measurements are represented by solid (visit 1) and dashed (visit 2) lines. The light grey shading thus represents discrepancy between volume of interest (VOI) percentiles, indicating poorer repeatability. Unusual values and behaviour of intravoxel incoherent motion (IVIM) parameters (e.g. highlighted pink regions on the truncated graph for f), including values that are not physiologically possible, support the conclusion that this model is not optimal for this tumour. Outlying values exhibited at the extremes of the histogram, even for well-behaved parameters, are indicative of noise in the data, including partial volume effects.

The underlying assumptions associated with the DWI signal representation and algorithm choices are not always questioned, 44 and may not be valid for the chosen tissue. 14 Nonetheless, DWI studies commonly 'lock in' to a certain interpretation before the data are acquired, or any longitudinal changes observed, for example, the formation of necrotic regions following treatment; here, any non-Gaussian model should be robust to observed changes and still give well-behaved parameters that can, for example, capture the collapse of a specific diffusion component. The great potential of DWI in oncology is tempered by a need for critical appraisal of the model applicability in the context of the application, and it is for this reason that repeatability and reproducibility studies are vital. It has been demonstrated that repeatability of the ADC in extracranial oncological applications can be generalised, 16 but similar work for non-Gaussian diffusion analyses remains to be fully explored. 
20 The main limitation of this study is that it reflects only a small cohort of benign breast lesions, some of which were in themselves small, where nonmonoexponential signal behaviour has been shown to be less prominent than for malignant lesions that are of greater clinical interest. 44,45 Nevertheless, the signal curves are clearly shown to be nonmonoexponential, and to dismiss the utility of a benign lesion cohort on the basis that the pseudodiffusion component is smaller illustrates precisely the problem with beginning from an unstated assumption of the correctness of one particular model (in this case IVIM). Fitting the mean VOI signal in these data shows the stretched exponential as the best fit, but this may indicate the heterogeneity among voxels, not necessarily non-Gaussian diffusion itself. This is the premise of the stretched exponential model, which assumes heterogeneity within voxels. Benign lesions are more suited for repeatability studies because they are less likely to change between repeated scans, and thus better report on the repeatability of the elements that comprise the clinical scanning process itself. In reality, it is difficult to acquire repeated scans for malignant lesions given the need to begin treatment promptly, when weighed against the (perception that the) additional MR examination provides no additional benefit to the patient. Studies that follow and build on this work, ideally in larger cohorts including malignant lesions, will add more data to these observations and examine additional repeatability limits arising from the lesion type. Additionally, this study only examined short-term repeatability as per the Quantitative Imaging Biomarkers Alliance (QIBA) definition, 46 although the design aimed to reflect the timescale of expected changes in DWI from successful treatment. It is also important to note that data on menstrual cycle and menopause were not collected, and that it can be expected that this may vary between visits. 
In general, the cohort is taken to be premenopausal from population statistics. 47 Longer term repeatability will also reflect the instability of scanner performance, as well as changes arising from hardware or software. In this study, the effect of specific acquisition parameters on repeatability was not considered, but the chosen protocol is a realistic reflection of clinical DWI protocols and so presents a fair indication of expected CoVs. Reproducibility of DWI parameters will also be important in the context of multiple sites, where a higher variation of patient positioning, scanner hardware and acquisition parameters can be expected and are challenging to standardise. Where gradient nonlinearity is known to affect the accuracy of DWI parameters away from the isocentre, 48,49 consistency of patient positioning is an additional factor to consider, and these data reflect the true expected repeatability for routine clinical examination.
In conclusion, this study unambiguously demonstrates that, for a cohort of patients with benign breast lesions, there is no clearly favoured diffusion signal representation. This applies across the cohort, and within the same lesion scanned repeatedly. Appreciation of DWI measure repeatability in context will be essential in fully realising the potential, and limitations, of more complex DWI biomarkers in the clinical setting.
More complex fitting algorithms that leverage additional prior information are able to generate better parameter maps, 28,29 which may ameliorate poor repeatability to some degree. These techniques, though, do not absolve the experimenter of the responsibility to critically examine precisely what information the data contain, and not merely what they purport to show. In particular, the choice of representation used for analysis is not arbitrary because the resulting parameters have different repeatability characteristics, and will influence the ability to confidently infer physiological changes by DWI. This is illustrated by the different coefficients of variations derived for the investigated parameters, all drawn from the same data. The incorporation of repeated measurements for assessment of parameter repeatability in situ is critical in clinical imaging trials. The same applies to assessment of the type and details of analysis, not just the resulting parameters, and indicates the need for a deliberate effort to fully report acquisition and analysis protocols, which will also allow for more informed retrospective and alternative analyses.