Diffusion MRI microstructure models with in vivo human brain Connectom data: results from a multi-group comparison

A large number of mathematical models have been proposed to describe the measured signal in diffusion-weighted (DW) magnetic resonance imaging (MRI) and to infer properties of the white matter microstructure. However, a head-to-head comparison of DW-MRI models is critically missing in the field. To address this deficiency, we organized the "White Matter Modeling Challenge" during the International Symposium on Biomedical Imaging (ISBI) 2015 conference. This competition aimed at identifying the DW-MRI models that best predict unseen DW data. In vivo DW-MRI data were acquired on the Connectom scanner at the A. A. Martinos Center (Massachusetts General Hospital) using gradient strengths of up to 300 mT/m and a broad set of diffusion times. We focused on assessing the DW signal prediction in two regions: the genu of the corpus callosum, where the fibres are relatively straight and parallel, and the fornix, where the configuration of fibres is more complex. The challenge participants had access to three-quarters of the whole dataset, and their models were ranked on their ability to predict the remaining unseen quarter of the data. In this paper we provide both an overview and a more in-depth description of each evaluated model, report the challenge results, and infer trends about the model characteristics that were associated with high model ranking. This work provides a much needed benchmark for DW-MRI models. The acquired data and model details for signal prediction evaluation are provided online to encourage a larger scale assessment of diffusion models in the future.


Introduction
Diffusion-weighted (DW) magnetic resonance imaging (MRI) can provide unique insights into the microstructure of living tissue and is increasingly used to study the microanatomy and development of normally functioning tissue as well as its pathology. Mathematical models for analysis and interpretation have been crucial for the clinical adoption of DW-MRI. Even though diffusion tensor imaging (DTI) (Basser et al., 1994), which is based on a simple model of the DW-MRI signal, has shown promise in clinical applications (Assaf and Pasternak, 2008), e.g. Alzheimer's disease (Rose et al., 2000), multiple sclerosis (Werring et al., 2000) or brain tumors (Price et al., 2006), a much wider variety of DW-MRI models has been proposed to extract more information from the DW signal.
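As a point of reference for the models discussed below, the DTI model of Basser et al. (1994) predicts the magnitude signal as S = S0 exp(−b gᵀDg), where g is the (unit) gradient direction and D the diffusion tensor. A minimal sketch (illustrative diffusivity values, not taken from this study):

```python
import numpy as np

def dti_signal(S0, D, bvals, bvecs):
    """Predict DW signal under the diffusion tensor model:
    S = S0 * exp(-b * g^T D g)."""
    bvecs = np.asarray(bvecs, dtype=float)
    # apparent diffusion coefficient along each gradient direction
    adc = np.einsum('ij,jk,ik->i', bvecs, D, bvecs)
    return S0 * np.exp(-np.asarray(bvals, dtype=float) * adc)

# Example: a cylindrically symmetric tensor (units mm^2/s, b in s/mm^2)
D = np.diag([1.7e-3, 0.3e-3, 0.3e-3])
S = dti_signal(1.0, D, bvals=[1000, 1000], bvecs=[[1, 0, 0], [0, 1, 0]])
# The signal attenuates more along the principal axis (first direction)
```

Models compared in the challenge generalise this single-tensor picture in many directions (multiple compartments, dispersion, finite pulse widths, relaxation).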
Despite this explosion of DW-MRI models, a broad comparison on a common dataset and within a common evaluation framework is lacking, so little is understood about which models are more plausible representations or explanations of the signal. Panagiotaki et al. (2012) established a taxonomy of diffusion compartment models and compared 47 of them using data from the fixed corpus callosum of a rat acquired on a preclinical system. Later, Ferizi et al. (2014) performed a similar experiment using data from a live human subject, while Ferizi et al. (2013, 2015) explored a different class of models that aim to capture fibre dispersion. Rokem et al. (2015) compared two classes of models using cross-validation and test-retest accuracy. All these studies (Panagiotaki et al., 2012; Ferizi et al., 2014; Rokem et al., 2015) evaluate variations within specific classes of models, with all other variables of the parameter estimation pipeline (i.e. noise model, fitting routine, etc.) fixed. While this provides fundamental insight into which compartments are important in compartment models, questions remain about the broader landscape of models; in particular, which classes of models explain the signal best and how strongly performance depends on the choice of parameter-estimation procedure.
Publicly organized challenges provide a unique opportunity to bring a research community together to gain a quantitative and unbiased comparison of a diverse set of methods applicable to a particular data-processing task. Such challenges have helped to establish a common ground for the evaluation of competing methods in a variety of imaging-related tasks, e.g. registration of MRI brain images (Klein et al., 2009), diagnostic group classification for dementia using structural MRI (Bron et al., 2015), tissue segmentation of the brain (Mendrik et al., 2015) and the prostate (Litjens et al., 2014), segmentation on CT images of thoracic tissue (Murphy et al., 2011), carotid tissue (Hameeteman et al., 2011) and airways (Lo et al., 2012), analysis of fetal ultrasound images (Rueda et al., 2014), and particle tracking (Chenouard et al., 2014). In DW-MRI, public challenges have focused on recovering synthetic intra-voxel fibre configurations (Daducci et al., 2014) or evaluating tractography techniques (Fillard et al., 2011; Pujol et al., 2015), and have been very successful at driving research and translation forward. Another interesting comparison of reconstruction methods using DW-MRI data was based on signal acquired from a physical phantom (Ning et al., 2015).
Here we report on such a community-wide challenge to model the variation of DW-MRI signals at the voxel level in the in vivo human brain. Modelling the diffusion signal is a key step in realising practical and reliable quantitative imaging techniques based on diffusion MRI. The challenge in this area is to extract the salient features from the diffusion signal and relate them to the principal features of the underlying tissue (e.g., in the case of brain white matter, the fibre orientation, axonal packing and axonal size). Three distinct questions arise: i) given the richest possible dataset, sampling the space of achievable measurements as widely as possible, which mathematical model can best capture the intrinsic variation of the acquired signal; ii) which tissue features can be derived from the model; iii) what subset of those features can we estimate within limited acquisition time on a standard clinical scanner, and what dataset best supports such estimates?
The intuition gained from (i) is generalisable over a wide range of applications, while (ii) and (iii) are highly dependent on the MRI study design and the available hardware. Therefore, our challenge focuses on question (i), as an understanding of (i) is necessary to inform (ii) and (iii). To that end, we acquire the richest possible dataset using the most powerful hardware available and the most motivated subject available (UF). Specifically, we use the Connectome scanner (Setsompop et al., 2013), which is unique among human scanners in having 300 mT/m gradients, rather than the 40 mT/m typical of state-of-the-art human scanners.
Preclinical work by Dyrby et al. (2013) highlights the benefits of such strong gradients, and the first results from the Connectome scanner (Duval et al., 2014; Ferizi et al., 2015; Huang et al., 2015) are now starting to verify those findings.
The uniquely rich dataset from Ferizi et al. (2015), acquired on the Connectome system, samples around five thousand points in the space of possible measurements from a standard Stejskal-Tanner DW-MRI sequence. Each DWI has a unique combination of gradient strength, diffusion time, pulse width and echo time; i.e. all the key parameters of the sequence are varied to highlight sensitivity to as many underlying tissue properties as possible. This offers a unique opportunity for the comparison of many different types of models within a common framework, over a very wide range of the measurement space. Using this rich dataset we organized the White Matter Modeling challenge, held during the 2015 International Symposium on Biomedical Imaging (ISBI) in New York. The goal of the challenge was to evaluate and compare the models in two tissue configurations that are common in the brain: 1) a white matter region of interest where fibres are relatively straight and parallel, specifically the genu of the corpus callosum; and 2) a region in which the fibre configuration is more complex, specifically the fornix. Challenge participants had access to three-quarters of each whole dataset; the participating models were evaluated on how well they predicted the remaining 'unseen' part of the data. This kind of model comparison, based on prediction error, is a common and crucial part of the development of any statistical model-based estimation application.
Textbooks on statistical modelling explain how and why such comparisons should be performed to reject models that are theoretically plausible but that the data do not support. As announced before the challenge, the final ranking was based exclusively on the performance on the genu data. In this paper, however, we include results from both the genu and the fornix.
The challenge entries include a wide range of models and estimation procedures. In contrast to earlier model comparisons (Panagiotaki et al., 2012; Ferizi et al., 2014; Rokem et al., 2015), the results provide new insight into which broad classes of model explain the signal best and which features of the estimation procedure are important. Although the sampling of the set of possible techniques is necessarily sparse, as any model could in theory be combined with any estimation procedure and each has many variables, the results provide some surprising and key insights into the benefits of different approaches.
This information is very timely, as recent model-based diffusion MRI techniques, such as NODDI (Zhang et al., 2012), SMT (Kaden et al., 2015, 2016), DIAMOND (Scherrer et al., 2015), DKI (Fieremans et al., 2011) and LEMONADE (Novikov, 2015), are starting to become widely adopted in clinical studies and trials. Despite their success, intense debate continues in the field about the applicability of different models and fitting routines (Jelescu et al., 2016). The insights from this challenge provide key pointers to the important features of the next generation of front-line imaging techniques of this type. Moreover, the data and evaluation routines remain available to form the basis of an expanding ranking of models and fitting routines, and a standard yardstick for future model development.
The paper is organized as follows. We first describe in section 2 the experimental protocol, data post-processing, and preparation of the training and testing data. We then present the methods for ranking the models and tabulate succinctly the various models involved in the competition. We report the challenge results in section 3 and discuss them in section 4; a more detailed description of the models follows in the Appendix, section 7.

The complete experimental protocol
One healthy volunteer was scanned over two non-stop 4 h sessions. The imaged volume comprised twenty 4 mm-thick whole-brain sagittal slices covering the corpus callosum left to right. The image size was 110 × 110 and the in-plane resolution 2 × 2 mm². Forty-five unique and evenly distributed diffusion directions (taken from http://www.camino.org.uk) were acquired for each shell, with both positive and negative polarities; these directions were the same in each shell. We also included 10 interleaved b=0 measurements, leading to a total of 100 measurements per shell. Each shell had a unique combination of ∆ = {22, 40, 60, 80, 100, 120} ms, δ = {3, 8} ms, and |G| = {60, 100, 200, 300} mT/m (see Table 1). The measurements were randomized within each shell, whereas the gradient strengths were interleaved. We visually inspected the images and did not observe any obvious shifts from gradient heating. The minimum possible echo time (TE) for each combination of gradient duration and diffusion time was chosen to enhance SNR at shorter diffusion times, and potentially enables estimation of compartment-specific relaxation constants. The SNR of the b = 0 images was 35 at TE = 49 ms and 6 at TE = 152 ms. To estimate the SNR we used the background signal in the b = 0 images, as well as the signal noise floor, i.e. the residual signal along the fibres at the highest b-value. The two estimates matched reasonably well. More details about the acquisition protocol can be found in Ferizi et al. (2015).
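For reference, the nominal b-value of each shell follows from the Stejskal-Tanner expression b = (γδ|G|)²(∆ − δ/3) for rectangular gradient pulses. A minimal sketch (the ∆, δ, |G| combination below is just one example combination; the actual shell b-values in Table 1 may differ slightly owing to sequence details):

```python
GAMMA = 2.675e8  # proton gyromagnetic ratio, rad s^-1 T^-1

def b_value(Delta, delta, G):
    """Stejskal-Tanner b-value in s/mm^2 for rectangular gradient pulses.
    Delta (diffusion time) and delta (pulse width) in seconds, G in T/m."""
    b_si = (GAMMA * delta * G) ** 2 * (Delta - delta / 3.0)  # s/m^2
    return b_si * 1e-6  # convert to s/mm^2

# e.g. the shortest diffusion time paired with the strongest gradient
b = b_value(Delta=0.022, delta=0.003, G=0.300)  # roughly 1200 s/mm^2
```

Varying all three parameters independently, as done here, decouples b-value from diffusion time, which a fixed-TE, fixed-timing protocol cannot do.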

Post-processing
All post-processing was performed using FSL (Jenkinson et al., 2002). The DW images were corrected for eddy current distortions separately for each combination of δ and ∆ using FSL's Eddy module (www.fmrib.ox.ac.uk/fsl/eddy) with its default settings. The images were then co-registered using FSL's FNIRT package. As the 48 shells were acquired across a wide range of TEs, over two days, we chose to proceed in two steps. First, within each quarter of the dataset (different day, different δ) we registered all the b=0 images together. We then applied these transformations to the intermediary DW images, using tri-linear resampling interpolation. The second stage involved co-registering the four different quarters. To help the co-registration, especially between the two days' images, which also required some through-plane adjustment, we omitted areas of considerable eddy-current distortion by reducing the number of slices from 20 to 5 (that is, leaving two images on either side of the mid-sagittal plane) and reducing the in-plane image size to 75 × 80.

Training and testing data
The data for this work originated from two ROIs, each containing 6 voxels (see Fig. 2). The first ROI was selected in the middle of the genu of the corpus callosum, where the fibres are mostly straight and coherent. The second ROI's fibre configuration is more complex: it lies in the body of the fornix, where two bundles of fibres bend and bifurcate.
The dataset was split into two parts: the training dataset and the testing dataset. The training dataset was fully available for the challenge participants. The testing dataset was retained by the organisers.
The DW signal of the training dataset (36 shells, with acquisition parameters shown in black in Table 1) was provided together with the gradient scheme on the challenge website (http://cmic.cs.ucl.ac.uk/wmmchallenge/). These data were used by the participants to estimate their DW-MRI model parameters. The signal attenuation in the testing dataset (12 shells, with acquisition parameters shown in red in Table 1) was kept unseen. Participants were then asked to predict the signal for the corresponding gradient scheme.
The challenge participants were free to use as much or as little of the training data provided to predict the signal of the test dataset for the six voxels in each ROI. Figure 4 shows the DW signal attenuation for each shell in the genu dataset, with stars in the legend indicating which shells were left out for testing. In this plot, a small number of data points appear as 'outliers' (two such points are indicated with arrows in the bottom-left subplot of Figure 4). Specifically, we counted about 10 of them among more than 4812 measurements, most of them in the b=300 s/mm² shell.
Since these outliers appear to be specific to the b=300 s/mm² shell, and not present in other shells with similar b-values, we attribute them to momentary twitching of the subject rather than to more systematic effects, such as perfusion. Similarly, Figure 6 shows the signal for the fornix region, averaged over the six voxels.
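Several participants excluded such outliers before fitting (see section 4). As a purely illustrative sketch, not the scheme used by any particular entry, one simple way to flag points that deviate strongly from the rest of a shell is a median-absolute-deviation test (the threshold here is a hypothetical choice):

```python
import numpy as np

def flag_outliers(signals, n_mad=4.0):
    """Flag measurements far from the shell median using the median
    absolute deviation (MAD); 1.4826 rescales MAD to a Gaussian sigma.
    A generic stand-in for participants' exclusion schemes."""
    signals = np.asarray(signals, dtype=float)
    med = np.median(signals)
    mad = np.median(np.abs(signals - med)) + 1e-12  # guard against zero MAD
    return np.abs(signals - med) > n_mad * 1.4826 * mad

# 99 well-behaved measurements plus one strongly depressed value
shell = np.r_[np.linspace(0.78, 0.82, 99), 0.2]
print(int(flag_outliers(shell).sum()))  # prints 1: only the low point is flagged
```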

Model ranking
Models were evaluated and ranked based on their ability to accurately predict the unseen DW signal.
Specifically, the metric used was the sum of squared differences between the hidden signal and the predicted signal, corrected for Rician noise (Jones and Basser, 2004):

SSE = Σ_{i=1}^{N} ( S̃_i − √(S_i² + σ²) )²,

where N is the number of measurements, S̃_i is the i-th measured signal, S_i its prediction from the model, and σ is the noise standard deviation.
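A minimal sketch of such a Rician-floor-corrected error metric, in which the model prediction is inflated by the noise floor before comparison with the magnitude data (the exact formulation used by the organisers follows Jones and Basser (2004) and may differ in detail):

```python
import numpy as np

def rician_sse(measured, predicted, sigma):
    """Sum of squared differences between the measured (Rician-distributed
    magnitude) signal and the noise-floor-corrected model prediction:
    SSE = sum_i (measured_i - sqrt(predicted_i^2 + sigma^2))^2."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((measured - np.sqrt(predicted ** 2 + sigma ** 2)) ** 2))

# With sigma = 0 the metric reduces to the ordinary sum of squared errors
err = rician_sse([1.0, 0.5], [1.0, 0.5], 0.0)  # -> 0.0
```

Note that a true zero-signal prediction is not penalised when the measurement sits exactly at the noise floor: rician_sse([1.0], [0.0], 1.0) is also 0.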

Competing models
We give in this section a short summary of competing models in the challenge. Additionally, Table 2 provides a summary of the key characteristics of the competing models. A more detailed description of each model is included in the Appendix, section 7.
• Ramirez-Manzanares: A dictionary-based technique that accounts for multiple fibre bundles, and models the distribution of tissue properties (axon radius, parallel diffusivity) and the orientation dispersion of fibres.
• Nilsson: A multi-compartment model that models isotropic, hindered and restricted diffusion, and accounts for varying (T 1 , T 2 ) relaxation times for each compartment (Nilsson et al., 2012).
• Scherrer: A multi-compartment model in which each compartment is modelled by a statistical distribution of 3-D tensors (Scherrer et al., 2015).
• Ferizi 1 and Ferizi 2 : Two three-compartment models that account for varying T 2 relaxation times for each compartment. For the intracellular compartment, Ferizi 1 models the orientation dispersion using dispersed sticks; Ferizi 2 uses a single-radius cylinder instead (Ferizi et al., 2015).
• Poot: A three-compartment model comprising an isotropic diffusion compartment, a tensor compartment, and a model-free compartment in which an ADC is estimated for each direction independently. T 2 relaxation times are also estimated for each compartment (Poot and Klein, 2015).
• Rokem: A combination of the sparse fascicle model (Rokem et al., 2015) with restriction spectrum imaging (White et al., 2013), which describes the signal arising from a multi-compartment model on a densely sampled spherical grid, using L1 regularization to enforce sparsity.
• Eufracio: An extension of the Diffusion Basis Function (DBF) model that accounts for multiple b-value shells.
• Loya-Olivas 1 and Loya-Olivas 2 : Two models based on the Linear Acceleration of Sparse and Adaptive Diffusion Dictionary (LASADD) technique. Loya-Olivas 1 uses the DBF signal model, while Loya-Olivas 2 uses a three-compartment tissue model. The optimization uses linearized signal models to speed up computation and sparseness constraints for regularisation.
• Alipoor: A model of fourth-order tensors, corrected for T 2 relaxation across different shells. Robust least-squares fitting was applied to mitigate the influence of outliers.
• Sakaie: A two-compartment model of restricted and hindered diffusion with angular variation. A simple exclusion scheme based on the b=0 signal intensity was applied to remove outliers.
• Fick: A spatio-temporal signal model that simultaneously represents the 3-D diffusion signal over varying diffusion times. Laplacian regularization was applied during fitting.
• Rivera: A regularized linear regression model of diffusion encoding variables. This is intentionally built as a simplistic model to be used as a baseline for model comparison.
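Many of the entries above are variations on the multi-compartment theme. As a purely illustrative sketch, not any participant's actual model, a minimal two-compartment "stick plus hindered tensor" predictor for a single fibre population might look like (all parameter values hypothetical):

```python
import numpy as np

def two_compartment_signal(bvals, bvecs, n, f, d_par, d_perp):
    """Toy stick-and-zeppelin predictor: an intra-axonal 'stick' with
    diffusivity d_par along fibre direction n, plus a hindered anisotropic
    extra-axonal compartment with axial d_par and radial d_perp.
    f is the intra-axonal volume fraction."""
    bvals = np.asarray(bvals, dtype=float)
    cos2 = (np.asarray(bvecs, dtype=float) @ np.asarray(n, dtype=float)) ** 2
    stick = np.exp(-bvals * d_par * cos2)
    zeppelin = np.exp(-bvals * (d_perp + (d_par - d_perp) * cos2))
    return f * stick + (1.0 - f) * zeppelin

# Signal parallel vs perpendicular to a fibre along x (b in s/mm^2, D in mm^2/s)
S = two_compartment_signal([1000, 1000], [[1, 0, 0], [0, 1, 0]],
                           n=[1, 0, 0], f=0.6, d_par=1.7e-3, d_perp=0.5e-3)
# Perpendicular signal stays higher: the stick does not attenuate across fibres
```

Real entries add further ingredients on top of this skeleton: compartment-specific T 2 decay, finite-pulse cylinder models, dispersion distributions, or dictionaries of such atoms.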
Results

Figure 8 shows the averaged prediction error in each ROI (top subplot for the genu, bottom subplot for the fornix) and the corresponding overall ranking of the participating models in the challenge. The first six models in the genu ranking performed similarly, each higher-ranked model marginally improving on the prediction error. The prediction error then increased at a clearly higher rate for the subsequent models.
In the fornix dataset, the prediction error was higher than in the genu. For both datasets the first six models were the same, albeit permuted. Most of the models performed similarly in terms of ranking in the genu and fornix cases, e.g. Nilsson (2nd in genu/1st in fornix), Scherrer (3rd/2nd) and Ferizi 2 (4th/4th).
Others performed significantly better in one of the two cases, with Ramirez-Manzanares (1st/6th) being the most notable. Models inevitably vary in their prediction capabilities; some models perform better within a given b-value range but are penalised more in another. Across the models, as the figure shows, the ranking was dominated by the signal prediction accuracy for b-values between 750 and 1400 s/mm²; specifically, the shell carrying the largest weight on this error is the b=1100 s/mm² one. The top-ranking models, nevertheless, were better at predicting the signal for higher b-value images as well. The prediction performance for lower b-value images (<750 s/mm²) in the genu was less consistent across ranks. For example, the models of Rokem and Sakaie outperformed most of the higher-ranking models in this low b-value range.
The fornix is a more complex region than the genu, hence the performance across the shells is less consistent. In the fornix the prediction errors were generally larger than in the genu across all b-values for all models, except Rivera's, which showed the opposite. The prediction errors for the b = 0 images were also larger than in the genu, especially for the highly ranked models of Poot and Ferizi. The prediction errors in the other b-value shells followed the overall ranking of the models more closely.
Figure 10 shows the prediction error for each voxel independently. In the genu plot, the best performing models showed consistently low prediction errors across all individual voxels. These were followed by models with consistently larger prediction errors in all voxels. Most of the lowest ranking models not only had the largest prediction errors, they also showed large variations in prediction performance across voxels. For example, while the model of Loya-Olivas 2 was competitive in voxel 5, it ranked low due to large prediction errors in voxels 4 and 6.
The results in the fornix show a lower consistency of prediction errors between the voxels than in the genu. Specifically, two voxels (3 and 4) showed substantially larger prediction errors and were likely responsible for much of the overall ranking.
Finally, figures 12 and 14 illustrate the quality of fit of each model for 4 representative shells, including the b=1100 s/mm² shell mentioned above.

Discussion
The challenge set out to compare the ability of various kinds of models to predict the diffusion MR signal from white matter over a very wide range of measurement parameters, exploring the boundaries of possible future quantitative diffusion MR techniques. The fourteen challenge entries were a good representation of the many models proposed in the literature. They also used a variety of fitting routines, and so provide additional insight into the effects of algorithmic choices during parameter estimation. Although the set of methods tested is not sufficient for a full comparison of each independent feature (diffusion model, noise model, fitting routine, etc.), and the number of combinations prohibits an exhaustive comparison, the results of the challenge do reveal some important trends.

Main conclusions
The first insight concerns the type of model used. Signal models do not necessarily outrank tissue models; indeed, models of the signal (Alipoor, Sakaie, Fick, Rivera) ranked on average lower than models of the tissue with our dataset, despite their theoretical ability to offer more flexibility in describing the raw signal.
This is quite surprising, as the current perception within the field is that, generally, we can capture the signal variation much better through a functional description of the signal (signal models) rather than via a biophysical model of the tissue (tissue models). The former generally consist of bases of arbitrary complexity, whereas the latter are generally very parsimonious models that rely on extremely crude descriptions of tissue (e.g. white matter as parallel impermeable cylinders). The results suggest that the flexibility of signal models can rapidly lead to over-fitting. However, the tissue models can explain the signal relatively well even with just a few parameters (compare the quality-of-fit plots of the Rivera model in figure 14 to the signal prediction of the top models in figure 12; the higher the b-value, the worse the prediction of the linear signal model). Certain underlying assumptions may cause the signal models to perform less well than expected.
For example, they are often designed to work with data with a single diffusion time and do not generalise naturally to incorporate the additional dimension (although see Fick et al. (2015) for some steps towards generalisation). Many of the tissue models on the other hand naturally account for finite δ, varying diffusion times and gradient strength (e.g. the Ramirez-Manzanares, Nilsson and Ferizi models in our collection).
The second insight concerns the choice of noise modelling. Assuming a non-Gaussian noise model, as three entries did, provided little benefit over a Gaussian assumption. This is likely because much of the data has high SNR. Although signal levels at high b-values do often hit the noise floor, the magnitude of the noise floor is small compared to signals at moderate b-values. In this challenge most participants used non-linear least-squares or maximum-likelihood optimisation. Additional regularisation of the objective function (Eufracio and Rivera: Lasso; Rokem: Elastic Net; Fick: Laplacian) appeared to have little benefit over non-regularised optimisation.
The third observation is about removing signal outliers. Five of the eleven teams preprocessed the training data by removing outliers, including the top two. We tried this procedure with two good models which did not use it, Ferizi 1 and Ferizi 2 , and observed that it did not affect the ranking, though it did marginally improve the prediction error. This is understandable considering the relatively little weight these apparent outliers carry in the total number of measurements (10 points from a 4812-strong dataset). Additionally, specific strategies for predicting the signal, e.g. bootstrapping or cross-validation, as used by the top two models of Ramirez-Manzanares and Nilsson, may also help the model ranking.
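The resampling strategies mentioned above can be sketched generically: refit the model on bootstrap resamples of the training data and average the resulting predictions, which tends to stabilise the estimate against individual noisy measurements. A minimal illustration with a mono-exponential toy model (not the procedure of any specific entry):

```python
import numpy as np

def bootstrap_predict(x_train, y_train, x_test, fit, predict, n_boot=100, seed=0):
    """Average a model's predictions over bootstrap resamples of the
    training data; fit() returns parameters, predict() applies them."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x_train), size=len(x_train))
        preds.append(predict(fit(x_train[idx], y_train[idx]), x_test))
    return np.mean(preds, axis=0)

# Toy example: noiseless mono-exponential decay, log-linear least-squares fit
x = np.linspace(0.1, 2.0, 40)
y = np.exp(-1.5 * x)
fit = lambda xs, ys: np.polyfit(xs, np.log(ys), 1)       # slope, intercept
predict = lambda p, xs: np.exp(np.polyval(p, xs))
y_hat = bootstrap_predict(x, y, np.array([1.0]), fit, predict)
```

With noisy data the averaged prediction trades a little bias for reduced variance; the price, noted under limitations below, is a substantially higher computational burden.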

Limitations and future directions
Although this challenge provides several new insights into the choice of model and fitting procedure for diffusion-based quantitative imaging tools, it has a number of limitations that future challenges might be designed to address. One limitation of the study is that we use a very rich acquisition protocol that is not representative of common or clinical acquisition protocols. In particular, we cover a very wide range of b-values, and the acquisition protocol we use comprises many TEs, unlike many other multi-shell diffusion datasets that use a fixed TE. As stated in the Introduction, our intention is to sample the measurement space as widely as possible to support the most informative models possible. Varying the TE makes it possible to probe compartment-specific T 2 (whose decay Ferizi et al. (2015) find to be monoexponential at the voxel level), an investigation which would be impossible with a single TE. However, the good performance of DIAMOND also shows that a model with fixed δ and ∆ can be used with multi-TE datasets: while the majority of the full data was ignored in each of its reconstructions, its prediction error compared favourably with other techniques.
We use the unique human Connectome scanner (Setsompop et al., 2013) to acquire a dataset with gradients of up to 300 mT/m, which is not readily available in most current MR machines. However, previous preclinical work by Dyrby et al. (2013) suggests that high diffusion gradients enrich the signal, which helps model fitting and comparison. Future challenges might be designed that focus on explaining the signal and estimating parameters from data more typical of clinical acquisitions.
Assessing the prediction performance on unseen data as in this challenge is different from assessing the fitting error: it implicitly penalises models which overfit the data. However, since most of the missing shells lie in-between other shells (in terms of b-values, TEs, etc.), the quality of signal extrapolation was not assessed. We get a glimpse of this from figure 8, where the SSE is unevenly distributed between the b-values.
Here, the shell which bore the largest error is the b=1100 s/mm² one; see also figures 12 and 14. Of all "unseen" shells, this shell combines the lowest ∆ and the highest |G|, placing it on the edge of the range of the measurement space sampled. Such a b-value shell combines high signal magnitude with high sensitivity (i.e. the gradient of signal against b-value is highest in this range), which makes it hard to predict. Future work could take this further by selecting unseen shells outside the min-max range of experimental parameters.
This is likely to penalise more complex models that overfit the data even more strongly.
We did not take into account the computational demand of each model, and this might limit the generalisation of the results. Models that use bootstrapping generally have a higher computational burden and may not be feasible for large datasets, e.g. whole brain coverage.
The dataset used in this challenge is specific to one subject who underwent a long-duration acquisition, which raises further questions of generalisability. The subsequent preprocessing of the data is also a factor to bear in mind: the registration of two 4 h datasets, across such a broad range of echo times, poses its own challenges for certain non-homogeneous regions of the brain, such as the fornix (as compared with, for example, the relatively large genu). Thus the results may be somewhat subject-specific and may be affected by residual alignment errors.
Another limitation is that we only look at isolated voxels inside the corpus callosum and the fornix.
Questions still remain about which models are viable even in the most coherent areas of the brain with the simplest geometry, so we believe our focused challenge on well-defined areas is an informative first step, necessary before extending the idea to the whole of the white matter, which would make for an extremely complex challenge. We note, however, recent work by Ghosh et al. (2016) that illustrates such an approach with Human Connectome Project (HCP) data.
We focused here on comparing models based on their ability to predict unseen data. Although models that reflect the true underlying tissue structure should explain the data well, we cannot infer in general that models that predict unseen data better are mechanistically closer to the tissue than those that do not. As we discuss in the introduction, the main power of evaluating models in terms of prediction error is to reject models that cannot explain the data. Thus, while the identification of parsimonious models that explain the data certainly has great benefit, further validation is necessary through comparison of the parameters that they estimate with independent measurements, e.g. obtained through microscopy (our challenge makes no attempt to assess the integrity of the parameter estimates themselves, but future challenges might use such performance criteria). We note, however, that major difficulties arise in obtaining ground truth in realistic samples. In particular, histology does not provide a perfect ground truth for assessing quantitative non-invasive techniques. There are two good reasons for this: a) the histological preparation process disrupts the tissue from its in vivo state, and b) certain parameters simply cannot be measured histologically, e.g. diffusivity and permeability. Moreover, the results from this study do not immediately translate into the ability of the estimated models to provide useful information about WM microstructure integrity, such as the presence of axonal loss, demyelination or oedema in abnormal tissue. Specifically for tissue models, obtaining a direct histological analogue is often difficult. One good example of this is models that incorporate an axonal radius parameter, which are known to generally overestimate the true axonal radius, as discussed in detail in the literature.

Conclusion
Challenges such as this have great value in bringing the community together and providing an unbiased comparison of wide-ranging solutions to key data-processing problems. They generate new insights and ideas, motivating more directed future studies. The data are publicly available for others to use, with more details of the dataset given on the challenge website http://cmic.cs.ucl.ac.uk/wmmchallenge/. An up-to-date ranking of the models will also be maintained on this website, where additional models can be added after the publication of this article. This will provide an important yardstick for future models and parameter estimation routines.
• U. Ferizi is also supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) of the National Institutes of Health (NIH) under award numbers R21AR066897 and R01AR067789.
• B. Scherrer was supported in part by NIH R01 NS079788, R01 EB019483, R01 EB018988 and BCH TRP Pilot and BCH CTREC K-to-R Merit Award.

• M. Nilsson is supported by the Swedish Foundation for Strategic Research (SSF) grant AM13-0090.
The content is solely the responsibility of the authors, and does not necessarily represent the official views of the funding bodies (EPSRC or NIH).
Lastly, we would like to thank the ISBI 2015 challenge organisers, in particular Stephen R. Aylward (Kitware Inc., USA) and Badri Roysam (University of Houston, USA).

This work builds on the statistical modelling of the apparent diffusion coefficient (Yablonskiy et al., 2003), and tackles the modelling of axon fibre dispersion in single (Axer et al., 2001; Zhang et al., 2012) and multiple fibre bundle cases. The method empirically estimates (rather than imposes) the distribution of tissue properties (axon radius, parallel diffusivity, etc.), as well as the orientational distribution of the bundles. The general framework is as follows:
• estimation of mean principal diffusion directions (PDD) per axon bundle;
• selection of a dense set of orientationally-focused basis directions that capture the discrete nonparametric fibre dispersion.
The dictionary also contains isotropic atoms of the form exp(−q²τ d_iso) for d_iso = 2, 2.1, 2.2, . . . , 4 µm²/ms, and a dot signal that accounts for static proton density. The values of the dictionary atoms above were tuned by cross-validation (Stone, 1974). The compartment sizes β ≥ 0 are computed in a weighted non-negative least-squares formulation: the atoms with β > 0 are those that explain the signal, and the W weights are proportional to the SNR. Overfitting is reduced by a bootstrap procedure (Efron, 1979).
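The weighted non-negative least-squares step described above can be sketched in a few lines of numpy/scipy. The toy b-value grid, atom set and function names below are illustrative, not the authors' actual dictionary or code:

```python
import numpy as np
from scipy.optimize import nnls

def fit_dictionary(signal, atoms, weights):
    """Weighted NNLS: minimise ||W^(1/2) (atoms @ beta - signal)||^2 over beta >= 0.
    `atoms` is an (n_meas, n_atoms) dictionary; `weights` are proportional to SNR."""
    w = np.sqrt(weights)
    beta, _ = nnls(atoms * w[:, None], signal * w)
    return beta

# toy example: isotropic atoms exp(-b * d_iso) over the d_iso grid from the text
b = np.linspace(0.1, 3.0, 20)              # illustrative b-values, ms/um^2
d_iso = np.arange(2.0, 4.05, 0.1)          # 2, 2.1, ..., 4 um^2/ms
D = np.exp(-np.outer(b, d_iso))
s = 0.7 * np.exp(-b * 2.5) + 0.3 * np.exp(-b * 3.5)   # mixture of two atoms
beta = fit_dictionary(s, D, np.ones_like(b))
```

The non-negativity constraint itself induces some sparsity, so only the atoms that genuinely explain the signal receive β > 0.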
The cross-validation experiments indicate that the reconstructions given by the robust fitting of this rich multi-compartment diffusion dictionary allow accurate prediction of non-acquired MR signals for different machine protocols. This is of particular interest for the development of methods able to detect the complex microstructural heterogeneity associated with the different compartments within a voxel. The atoms with coefficients β > 0 depict the empirical distributions, and their orientations indicate nonparametric bundle-dispersion configurations (such as fanning or radially symmetric dispersion). The recovered distributions reveal, for instance, the presence of axon radii only between 1 and 4 µm. One should take into account, however, that since heterogeneous intra/extra-axonal T2 relaxation is not explicitly modeled, the method may compensate for T2 variations by using, for instance, large isotropic d_iso coefficients to accurately fit the signal. For this reason, a direct interpretation of the fitted parameters may be misleading. The use of more specific models is part of ongoing work.

Three modifications were made to this very general model. First, to accommodate potential bias in the b=0 images (which was the case for the fornix data, where deviations of up to 20σ were observed), the prediction for b=0 data was obtained from the median of all signals acquired with identical TE instead of from eq. 3. Second, opposite-direction acquisitions were rescaled by a free model parameter, to allow for potential gradient instabilities inducing differences between a direction and its opposite. Third, models were generated dynamically during fitting by randomly selecting up to four hindered components and up to three restricted components. One isotropic component was always included.
The model was first fitted to half of the diffusion-weighted data (randomly selected), after which outliers were rejected (> 2.5σ). Thereafter a second fit was performed. Both fit steps assumed Gaussian noise and utilized the 'lsqcurvefit' function in Matlab. The procedure was repeated 100 times for different randomly generated models.
To prepare the submission, only the models that best predicted the hidden half of the data were selected, after which the median of the selected predictions was used as the final prediction.
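The fit-reject-refit scheme above can be sketched as follows, using scipy's curve fitting in place of Matlab's lsqcurvefit; the mono-exponential toy model and all names are purely illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def two_pass_fit(model, x, y, p0):
    """Fit, reject residuals beyond 2.5 sigma, then refit on the retained
    points, assuming Gaussian noise as in the text."""
    p1, _ = curve_fit(model, x, y, p0=p0)
    resid = y - model(x, *p1)
    keep = np.abs(resid) <= 2.5 * resid.std()
    p2, _ = curve_fit(model, x[keep], y[keep], p0=p1)
    return p2, keep

# toy usage: a mono-exponential decay with one corrupted point
monoexp = lambda x, A, d: A * np.exp(-d * x)
x = np.linspace(0.0, 3.0, 40)
y = monoexp(x, 2.0, 1.2)
y[5] += 1.0                                  # inject an outlier
p, keep = two_pass_fit(monoexp, x, y, p0=[1.5, 1.0])
```

In the submission this two-pass fit was repeated for 100 randomly generated models, and the median over the best-predicting models was taken.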

Scherrer (Harvard, USA): Distribution of anisotropic microstructural environments in diffusion compartment imaging (DIAMOND)
DIAMOND models the set of tissue compartments in each voxel by a finite sum of unimodal continuous distributions of diffusion tensors. This corresponds to a hybrid tissue model that combines biophysical and statistical modeling. As described in Scherrer et al. (2015), the DW signal S_k for a gradient vector g_k and b-value b_k is modeled by

S_k = S_0 Σ_{j=1}^{N} f_j (1 + b_k g_k^T D_j^0 g_k / κ_j)^{−κ_j},

where N is the number of compartments, f_j is the relative fraction of occupancy of the j-th compartment, and κ_j and D_j^0 are respectively the concentration and the expectation of the j-th continuous tensor distribution. DIAMOND enables assessment of compartment-specific diffusion characteristics such as the compartment FA (cFA), the compartment RD (cRD) and the compartment MD (cMD). It also provides a novel measure of microstructural heterogeneity for each compartment.
The estimation of a continuous distribution of diffusion tensors requires DW data acquired with the same timing parameters δ and ∆ (Scherrer et al., 2015). To compare DIAMOND to other models on this dataset, we fitted separately one DIAMOND model for each {δ, ∆} group (i.e., for each TE group), leading to 12 DIAMOND models. One shell was missing in each TE group; we predicted its signal using the corresponding DIAMOND model. The model estimation was achieved as follows. We first computed the mean and standard deviation of S_0 (µ_S0 and σ_S0) within each TE group and discarded DW signals whose intensity was larger than µ_S0 + 3σ_S0 (simple artefact correction). We then estimated the DIAMOND parameters as described in Scherrer et al. (2015), considering Gaussian noise and cylindrical anisotropic compartments. For the genu we considered a model with one freely diffusing and one anisotropic compartment; for the fornix we considered a model with one freely diffusing compartment and two anisotropic compartments.
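As an illustrative sketch, the per-compartment prediction can be written as below. The power-law form is our reading of the continuous tensor-distribution signal in Scherrer et al. (2015) and should be checked against that paper; all names and toy values are assumptions:

```python
import numpy as np

def diamond_signal(S0, fractions, kappas, D0s, bvals, gdirs):
    """Each compartment j contributes f_j * (1 + b g^T D0_j g / kappa_j)^(-kappa_j);
    as kappa_j -> inf this tends to the mono-exponential exp(-b g^T D0_j g),
    so kappa acts as a heterogeneity (concentration) parameter."""
    S = np.zeros(len(bvals))
    for f, kappa, D0 in zip(fractions, kappas, D0s):
        adc = np.einsum('ij,jk,ik->i', gdirs, D0, gdirs)   # g^T D0 g per measurement
        S += f * (1.0 + bvals * adc / kappa) ** (-kappa)
    return S0 * S

# sanity check: one nearly-Gaussian compartment reduces to a tensor decay
g = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([1.0, 2.0])                     # ms/um^2, illustrative
D0 = np.diag([1.7, 0.4, 0.4])                # um^2/ms, illustrative
S = diamond_signal(1.0, [1.0], [1e6], [D0], b, g)
```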

Ferizi 1 and Ferizi 2 (UCL, England)
This submission uses two three-compartment models, as described in previous studies (Ferizi et al., 2013). These models consist of: 1) either a Bingham distribution of sticks or a cylinder for the intracellular compartment; 2) a diffusion tensor for the extracellular compartment; 3) an isotropic CSF compartment.
The T2 relaxation element is fitted beforehand, to the (variable echo time) b=0 measurements. The signal model is

S = S_0 [ f_i e^{−TE/T2_i} S_i + f_e e^{−TE/T2_e} S_e + f_c e^{−TE/T2_c} S_c ],

where f_i, f_e and f_c are the weights of the normalised intracellular, extracellular and CSF compartment signals S_i, S_e and S_c, respectively; the compartmental T2 values are indexed similarly; S_0 is the proton density signal (which is TE-independent, and obtained from fitting to the b=0 signal). These models, as shown in the figure below, emerged from previous studies (see references below). Here, however, a single white matter T2 and separate compartmental diffusivities are additionally fitted.
There is a two-stage model fitting procedure. The first step estimates the T2 decay rate of tissue, separately in each voxel, by fitting a bi-exponential model to the b=0 intensity as a function of TE, in which one component is from tissue and the other from CSF. A preliminary analysis of voxels fully inside WM regions shows no significant departure from mono-exponential decay; equal T2 values are therefore assumed for the intra- and extracellular compartments. When fitting the bi-exponential model, the T2 of CSF is fixed to 1,000 ms (a more precise CSF value is unlikely to be estimable with this protocol). Thus, for each voxel, the volume fraction of CSF, the S_0 and the T2 of the tissue are estimated. These three estimates are then fixed for all subsequent model fits. Each model is then fitted using the Levenberg-Marquardt algorithm with an offset-Gaussian noise model.
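A minimal sketch of this first fitting stage, with hypothetical echo times and the CSF T2 fixed to 1,000 ms as described above:

```python
import numpy as np
from scipy.optimize import curve_fit

T2_CSF = 1000.0  # ms, fixed as in the text

def b0_model(te, S0, f_csf, T2_tissue):
    """b=0 signal vs echo time: tissue + CSF bi-exponential, CSF T2 fixed."""
    return S0 * ((1 - f_csf) * np.exp(-te / T2_tissue) + f_csf * np.exp(-te / T2_CSF))

# hypothetical b=0 measurements at a set of echo times
te = np.array([50.0, 70.0, 90.0, 110.0, 130.0, 150.0])
y = b0_model(te, 1.0, 0.15, 70.0)
(S0, f_csf, T2_tissue), _ = curve_fit(b0_model, te, y, p0=[1.0, 0.1, 80.0],
                                      bounds=([0, 0, 10], [10, 1, 300]))
```

The three fitted quantities (S_0, CSF fraction and tissue T2) would then be held fixed in all subsequent model fits, as the text describes.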

Poot (Erasmus, the Netherlands)
This submission uses a three-compartment model in which each compartment has a different complexity of diffusion model and an individual T_2 value. The model was developed specifically for the ISBI WM challenge and is the result of iteratively visualizing different projections of the residuals and trying to infer the maximum complexity that the rich data supports.
The first compartment models isotropic diffusion and, through the initialization procedure, captures the fast diffusion components. The second compartment is modelled by a second-order (diffusion) tensor and models intermediate diffusion strengths. The third compartment is model-free, as the ADC is estimated for each direction independently. Each compartment additionally has an individual T_2 value and signal intensity at b = 0, TE = 0 (which could easily be translated into volume fractions). Hence, the complete model of a voxel in image j is given by

S_j = exp(ln A_1 − TE_j R_{2,1} − b_j c) + exp(ln A_2 − TE_j R_{2,2} − b_j g_j^T D g_j) + exp(ln A_3 − TE_j R_{2,3} − b_j h_j^T d),   (5)

where S_j is the predicted signal intensity of image j, A_i is the non-diffusion-weighted signal intensity of compartment i at zero TE, TE_j is the echo time, R_{2,i} is the reciprocal of the T_2 relaxation time of compartment i, c is the isotropic ADC, D is the diffusion tensor, d is a vector with the ADC value of each orientation group, and h_j is a vector that selects the orientation group to which image j belongs (90 groups in total). Note that h_j has at most one nonzero element and that element has a value of one. As displayed in the right-most part of Eq. (5), the model can be written as a multiplication of matrices M_i, containing all rows M_{i,j}, with θ = [ln A_1, R_{2,1}, c, ln A_2, R_{2,2}, D_11, D_12, D_13, D_22, D_23, D_33, ln A_3, R_{2,3}, d]^T, which combines all 103 parameters into a single parameter vector. All parameters are simultaneously estimated from the provided 3311 measurements per voxel by a maximum likelihood estimator that assumes a Rician distribution of the measurements and simultaneously optimizes the noise level (Poot and Klein, 2015). The exact initialization and details of the optimization procedure are provided in the online supporting material. Finally, the signal intensities of the 'unseen' data are predicted by substituting the estimate into Eq. (5).
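A per-measurement evaluation of this three-compartment prediction might look as follows. The parameter layout mirrors the 103-element θ described in the text; the function itself is our sketch of the model, not the authors' implementation:

```python
import numpy as np

def predict_signal(theta, te, b, g, group):
    """Three-compartment prediction: isotropic ADC c, tensor D, and a
    per-orientation-group ADC vector d selected by the group index.
    theta = [lnA1, R21, c, lnA2, R22, D(6), lnA3, R23, d(90)]."""
    lnA1, R21, c = theta[0:3]
    lnA2, R22 = theta[3:5]
    Dv = theta[5:11]                                  # D11 D12 D13 D22 D23 D33
    D = np.array([[Dv[0], Dv[1], Dv[2]],
                  [Dv[1], Dv[3], Dv[4]],
                  [Dv[2], Dv[4], Dv[5]]])
    lnA3, R23 = theta[11:13]
    d = theta[13:]                                    # 90 orientation groups
    return (np.exp(lnA1 - te * R21 - b * c)
            + np.exp(lnA2 - te * R22 - b * (g @ D @ g))
            + np.exp(lnA3 - te * R23 - b * d[group]))

# toy check: at b = 0, TE = 0 the prediction is simply A1 + A2 + A3
theta = np.zeros(103)
theta[0], theta[3], theta[11] = np.log(0.5), np.log(0.3), np.log(0.2)
g = np.array([1.0, 0.0, 0.0])
S_b0 = predict_signal(theta, te=0.0, b=0.0, g=g, group=0)
```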
Rokem (Stanford, USA): A restriction spectrum sparse fascicle model (RS-SFM)
The Sparse Fascicle Model (SFM; Rokem et al., 2015) is a member of the large family of models that account for the diffusion MRI signal in the white matter as a combination of signals due to compartments corresponding to different axonal fiber populations (fascicles), and other parts of the tissue. Model fitting proceeds in two steps. First, an isotropic component is fit. We model the effects of both the measurement echo time (TE) and the measurement b-value on the signal. These are fit as a log(TE)-dependent decay with a low-order polynomial function, and a b-value-dependent multi-exponential decay (including also an offset to account for the Rician noise floor). The residuals from the isotropic component are then deconvolved with the perturbations in the signal due to a set of fascicle kernels, each modeled as a radially symmetric (λ_2 = λ_3) diffusion tensor. The putative kernels are distributed in a dense sampling grid on the sphere. Furthermore, Restriction Spectrum Imaging (RSI; White et al., 2013) is used to extend the model by adding a range of fascicle kernels at each sampling point, with different axial and radial diffusivities, capturing diffusion at different scales. To restrict the number of anisotropic components (fascicles) in each voxel, and to prevent overfitting, the RS-SFM model employs the Elastic Net algorithm (EN; Zou and Hastie, 2005), which applies a tunable combination of L1 and L2 regularization on the weights of the fascicle kernels. We used elements of the SFM implemented in the dipy software library (Garyfallidis et al., 2014) and the EN implemented in scikit-learn (Pedregosa et al., 2011). In addition, to account for differences in SNR, we implemented a weighted least-squares strategy whereby each signal's contribution to the fit was weighted by its TE, as well as the gradient strength used.
EN has two tuning parameters determining: 1) the ratio of L1-to-L2 regularization, and 2) the weight of the regularization relative to the least-squares fit to the signal. To find the proper values of these parameters, we employed k-fold cross-validation (Rokem et al., 2015), leaving out one shell of measurement in each iteration for cross-validation. We determined that the tuning parameters with the lowest LSE (Panagiotaki et al., 2012) provide an almost-even balance of L1 and L2 penalty with weak overall regularization. Because of the combination of a dense sampling grid (362 points distributed on the sphere), and multiple restriction kernels (45 per sampling point), the maximal number of parameters for the model is approximately 16,300, more than the number of data points.
However, because regularization is employed, the effective number of parameters is much smaller, resulting in an active set of approximately 20 regressors (Zou et al., 2007). We have made code to fully reproduce our results available at https://arokem.github.io/ISBI2015.
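The regularized deconvolution step can be sketched with scikit-learn's ElasticNet. The random design matrix below stands in for the actual fascicle-kernel matrix, and the tuning values merely echo the near-even L1/L2 balance with weak overall regularization reported above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_meas, n_kernels = 300, 2000                  # many more kernels than measurements
X = rng.standard_normal((n_meas, n_kernels))   # stand-in fascicle-kernel matrix
w_true = np.zeros(n_kernels)
w_true[[10, 500, 1500]] = [1.0, 0.8, 0.5]      # three "fascicles" present
y = X @ w_true + 0.01 * rng.standard_normal(n_meas)

# near-even L1/L2 ratio, weak overall regularisation
en = ElasticNet(alpha=1e-3, l1_ratio=0.5, fit_intercept=False, max_iter=5000)
en.fit(X, y)
active = np.count_nonzero(en.coef_)            # effective number of regressors
```

Despite the nominal parameter count exceeding the number of measurements, the L1 part of the penalty drives most weights to exactly zero, which is the mechanism behind the small active set mentioned in the text.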
The DBF model is reformulated by substituting φ_ij and T_j. The first exponential can be defined as a scale factor that depends on the b-value, β_i = exp(−b_i χ_2 q_i^T q_i). In this way, the β_i factors are associated with the different b-values, so the new model includes the information of multi-shell schemes. The coefficients α and the shell scale factors β are computed by solving the optimisation problem (6), where C is the set of indices grouped by b-value (and #C is the number of elements in it). The regularization term weighted by λ_α demands sparseness and the term weighted by λ_β prevents over-fitting. The problem in eq. (6) is solved in three steps. First, the active atoms (α_i > 0) are predicted with α̂ = argmin_α f(α, β_c; λ_α, λ_β). Second, the active atoms are corrected with α = argmin_{{α_i}: α_i > 0} f(α, β_c; 0, λ_β). Finally, the factors β_c are updated with β_c = argmin_{β_c} f(α, β_c; λ_α, λ_β). Each step is solved with the active-set algorithm for quadratic programming.
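The first two steps (a sparse nonnegative fit, then an unpenalised refit restricted to the active atoms) can be sketched as below. We use a simple projected-gradient solver rather than the active-set QP of the text, and omit the β update; all names and toy data are illustrative:

```python
import numpy as np
from scipy.optimize import nnls

def sparse_nn_fit(Phi, s, lam, n_iter=500):
    """Nonnegative lasso min_{a>=0} ||Phi a - s||^2 + lam * sum(a), solved by
    projected gradient (for a >= 0 the L1 norm is simply sum(a))."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ a - s) + lam
        a = np.maximum(a - step * grad, 0.0)
    return a

def debias(Phi, s, a, tol=1e-8):
    """Refit the active atoms (a > tol) without the sparsity penalty."""
    act = np.flatnonzero(a > tol)
    coef, _ = nnls(Phi[:, act], s)
    out = np.zeros_like(a)
    out[act] = coef
    return out

# toy usage with a random nonnegative dictionary and two true atoms
rng = np.random.default_rng(2)
Phi = rng.random((30, 12))
a_true = np.zeros(12)
a_true[[2, 7]] = [1.0, 0.5]
s = Phi @ a_true
a_hat = debias(Phi, s, sparse_nn_fit(Phi, s, lam=1e-3))
```

The debiasing refit is what the "correction" step achieves: it removes the shrinkage that the sparsity penalty introduces on the surviving coefficients.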
To train the model on the WMM'15 data, eq. (6) is solved for each voxel with the training data to find the optimal weights α_j and scale factors β_c that best reproduce the training data. For this challenge, the β_c factors are grouped by the 36 training shells and the method parameters are set by hand: λ_α = 0.5, λ_β = 0.02, χ_1 = 9.5 × 10⁻⁴ and χ_2 = 5 × 10⁻⁵. To predict the unseen signal at each voxel, the reformulated model is used with the optimal weights α_j, and the 12 scale factors β_c for the unseen shells are calculated by interpolation from the 36 optimal β_c of the training data.

LASADD is a multi-tensor-based technique that dynamically adapts the Diffusion Functions (DF) dictionary to a DW-MRI signal (Loya-Olivas et al., 2015; Loya, 2015). The method changes the size and orientation of the relevant Diffusion Tensors (DTs). The optimisation algorithm uses a special DT expression and assumptions to reduce the computational cost.
The one-compartment version (LASADD-1C) is based on the DBF multi-tensor model (Ramirez-Manzanares et al., 2007), with tensors of the form

T_j = χ_1j v_j v_j^T + χ_2j I,   (7)

where χ_1j and χ_2j are scalars associated with the eigenvalues, v_j is the Principal Diffusion Direction (PDD), and I is the identity matrix. The algorithm iterates three steps, as in Aranda et al. (2015a,b). An extra refinement of the computed results, named LASADD-3C, splits each detected DF into three compartments (Sherbondy et al., 2010): intracellular (IC), extracellular (EC) and CSF. The multi-tensor models the directional IC compartment diffusion for each fibre bundle using T_j^IC = χ_0j v_j v_j^T. The EC compartment with hindered diffusion uses representation (7) for θ_i,j. The isotropic diffusion ω_i uses T^CSF = χ_3 I. This stage keeps the PDDs fixed and only adjusts the α's and χ's of the three compartments.
The parameters of the models were estimated using the training dataset, with the b-values computed from the equation of Stejskal and Tanner (1965).

The DMRI signal is modeled as a fourth-order symmetric tensor, as proposed by Özarslan and Mareci (2003).
Let g_i and a_i be a gradient encoding direction and the corresponding design vector, respectively. The diffusion signal is then described by S(g_i) = S_0 exp(−b d(g_i, t)), where S(g_i) is the measured signal when the diffusion sensitizing gradient is applied in the direction g_i, S_0 is the observed signal in the absence of such a gradient, b is the diffusion weighting factor, and t ∈ R^15 contains the distinct entries of the fourth-order symmetric tensor, with d(g_i, t) = a_i^T t. Note that d(g_i, t) = d(g_i) is used for simplification.
Given measurements in N > 15 different directions, the least squares (LS) estimate of the diffusion tensor is t̂ = (G^T G)^{−1} G^T y, where G is the N × 15 matrix G = [a_1 a_2 ··· a_N]^T and y_i = −b^{−1} ln(S(g_i)/S_0).
We use the weighted LS tensor estimation method in (Alipoor et al., 2013) to mitigate the influence of outliers.
To estimate the diffusion signal for a given acquisition protocol with TE = TE_x, b = b_x and δ = δ_x, the two non-diffusion-weighted measurements with the closest TEs to TE_x (among measurements with δ = δ_x) are used to estimate T_2 and S_0 for each voxel. Then, data from the closest shell to b_x (among shells with δ = δ_x) are used to estimate the tensor describing the underlying structure.
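The LS step can be sketched as follows. The design-vector construction (degree-4 monomials weighted by their multiplicities, so that d(g, t) = a^T t) is our reading of the fourth-order tensor parameterisation and should be checked against Özarslan and Mareci (2003):

```python
import numpy as np
from itertools import combinations_with_replacement
from collections import Counter
from math import factorial

def design_vector(g):
    """Design vector a for a fourth-order symmetric tensor: one entry per
    distinct index combination (15 in 3D), each weighted by its multiplicity."""
    a = []
    for idx in combinations_with_replacement(range(3), 4):
        mult = factorial(4)
        for c in Counter(idx).values():
            mult //= factorial(c)
        a.append(mult * np.prod([g[i] for i in idx]))
    return np.array(a)

def fit_tensor(gdirs, y):
    """Plain LS estimate of t from y_i = -ln(S_i/S_0)/b (the authors use the
    weighted variant of Alipoor et al., 2013 to down-weight outliers)."""
    G = np.array([design_vector(g) for g in gdirs])
    return np.linalg.lstsq(G, y, rcond=None)[0]

# toy check: noiseless synthetic data is reproduced exactly by the LS fit
rng = np.random.default_rng(3)
dirs = rng.standard_normal((30, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
G = np.array([design_vector(g) for g in dirs])
t_true = rng.standard_normal(15)
y = G @ t_true
t_hat = fit_tensor(dirs, y)
```

Note that on the unit sphere the 15 monomials are linearly dependent (since (g_x² + g_y² + g_z²)² = 1), so `lstsq` returns the minimum-norm solution; the fitted signal is nevertheless exact.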
Sakaie-Tatsuoka-Ghosh (Cleveland, USA): An Empirical Approach
As the extent of q-space in the dataset is unusually comprehensive, we chose a simple, generic approach to gain intuition. Visual inspection suggested the use of a restricted and a hindered component, each with angular variation:

S_i = A_{TE_i} [ f R_i + (1 − f) exp(−b_i D_i) ],

where S_i is the predicted signal for data acquired with TE_i and b_i, and A_{TE_i} is the median signal at a given TE with no diffusion weighting. Fit parameters are f, the volume fraction of the restricted component R_i, and D_i, the diffusivity. R_i and D_i are modeled as spherical harmonics with real, antipodal symmetry (Alexander et al., 2002) with maximum degree 4. The model has 31 fit parameters for each voxel. Data were fit using a nonlinear least squares algorithm (lsqcurvefit, MATLAB). Prior to the fit, data points with nonzero b-value whose signal was higher than the median of the b=0 signal plus 1.4826 times the median absolute deviation were excluded. Shells with normalized median signal smaller than that of shells with lower b-values were also excluded. Normalization was performed by dividing by the median of the b=0 signal with the same TE.
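The outlier-exclusion rule (median of the b=0 signal plus 1.4826 times its median absolute deviation, where 1.4826 is the usual factor that makes the MAD a consistent estimate of a Gaussian σ) is easy to state in code; the toy numbers below are illustrative:

```python
import numpy as np

def keep_mask(signal, bvals):
    """Keep b=0 points and any diffusion-weighted point whose signal does not
    exceed median(b=0) + 1.4826 * MAD(b=0)."""
    b0 = signal[bvals == 0]
    med = np.median(b0)
    mad = np.median(np.abs(b0 - med))
    return (bvals == 0) | (signal <= med + 1.4826 * mad)

# toy example: one physically implausible DW point above the b=0 level
signal = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 50.0, 105.0])
bvals = np.array([0, 0, 0, 0, 0, 1000, 1000])
mask = keep_mask(signal, bvals)
```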

Fick (INRIA, France): A Spatio-Temporal Functional Basis to Represent the Diffusion MRI Signal
We use our recently proposed spatio-temporal (3D+t) functional basis to simultaneously represent the diffusion MRI signal over the three-dimensional wave vector q and diffusion time τ. Based on Callaghan's theoretical model of spatio-temporal diffusion in pores (Callaghan, 1995), our basis represents the 3D+t diffusion signal attenuation E(q, τ) as a product of a spatial and a temporal functional basis,

E(q, τ) = Σ_{jlm} Σ_o c_{jlmo} S_jlm(q, u_s) T_o(τ, u_t),

where T_o is our temporal basis with basis order o and S_jlm is the spatial isotropic MAP-MRI basis (Özarslan et al., 2013) with radial and angular basis orders j, l and m. Here N_max and O_max are the maximum spatial and temporal orders of the bases, which can be chosen independently. The bases themselves are formulated in terms of the spatial and temporal scaling functions u_s and u_t, the spherical harmonics Y_l^m and a Laguerre polynomial L_o. We calculate the spatial scaling u_s by fitting an isotropic tensor to the TE-normalized signal attenuation E(q, ·) for all q. Similarly, we compute u_t by fitting an exponential e^{−u_t τ/2} to E(·, τ) for all τ. We fit our basis using Laplacian-regularized least squares in the following steps: we first denote Ξ_i(q, τ, u_s, u_t) = S_jlm(i)(q, u_s) T_o(i)(τ, u_t) with i ∈ {1 . . . N_coef}, where N_coef is the number of fitted coefficients.
We then construct a design matrix Q ∈ R^{N_data × N_coef} with Q_ki = Ξ_i(q_k, τ_k, u_s, u_t). The signal is then fitted as c = argmin_c ‖y − Qc‖² + λ U(c), with y the measured signal, c the fitted coefficients and λ the weight of our analytic Laplacian regularization U(c). We used generalized cross-validation (Craven and Wahba, 1978) to find the optimal regularization weight λ in every voxel. In our submitted results, we used a spatial order of 8 and a temporal order of 4, resulting in 475 fitted coefficients.
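The regularized fit with a GCV-selected λ can be sketched as follows. Here an identity matrix stands in for the analytic Laplacian penalty, and the GCV score follows Craven and Wahba's n·RSS/(n − tr S)² form; all toy data are assumptions:

```python
import numpy as np

def gcv_fit(Q, y, U, lambdas):
    """Solve c = argmin ||y - Q c||^2 + lam * c^T U c for each candidate lam
    and keep the one with the lowest generalized cross-validation score."""
    best_score, best_lam, best_c = np.inf, None, None
    for lam in lambdas:
        A = Q.T @ Q + lam * U
        c = np.linalg.solve(A, Q.T @ y)
        S = Q @ np.linalg.solve(A, Q.T)          # smoother ("hat") matrix
        resid = y - Q @ c
        score = len(y) * (resid @ resid) / (len(y) - np.trace(S)) ** 2
        if score < best_score:
            best_score, best_lam, best_c = score, lam, c
    return best_lam, best_c

# toy usage: identity penalty standing in for the Laplacian matrix
rng = np.random.default_rng(4)
Q = rng.standard_normal((50, 10))
c_true = rng.standard_normal(10)
y = Q @ c_true + 0.1 * rng.standard_normal(50)
lam, c = gcv_fit(Q, y, np.eye(10), [1e-4, 1e-2, 1.0, 100.0])
```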

Rivera (CIMAT, Mexico): Baseline Method: Robust Regression
We regard this very simplistic model as a baseline for the model-based methods. It assumes as little information as possible about the diffusion signal. The vector of independent variables x contains the gradient strength g, the echo time TE and the b-value b. Given the signals s_i, we then estimate the parameters of the linear regression model (12), where θ ∈ R^23 is the unknown vector of coefficients, ε is the residual error and X is the design matrix (x|² is obtained by squaring each element of the matrix x). To account for outliers, we estimate θ with a weighted (robust) least squares approach using lasso regularization (13), where W_0 is the identity matrix and each subsequent W is computed via (14), with outlier weighting ω_i^{t+1} = κ²/(κ² + (y_i − Xθ_i^{t+1})²), where κ is an arbitrary parameter that controls the outlier sensitivity. The protocol weight v_i^{t+1} = mean_{j∈Ω_i}{ω_j^{t+1}}, where Ω_i groups the measurements of the same protocol, computes a confidence factor for the complete protocol.
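The reweighting loop can be sketched as below; we keep only the outlier weights ω and a plain weighted LS solve, omitting the lasso penalty and the per-protocol factor v, and all toy data are illustrative:

```python
import numpy as np

def robust_fit(X, y, kappa=1.0, n_iter=3):
    """Iteratively reweighted least squares with Cauchy-type weights
    w_i = kappa^2 / (kappa^2 + r_i^2), iterated a fixed number of times."""
    w = np.ones(len(y))
    theta = None
    for _ in range(n_iter):
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        r = y - X @ theta
        w = kappa ** 2 / (kappa ** 2 + r ** 2)   # down-weight large residuals
    return theta

# toy usage: a straight line with a single gross outlier
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x
y[9] += 50.0                                     # corrupt one point
theta_robust = robust_fit(X, y, kappa=1.0)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

The corrupted point receives a near-zero weight after the first pass, so the subsequent fits essentially ignore it, while an ordinary LS fit remains strongly biased.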
The equations (13) and (14) are iterated three times. The final estimated signal is computed using (12) with the protocol of the unseen signal.

[Figure: example model fits plotted against the cosine of the angle between the applied diffusion gradient vector G and the fibre direction n. Two data points omitted as outliers by some models are indicated with vertical arrows in the bottom-left subplot; each model has its own criteria for determining outliers.]