Accurate and robust segmentation of neuroanatomy in T1‐weighted MRI by combining spatial priors with deep convolutional neural networks

Abstract Neuroanatomical segmentation in magnetic resonance imaging (MRI) of the brain is a prerequisite for quantitative volume, thickness, and shape measurements, as well as an important intermediate step in many preprocessing pipelines. This work introduces a new highly accurate and versatile method based on 3D convolutional neural networks for the automatic segmentation of neuroanatomy in T1‐weighted MRI. In combination with a deep 3D fully convolutional architecture, efficient linear registration‐derived spatial priors are used to incorporate additional spatial context into the network. An aggressive data augmentation scheme using random elastic deformations is also used to regularize the networks, allowing for excellent performance even in cases where only limited labeled training data are available. Applied to hippocampus segmentation in an elderly population (mean Dice coefficient = 92.1%) and subcortical segmentation in a healthy adult population (mean Dice coefficient = 89.5%), we demonstrate new state‐of‐the‐art accuracies and a high robustness to outliers. Further validation on a multistructure segmentation task in a scan–rescan dataset demonstrates accuracy (mean Dice coefficient = 86.6%) similar to the scan–rescan reliability of expert manual segmentations (mean Dice coefficient = 86.9%), and improved reliability compared to both expert manual segmentations and automated segmentations using FIRST. Furthermore, our method maintains a highly competitive runtime performance (e.g., requiring only 10 s for left/right hippocampal segmentation in 1 × 1 × 1 mm³ MNI stereotaxic space), orders of magnitude faster than conventional multiatlas segmentation methods.

Many diverse and fully automated segmentation methods have been proposed. While earlier segmentation methods generally employed various heuristics tailored to the segmentation task at hand, more recent and generally more accurate methods attempt to transfer labels from a set of expertly labelled images (atlases) to the target image. Some such methods have attempted to learn complex mappings between image features and labels using traditional machine-learning-based classifiers (e.g., support vector machines (Boser et al., 1992) and random forests (Breiman, 2001)) combined with hand-crafted feature sets (Morra et al., 2010; Zikic et al., 2012), while others have found success transferring labels using a combination of linear or non-linear image registration with local or non-local label fusion (so-called 'multi-atlas segmentation' methods; Coupé et al., 2011; Heckemann et al., 2006; Iglesias and Sabuncu, 2015). Indeed, many state-of-the-art results (e.g., hippocampus segmentation (Zandifar et al., 2017) and brain extraction (Novosad and Collins, 2018)) exploit a complementary combination of both multi-atlas segmentation and machine-learning methods (e.g., error correction (Wang et al., 2011)), though the computational time required for segmentation is usually correspondingly greater.
More recently, convolutional neural networks (CNNs) (LeCun et al., 1989) have been used for MR image segmentation, obtaining similar or better performance compared to the previous state of the art while requiring only a fraction of the processing time (despite typically long training times). CNNs are particularly attractive because they have the potential to model much more complicated functions without the need for hand-crafted feature sets, instead autonomously learning to extract task-dependent discriminative features from the training data. Also in contrast to traditional machine-learning classifiers, by stacking many convolutional layers sequentially and/or by incorporating down-sampling operations into the network architecture, CNNs have the capacity to model increasingly complex and long-range spatial relationships in the input, contributing to their excellent performance on image segmentation tasks in particular. However, repeated convolutions and/or down-sampling steps produce coarse features, leading to low-resolution segmentations that can be particularly problematic when targeting smaller structures. Explicitly multi-scale architectures, which are capable of preserving local detail while still enabling the modelling of complex long-range spatial relationships, are therefore often preferred. For example, Kamnitsas et al. (2017b) and Ghafoorian et al. (2017a) both adopt multi-scale, multi-path architectures which take as input patches extracted at different resolutions, and perform late fusion between the features extracted at the different resolutions. Other works adapt popular architectures such as the U-Net (Ronneberger et al., 2015) (adapted in Roy et al. (2018)) and DenseNet (Huang et al., 2017) (adapted in Dolz et al. (2017a, 2017b, 2017c)), both of which use skip connections in order to leverage multi-scale information.
Due to the memory limitations of modern graphics processing units (GPUs), volumetric medical images (e.g., MR or computed tomography scans) typically cannot fit into memory whole, and need to be sub-sampled in order to be processed by a CNN. Most commonly, 3D networks are trained on smaller 3D patches, or 2D networks on single 2D slices. Therefore, despite recent architectures which are capable of better modelling complex, long-range and multi-scale spatial relationships in the input, the implicit spatial context available to the network is still limited. It is therefore often useful to explicitly provide the network with additional spatial contextual information. Examples of applications which leverage spatial contextual features include that of Brebisson et al. (2015), which incorporates distances from pre-defined neuroanatomical structures; that of Wachinger et al. (2017), which incorporates spatial and spectral coordinates (by computing eigenfunctions of the Laplace-Beltrami operator on a pre-estimated brain mask); that of Kushibar et al. (2018), which incorporates non-linear-registration-based atlas probabilities; and that of Ghafoorian et al. (2017a), which incorporates a hand-crafted combination of such features. While the addition of such features has been shown to improve performance, their computation is often extremely expensive relative to the time required to apply a trained CNN, limiting the efficiency of these methods.
In this work, we propose a novel CNN-based method for the automated segmentation of neuroanatomy in brain MR images. To maximize the spatial context available to the network, we combine a deep 3D fully convolutional neural network with dense connections for efficient multi-scale processing with explicitly provided spatial contextual information in the form of efficient linear-registration-derived spatial priors. We furthermore regularize our trained networks with a data augmentation scheme based on random elastic deformations, increasing the generalizability of the trained networks, particularly in cases where limited labelled training subjects are available. In contrast to other recent CNN-based methods for MR segmentation, the scope of this work is also distinguished by the development goals of robustness and versatility. As such, we extensively validate our method on three diverse neuroanatomical segmentation tasks, showing in each case consistently more accurate and robust performance compared to state-of-the-art multi-atlas segmentation and other CNN-based methods, while maintaining a highly competitive runtime performance. Importantly, using a scan-rescan dataset, we also demonstrate that our proposed method achieves an accuracy comparable to the scan-rescan reliability of repeated expert manual segmentations.

Baseline network
In contrast to traditional machine-learning classifiers which treat their inputs as unordered vectors, CNNs explicitly treat their inputs as spatially structured images and work by extracting hierarchical and discriminative representations using sequential applications of the core building block known as a 'convolutional layer' (LeCun et al., 1989). The function of a convolutional layer is to convolve its input with multiple learned filters and then apply a non-linear activation function (otherwise, the network would simply learn a linear transform of the input data). Assuming a simplified network architecture consisting only of convolutional layers, the convolutional filter W_l^{k,n} at network layer l is applied across all the m_{l−1} feature maps produced by the previous convolutional layer l − 1, resulting in a new set of feature maps, to which a position-wise non-linear activation function f(·) is applied. For example, the kth output feature map at layer l is given by:

x_l^k = f( Σ_{n=1}^{m_{l−1}} W_l^{k,n} ∗ x_{l−1}^n + b_l^k )    (1)

where m_l is the number of convolutional filters in layer l, x_{l−1}^n is the nth feature map of the input to layer l, W_l^{k,n} is the kth learnable filter, ∗ denotes convolution, and b_l^k is the learnable bias.
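As a concrete illustration of equation (1), the following is a minimal NumPy sketch (not the authors' Theano/Lasagne implementation; the function name and array shapes are our own) of a valid-mode 3D convolutional layer in the usual deep-learning convention (cross-correlation), with a ReLU activation:

```python
import numpy as np

def conv_layer_3d(x, W, b, f=lambda z: np.maximum(z, 0.0)):
    """One 3D convolutional layer with 'valid' padding.
    x: input feature maps, shape (m_prev, D, H, W_dim).
    W: filters, shape (m_l, m_prev, k, k, k); b: biases, shape (m_l,).
    Returns f(sum_n W[k, n] * x[n] + b[k]); output shrinks by k - 1 per axis."""
    m_l, m_prev, k, _, _ = W.shape
    D, H, Wd = x.shape[1:]
    oD, oH, oW = D - k + 1, H - k + 1, Wd - k + 1
    out = np.empty((m_l, oD, oH, oW))
    for ko in range(m_l):
        acc = np.zeros((oD, oH, oW))
        for n in range(m_prev):          # sum over input feature maps
            for dz in range(k):          # slide the k x k x k filter
                for dy in range(k):
                    for dx in range(k):
                        acc += W[ko, n, dz, dy, dx] * \
                               x[n, dz:dz + oD, dy:dy + oH, dx:dx + oW]
        out[ko] = f(acc + b[ko])         # position-wise non-linearity
    return out
```

Chaining eight such 3 × 3 × 3 layers on a 25³ input yields 9³ outputs, matching the baseline architecture described in the text.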
We take as our starting point a 3D fully convolutional network, variants of which have shown success in tasks such as brain tumour and ischemic stroke lesion segmentation (Kamnitsas et al., 2017b), as well as sub-cortical structure segmentation (Dolz et al., 2017b). Instead of using fully connected layers and predicting the label of only one or several voxels for each input patch (Wachinger et al., 2017; Ghafoorian et al., 2017a; Kushibar et al., 2018), fully convolutional networks discard the fully connected layers and produce dense label estimates for whole patches at a time. Consequently, fully convolutional networks have many fewer parameters (and are therefore less prone to overfitting) and preserve the spatial structure of the input. Also following Kamnitsas et al. (2017b) and Dolz et al. (2017b), we entirely avoid down-sampling or max-pooling layers to preserve the spatial resolution of the output segmentations.
Our baseline architecture takes as input a 3D patch of size 25³ with N channels (e.g., different MRI contrasts), and returns a smaller 3D patch label estimate of size 9³ × C centered on the same spatial coordinates in image space, where C is the number of classes. Figure 1 depicts the network architecture schematically, and detailed architectural specifications (including the number of filters and the activation function at each convolutional layer) are provided in Table 1. First, a series of convolutional layers (L1 through L8 in Figure 1 and Table 1) with filters of size 3 × 3 × 3 are applied, without padding and with unit stride (in order to preserve spatial resolution). Each application of a 3 × 3 × 3 convolutional layer therefore reduces the size of the feature maps by 2 voxels in each dimension: after eight such layers, the size of the feature maps is reduced from 25³ to 9³.
While the first layers of a CNN extract high-resolution feature maps which respond to basic local image features such as edges, due to repeated convolution, the feature maps extracted from the deeper layers tend to have lower resolution and respond to more global and abstract image features. Ideally, a classifier should consider features extracted across all scales of the input, i.e., features extracted from each convolutional layer rather than only the last. To this end, similar to Dolz et al. (2017b), we follow Huang et al. (2017) and use a 'dense connection' after layer L8 which consists of the channel-wise concatenation of the feature maps produced by the preceding convolutional layers L1 through L8. In this way, the last convolutional layers following the dense connection have direct access to the multi-scale feature maps produced by each of the preceding convolutional layers, and are therefore capable of maintaining feature maps with high spatial resolution while also considering complex and long-range characteristics of the input. The dense connection also encourages feature re-use and improves the convergence properties of the network during training by providing a more direct path during backpropagation between the calculated loss and the earlier convolutional layers (Huang et al., 2017). We note that since the output feature maps produced by the different convolutional layers have different sizes, only the central 9³ voxels of each feature map are concatenated. The concatenated feature maps are then batch normalized (Ioffe and Szegedy, 2015) to make the first- and second-order statistics of the feature maps consistent across channels, improving convergence.
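The center-cropping and channel-wise concatenation performed by the dense connection can be sketched as follows (a NumPy illustration with our own function names; with eight preceding layers of 32 feature maps each, the concatenation yields 256 channels of size 9³, as in Figure 1):

```python
import numpy as np

def center_crop(fm, size=9):
    """Crop the central size^3 region of a (channels, D, H, W) feature map."""
    _, D, H, W = fm.shape
    zs, ys, xs = (D - size) // 2, (H - size) // 2, (W - size) // 2
    return fm[:, zs:zs + size, ys:ys + size, xs:xs + size]

def dense_concat(feature_maps, size=9):
    """Channel-wise concatenation of the central size^3 voxels of each
    layer's output, since the feature maps have different spatial sizes."""
    return np.concatenate([center_crop(fm, size) for fm in feature_maps], axis=0)
```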
The batch-normalized concatenated feature maps are then further processed by two convolutional layers (L10 and L11 in Figure 1 and Table 1) with filters of size 1 × 1 × 1. These layers serve to model inter-channel (and therefore also multi-scale) dependencies and to reduce the number of feature maps prior to the final classification layer. Dropout (Srivastava et al., 2014) (with drop probability p = 0.1) is applied after both L10 and L11 to help regularize the model. The final classification layer (L12 in Figure 1 and Table 1) processes the resulting features using a set of C filters (where C is the number of classes under consideration) of size 1 × 1 × 1, producing a probabilistic label estimate image p_c of size 9 × 9 × 9 for each class c.
Network parameters (i.e., convolutional filters and biases) are estimated iteratively by optimizing the loss function in equation (2) using a gradient-descent optimizer over mini-batches of size B. The loss function L to be minimized is defined as:

L = J + α ‖W‖₂²    (2)

where α (empirically set to 1 × 10⁻⁴ in our experiments) penalizes the squared l2 norm of the network filters W, reducing overfitting, and J is the categorical cross-entropy loss:

J = −(1/B) Σ_{b=1}^{B} Σ_{v} log p_{v_b}^{c_{v_b}}    (3)

where p_{v_b}^{c} is the output of the final classification layer for voxel v of batch sample b and class c, and c_{v_b} is the corresponding reference label. The training procedure is further detailed in Section 3.4.
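A minimal NumPy sketch of this loss (our own naming and array shapes, not the authors' Theano code; the log-likelihoods are averaged over batch samples and voxels, and α scales the squared l2 norm of the filters):

```python
import numpy as np

def categorical_cross_entropy(p, labels):
    """p: (B, C, V) predicted class probabilities for V voxels per patch;
    labels: (B, V) integer reference labels.
    Returns the mean negative log-likelihood of the reference labels."""
    B, C, V = p.shape
    idx_b = np.arange(B)[:, None]
    idx_v = np.arange(V)[None, :]
    return -np.mean(np.log(p[idx_b, labels, idx_v]))

def total_loss(p, labels, weights, alpha=1e-4):
    """Cross-entropy J plus the l2 penalty on the network filters."""
    return categorical_cross_entropy(p, labels) + \
        alpha * sum((w ** 2).sum() for w in weights)
```

For uniform predictions over two classes, the cross-entropy term equals log 2, a useful sanity check that an untrained network starts near chance level.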

Spatial coordinates
Though other works have explored the effects of explicitly augmenting their architectures with spatial coordinates and/or spatial probability maps, this is typically accomplished by concatenating a single vector to the output of a flattened fully connected layer (Wachinger et al., 2017; Ghafoorian et al., 2017a; Kushibar et al., 2018), which has no analogue in the proposed network. Furthermore, as our fully convolutional network makes predictions for whole patches rather than one or several voxels, the features associated with each voxel should be augmented with their respective spatial coordinates rather than those of the central voxel only. We therefore make use of whole spatial coordinate patches: given an input patch centered on spatial coordinate (x, y, z) in image space, we extract three additional patches (with the same spatial dimensions as the input patch) centered on (x, y, z) from each of three 'coordinate images'. For example, for the x-coordinate image, the value at spatial coordinate (x, y, z) is simply x, and similarly for the y- and z-coordinate images. The spatial coordinate patches are then concatenated with the image intensity patch before being fed into the first layer of the network.

Figure 1: Schematic of the baseline architecture. The network takes as input a 25³ × N patch (here, spatial coordinates patches are concatenated with the input image patch as described in Section 3.2.1) and returns a multi-channel probabilistic label estimate for the central 9³ voxels. The dimensionality of the output of each layer is reported as size × number of feature maps (23³ × 32, 21³ × 32, …, 9³ × 32, 9³ × 256, 9³ × 128, 9³ × 64, 9³ × C). Detailed specifications for each layer are reported in Table 1.
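The coordinate images and the extraction of a coordinate-augmented input patch can be sketched as follows (a NumPy illustration with hypothetical function names; the 25³ patch size follows the baseline architecture):

```python
import numpy as np

def coordinate_images(shape):
    """Three images the size of the volume whose values at voxel (x, y, z)
    are x, y and z respectively; returned with shape (3, X, Y, Z)."""
    return np.indices(shape).astype(np.float32)

def extract_patch_with_coords(image, coords, center, half=12):
    """Concatenate a (2*half+1)^3 intensity patch with the three coordinate
    patches centred on the same voxel (channels-first)."""
    x, y, z = center
    sl = (slice(x - half, x + half + 1),
          slice(y - half, y + half + 1),
          slice(z - half, z + half + 1))
    return np.concatenate([image[sl][None], coords[(slice(None),) + sl]], axis=0)
```

Because the coordinate channels are constant images, they cost essentially nothing to compute, in contrast to the registration- or spectral-based spatial features discussed in the introduction.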

Working volumes
In order to benefit from the explicit incorporation of spatial coordinates into the network input, it is important that all images are spatially aligned. In this work we accomplish this by using a light pre-processing pipeline which incorporates linear registration to a common space. For anatomically well-defined structures which present relatively little variation in shape and location after registration, we can further take advantage of this registration step by defining, given a set of training subjects, a working volume in which patches are extracted when training and applying the networks (we note that during training, we sample an equal number of patches from each class as described in Section 3.4). For each structure of interest c, let U_c denote the union of the labels for structure c across the training subjects. We first obtain a class-specific boundary-like working volume

B_c = (U_c ⊕ D) − (U_c ⊖ D)

where ⊕ and ⊖ denote morphological dilation and erosion, and D is the dilation structuring element (here set to 3 × 3 × 3 voxels). We also define a corresponding positive volume P_c = U_c − (U_c ∩ B_c). We then obtain the final working volume B = ∪_{c=1}^{C} B_c and the final positive volume P = ∪_{c=1}^{C} P_c. Example working volumes are shown in Figure 2.
When training the networks, all patches are drawn from the boundary-like working volume B. When applying a trained network, the final label estimate is obtained by combining the label estimate within the working volume B with the positive mask P. Using such working volumes has several advantages. First, it forces the network to learn from challenging examples, which usually occur near the boundary of the structure of interest. Second, it substantially decreases the time required to apply trained networks by removing areas which are confidently foreground or background from consideration.
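A sketch of the working-volume construction using SciPy morphology (this reflects our reading of the definitions above, with B_c as the dilated union minus the eroded union; it is not the authors' code, and the function names are our own):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def working_volumes(label_maps, structure=np.ones((3, 3, 3), bool)):
    """label_maps: list of integer label images (one per training subject),
    all in the same space. Returns the boundary-like working volume B and
    the positive volume P. For each class c: U_c is the union of that class
    over training subjects; B_c = dilate(U_c) - erode(U_c);
    P_c = U_c minus its overlap with B_c; B and P are unions over classes."""
    classes = sorted(set(np.unique(np.stack(label_maps))) - {0})
    B = np.zeros(label_maps[0].shape, bool)
    P = np.zeros_like(B)
    for c in classes:
        U = np.any(np.stack([lm == c for lm in label_maps]), axis=0)
        Bc = binary_dilation(U, structure) & ~binary_erosion(U, structure)
        B |= Bc
        P |= U & ~Bc
    return B, P
```

By construction, B and P are disjoint: voxels in P are confidently inside a structure and are simply copied to the output, while the network only classifies voxels in the uncertain boundary region B.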

Data augmentation with random elastic deformations
Deep neural networks, which have a high modelling capacity, are particularly dependent on the availability of large quantities of labelled training data in order to generalize well to new unseen test data. In the context of MRI segmentation, low numbers of training samples are typically encountered due to the high cost of generating manually annotated data. To remedy this problem, data augmentation can be used to artificially expand the training set. Commonly, this is accomplished by applying user-specified but label-preserving transformations to the training data, such as reflections, rotations and flips. However, since in the present work all images are linearly registered to a common space and are therefore approximately the same size and in the same orientation, these transformations would be counterproductive. Rather, the relevant differences between linearly registered images are local and non-linear in nature. To create plausible synthetic training samples, we therefore chose to apply random 3D elastic deformations using a method based on Simard et al. (2003).
To generate a random elastic deformation, we first generate a 3D vector field (where each vector element specifies the per-pixel displacement in each of the x, y and z directions respectively) with the same spatial dimensions as the input samples, and then assign each vector element a random value selected from the uniform distribution U(−1, 1). The vector field is then smoothed in each direction using Gaussian kernels with standard deviation σ e (controlling the elasticity of the deformation), normalized to have a mean per-pixel displacement of one, and then multiplied by a constant α i (controlling the intensity of the deformation), producing the final deformation. During training, one such unique random elastic deformation is generated and used to interpolate each training sample (i.e. the image appearance patch, the three spatial coordinates patches, and the reference label image) using linear interpolation. We note that applying linear interpolation introduces a slight blurring in the label images, which itself can be useful as a regularization technique (Szegedy et al., 2016). The parameters σ e and α i were determined using a coarse grid search, detailed in Section 4.1.3.
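The deformation generator can be sketched as follows (NumPy/SciPy, with our own function names; displacements are in voxel units, and the normalization scales the mean per-voxel displacement magnitude so that the final field has mean magnitude α_i):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_elastic_deformation(shape, sigma_e=4.0, alpha_i=2.0, rng=None):
    """Smooth random displacement field: uniform noise in [-1, 1] per
    component, Gaussian-smoothed with std sigma_e (elasticity), normalised
    to unit mean per-voxel displacement, then scaled by alpha_i (intensity)."""
    rng = rng or np.random.default_rng(0)
    field = rng.uniform(-1.0, 1.0, (3,) + shape)
    field = np.stack([gaussian_filter(f, sigma_e) for f in field])
    mean_disp = np.mean(np.sqrt((field ** 2).sum(axis=0)))
    return field / mean_disp * alpha_i

def deform(volume, field, order=1):
    """Warp a volume with the displacement field; order=1 gives the linear
    interpolation used for both image and label patches in the text."""
    grid = np.indices(volume.shape).astype(np.float64)
    return map_coordinates(volume, grid + field, order=order, mode='nearest')
```

The same field would be applied to the appearance patch, the three coordinate patches, and the reference labels of a training sample, so that image and labels deform consistently.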

Training and testing
Network parameters θ are optimized iteratively using RMSProp (Tieleman and Hinton, 2012), an adaptive stochastic gradient descent algorithm, with Nesterov momentum (Nesterov, 1983) (momentum = 0.9) for acceleration. At each training iteration, we sample approximately 2,000 voxels, with an equal number of voxels sampled from each training subject. Since CNNs are sensitive to class imbalance, we sample an equal number of voxels from each structure (background included). Training samples (i.e., whole patches) are then extracted around each selected voxel, and image appearance patches are individually normalized to zero mean and unit standard deviation. All training samples are then randomly shuffled and processed by the network in batches of size B. Network weights are randomly initialized with the Glorot method (Glorot and Bengio, 2010), and all biases are initialized to zero. Training was performed on a single NVIDIA TITAN X with 12 GB of GPU memory. Software was coded in Python using Lasagne (Dieleman et al., 2015), a lightweight library for building and training neural networks in Theano (Al-Rfou et al., 2016).
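The class-balanced voxel sampling and per-patch normalization can be sketched as (NumPy, hypothetical function names):

```python
import numpy as np

def sample_balanced_voxels(label_map, n_per_class, rng=None):
    """Sample an equal number of voxel coordinates from each class
    (background included), countering class imbalance during training."""
    rng = rng or np.random.default_rng(0)
    coords = []
    for c in np.unique(label_map):
        idx = np.argwhere(label_map == c)
        coords.append(idx[rng.choice(len(idx), n_per_class, replace=True)])
    return np.concatenate(coords)

def normalize_patch(patch):
    """Normalise an image appearance patch to zero mean and unit
    standard deviation (epsilon guards against constant patches)."""
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```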
To counter over-fitting we employ the early-stopping technique, whereby a randomly selected validation subject set (taken here to be 20%) is held out from the training subject set. Before training, a fixed validation set is obtained by randomly sampling a fixed number of patches from each validation subject within the working volume. Unlike during the training phase, we extract the validation set uniformly (i.e., without enforcing class balance), so that the distribution of classes in the validation set better approximates the true distribution of classes within the working volume. During training, at each iteration, the average categorical cross-entropy loss (Equation (3)) over the validation set is measured. The final weights for the trained model are taken from the iteration which achieved the lowest validation loss, and training is stopped if the previously attained lowest validation error does not decrease further after 30 iterations. A static learning rate of 2.5 × 10⁻⁴ is used for training the networks. This value was empirically determined in our preliminary experiments following the suggestions of Bengio et al. (2012), i.e., by roughly finding the smallest learning rate which caused training to diverge, and then dividing it in half. When training the baseline network, we process the samples in batches of size B = 128.
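The early-stopping rule amounts to the following check (a minimal formulation of our own with a patience of 30 iterations):

```python
def early_stopping(val_losses, patience=30):
    """Given the validation losses recorded at each iteration so far, return
    the index of the best (lowest) loss and whether training should stop:
    stop once the best loss has not improved for `patience` iterations."""
    best_iter = min(range(len(val_losses)), key=val_losses.__getitem__)
    should_stop = (len(val_losses) - 1 - best_iter) >= patience
    return best_iter, should_stop
```

The final model weights are then those saved at `best_iter`, not those from the last iteration.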
At testing, we apply the trained network with a stride of 4 voxels in each dimension to estimate the labels within the working volume only. The label estimate within the working volume is then combined with the positive mask (see Section 3.2.2), obtaining the final label estimate. We further apply a fast and simple post-processing step which consists of keeping only the largest connected component for each label, thereby eliminating isolated clusters of false positives.
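The post-processing step can be sketched with SciPy's connected-component labelling (our own function names; 26-connectivity is assumed here, which the text does not specify):

```python
import numpy as np
from scipy.ndimage import label as cc_label

def keep_largest_component(mask):
    """Keep only the largest 26-connected component of a binary mask,
    removing isolated clusters of false positives."""
    labelled, n = cc_label(mask, structure=np.ones((3, 3, 3), int))
    if n <= 1:
        return mask.astype(bool)
    sizes = np.bincount(labelled.ravel())[1:]  # component sizes, skip background
    return labelled == (1 + np.argmax(sizes))

def postprocess(label_image):
    """Apply the largest-component rule independently to each structure."""
    out = np.zeros_like(label_image)
    for c in np.unique(label_image):
        if c == 0:
            continue
        out[keep_largest_component(label_image == c)] = c
    return out
```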

Experiments and results
We first assessed the impact of spatial priors, architecture depth and width, and data augmentation on the task of hippocampal segmentation in the ADNI (http://adni.loni.usc.edu) (Mueller et al., 2005) dataset.
To demonstrate the versatility of the proposed segmentation method, we further applied it to sub-cortical segmentation using the IBSR (http://www.nitrc.org/projects/ibsr/) dataset, and to multi-structure segmentation using the OASIS (https://www.oasis-brains.org/) (Marcus et al., 2007) scan-rescan dataset. Dataset and pre-processing specifications are provided in the respective sections below.
We assess segmentation accuracy and reliability using the Dice coefficient, which measures the extent of spatial overlap between two binary images:

Dice(A, R) = 2 |A ∩ R| / (|A| + |R|)

where A is an automatically segmented label image, R is the reference label image, ∩ is the intersection, and | · | counts the number of non-zero elements. We here express the Dice coefficient as a percentage, with 100% indicating perfect overlap. We note that for multi-label images, we compute the Dice coefficient for each structure independently.
Segmentations with high overall overlap may nonetheless have clinically important differences in their boundaries. To measure these differences, we also use the modified Hausdorff distance (MHD) (Dubuisson and Jain, 1994):

MHD(A, R) = max( h(A, R), h(R, A) )

where h(A, R) is the mean of the set of minimum distances between each labelled voxel in A and its nearest labelled voxel in R; h(R, A) is computed similarly. Note that for the MHD, lower values are better.
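Both evaluation metrics can be sketched as follows (NumPy, with our own function names; the brute-force pairwise distance computation in `mhd` is only practical for small masks):

```python
import numpy as np

def dice(A, R):
    """Dice coefficient, as a percentage, between two binary masks."""
    A, R = A.astype(bool), R.astype(bool)
    return 100.0 * 2 * np.logical_and(A, R).sum() / (A.sum() + R.sum())

def mhd(A, R, spacing=1.0):
    """Modified Hausdorff distance (Dubuisson and Jain, 1994): the maximum
    of the two directed mean nearest-voxel distances, in units of spacing."""
    a = np.argwhere(A) * spacing
    r = np.argwhere(R) * spacing
    d = np.sqrt(((a[:, None, :] - r[None, :, :]) ** 2).sum(-1))
    h_ar = d.min(axis=1).mean()  # mean over A of distance to nearest voxel in R
    h_ra = d.min(axis=0).mean()
    return max(h_ar, h_ra)
```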
Finally, we assess the statistical significance of differences between distributions of Dice coefficients or MHD values using non-parametric Wilcoxon signed-rank tests.

Application to hippocampus segmentation: effect of spatial priors, architecture and data augmentation

The hippocampal dataset consists of sixty T1-weighted (T1w) 1.5T scans with manually segmented (Pruessner et al., 2000) left and right hippocampi. Twenty subjects were selected from each of the following clinical subgroups: normal controls, mild cognitive impairment, and Alzheimer's disease. Since this dataset was previously used to compare several state-of-the-art algorithms (Zandifar et al., 2017), for our experiments we use the same data (i.e., previously pre-processed as described in Zandifar et al. (2017)) to enable meaningful comparisons with the results reported in the aforementioned work. Pre-processing consisted of patch-based denoising (Coupé et al., 2008), N3 non-uniformity correction (Sled et al., 1998), linear intensity normalization, and affine registration to the MNI-ICBM152 template with 1 × 1 × 1 mm³ resolution. We trained our networks to segment both the left and right hippocampi. To obtain a segmentation for each subject, we carried out a 5-fold cross-validation.

Effect of spatial priors
Mean Dice and MHD values for several variants of the baseline network (CNN-B) are reported in Table 2. We also include in Table 2 results without the post-processing step (keeping only the largest connected component for each segmented structure, i.e., removing isolated clusters of false positives). Post-processing was crucial for obtaining good performance with CNN-B (e.g., reducing the mean MHD from 4.59 mm to 0.27 mm), but was less important for the methods using either working volumes or spatial coordinates, and still less important for the method incorporating both spatial priors (CNN-SP). For the fairest possible comparison, below we compare the different methods when combined with post-processing.
Augmenting CNN-B with spatial coordinates only (CNN-SC), or performing the training and testing within the working volume only (CNN-WV), both resulted in statistically significant (p ≤ 10⁻³) increases in performance with respect to both mean Dice and mean MHD. Combining both spatial priors (CNN-SP) resulted in the best performance (mean Dice = 91.5%, mean MHD = 0.23 mm), a statistically significant improvement over both CNN-SC and CNN-WV with respect to both mean Dice (p ≤ 0.005) and mean MHD (p ≤ 0.01). Using the working volume also drastically increased computational efficiency, reducing the mean processing time from 28.3 ± 1.7 seconds per subject to 3.5 ± 0.5 seconds per subject.
Example segmentations showing the improvements due to the use of spatial priors are displayed in Figure 3. Note that these segmentations are shown without post-processing (i.e., without keeping only the largest connected component for each label) to better understand the origin of the errors made by the two methods. One can see that without spatial priors, the baseline network CNN-B has more difficulty distinguishing between the left and right hippocampi, and produces isolated clusters of false positives. Both of these errors, as shown in Figure 3, can be largely addressed by supplying the network with adequate spatial context.

Architectural modifications
We assessed whether the performance of the CNN-SP method could be further improved by widening (learning more filters per convolutional layer) or deepening (including more convolutional layers) the network architecture. We note that for the deeper networks, we correspondingly increased the size of the input samples in order to preserve the output size. Also, because of the increased memory requirements associated with the deeper architectures, during training we reduced the batch size to B = 32 as needed.

Figure 3: Example segmentations with and without spatial priors. The subjects with the worst, second worst, median, second best and best overlaps after applying CNN-B are shown for comparison. We note that without the spatial priors, the network has more difficulty distinguishing between the left (overlaid in green) and right (overlaid in blue) hippocampi, and produces isolated clusters of false positives. Errors are overlaid in red in columns three and five.

Quantitative results are reported in Table 3. Widening the network by doubling the number of learnable filters (from 32 to 64) in convolutional layers L1 through L8 produced no appreciable gain in performance. However, deepening the network by increasing the number of 3 × 3 × 3 convolutional layers resulted in a gradual increase in performance with respect to both mean Dice and MHD, with a plateau reached when using eighteen or twenty such convolutional layers (corresponding to input samples with spatial dimensions 39³ or 41³, respectively). We note that the mean runtime of the deepest networks was correspondingly higher (e.g., 10.4 ± 0.5 seconds per subject with twenty 3 × 3 × 3 convolutional layers) compared to CNN-SP (3.5 ± 0.5 seconds per subject). For subsequent experiments, we opt to evaluate the deepest network, and denote the architecture by 'CNN-SP-D'.

Data augmentation
One concern when training deep convolutional neural networks with high modelling capacities is their increased tendency to overfit the training data, thus generalizing poorly when applied to new unseen testing data. As discussed in Section 3.3, data augmentation can be used to synthesize new training data to increase the generalizability of the trained networks. Using the CNN-SP-D architecture (i.e., twenty 3 × 3 × 3 convolutional layers, corresponding to input samples with spatial dimension 41³), we assessed the impact of random elastic deformations for data augmentation in two scenarios: the first, in which all available training subjects are used, and the second, in which only a randomly selected subset (25%, or 12 subjects) of the training subjects in each training fold is used (leaving the test folds unchanged). To determine the two parameters σ_e and α_i associated with our data augmentation scheme (see Section 3.3), we conducted a coarse grid search (applied in the second scenario) over σ_e = {4, 8, 16} mm and α_i = {1, 2, 4, 8} mm, and found the best performance with σ_e = 4 mm and α_i = 2 mm. These parameters are used for data augmentation in the remaining experiments. Using these parameters, example randomly deformed training samples are shown in Figure 4. Quantitative results are reported in Table 4. As expected, the benefit of using data augmentation was largest when fewer training subjects were used: when using only 25% of the available training subjects (12 training subjects per fold), training the networks with data augmentation increased the mean Dice coefficient by 1.1% (p < 1 × 10⁻¹¹) and reduced the standard deviation by 0.8% compared to training the networks without augmentation. The relative increase in mean Dice coefficient was reduced to only 0.2% (p = 0.06) when using all available training subjects. With regard to mean MHD, data augmentation resulted in improved performance only in the low-data regime, reducing the mean MHD from 0.28 mm to 0.24 mm (p < 1 × 10⁻⁸).

Comparison to other methods
We further compared several variants of our CNN-based method with several other popular and/or state-of-the-art segmentation methods on the same dataset, using the segmentations previously produced in the work of Zandifar et al. (2017) (see Table 5), which includes results for four different methods, both before and after applying error correction (EC) (Wang et al., 2011), a machine-learning-based wrapper which attempts to correct systematic errors made by the initial host segmentation method. The methods included are FreeSurfer (Fischl, 2012), ANIMAL (Collins and Pruessner, 2010) (a multi-atlas technique combining non-linear registration with majority-vote label fusion), traditional patch-based (PB) segmentation (Coupé et al., 2011) (a multi-atlas technique combining linear registration with patch-based label fusion), and an augmented approach combining patch-based segmentation with non-linear registration. Compared to the best method from Zandifar et al. (2017), which combines patch-based segmentation with non-linear registration and error correction (PBS + NLR + EC), our best performing method (CNN-SP-D + DA) yielded an improvement of 2.1% in terms of mean Dice and a decrease in mean MHD of 0.17 mm (over both left and right hippocampi), both of which were statistically significant (p ≤ 10⁻⁹). Example hippocampal segmentations comparing PBS + NLR + EC to CNN-SP-D + DA are displayed in Figure 5. We note that our method avoids many of the errors made by the multi-atlas segmentation method, even after applying error correction to the latter.

Applications to sub-cortical segmentation in IBSR dataset and multi-structure segmentation in OASIS scan-rescan dataset
To demonstrate the versatility of our method, we further applied it to sub-cortical segmentation in the IBSR dataset, and multi-structure segmentation in the OASIS scan-rescan dataset. Dataset details are provided in the respective sections below. We note that for the subsequent experiments, the network architecture is almost identical to that of the full proposed method used in the hippocampus segmentation experiments: only the number of output channels C (i.e. the number of classes) was changed for each respective segmentation task.

Figure 5: Example right hippocampus segmentations and respective errors using PBS + NLR + EC and our best performing method CNN-SP-D + DA. The subjects with the worst, second worst, median, second best and best overlaps after applying PBS + NLR + EC are shown for comparison. Errors are overlaid in red in columns three and five.

Sub-cortical segmentation in the IBSR dataset
The IBSR dataset consists of 18 T1w scans which have been previously manually segmented into 32 structures. Of these, following Dolz et al. (2017b), we considered the left and right thalamus, caudate, putamen and pallidum, for a total of 9 classes (one class being background). Though the IBSR images are already roughly aligned, they differ in voxel sizes (ranging from 0.84 × 0.84 × 1.5 mm³ to 1 × 1 × 1.5 mm³) and would likely benefit from a finer-grained registration. However, to demonstrate the robustness of our approach to small misalignments, we opted against this refinement step and used the images without additional pre-processing. To obtain a segmentation for each subject, we carried out a 6-fold cross-validation. We compared several variants of our method to the methods of Dolz et al. (2017b) and to a 2.5D CNN method (Kushibar et al., 2018) that uses non-linear registration to incorporate spatial probability maps. For the method of Dolz et al., we used publicly available automated segmentations (https://github.com/josedolz/3D-F-CNN-BrainStruct/tree/master/Results/IBSR) to calculate performance measures and statistical significance tests, whereas for the 2.5D CNN method we include results exactly as reported in Kushibar et al., but could not perform statistical significance tests since the automated segmentations are not available for download. Lastly, we also include results from our application of FIRST (Patenaude et al., 2011) from the FMRIB Software Library (FSL) (Jenkinson et al., 2012). Table 6 summarizes the performance of the various methods for each sub-cortical structure with respect to mean Dice and mean MHD. Of our CNN-based methods, the performance of the baseline network CNN-B was poorest overall (mean Dice = 86.5% over all 8 structures). Incorporating spatial priors, CNN-SP produced a large (p < 1 × 10⁻⁹) increase in overall performance (mean Dice = 88.5%).
Further deepening the network (CNN-SP-D) produced no significant increase in performance with respect to overall mean Dice (p = 0.76), which can likely be attributed to an increased capacity for over-fitting due to the higher modelling capacity of the deeper network combined with the highly limited training data (i.e. only 15 subjects per cross-validation fold) associated with the IBSR dataset. Indeed, regularizing the deeper network using data augmentation (CNN-SP-D + DA) produced a large (p < 1 × 10⁻⁹) increase in overlap (mean Dice = 89.5% over all structures) compared to CNN-SP-D. A similar pattern was observed with respect to mean MHD.
Comparing our methods to those from other works, our best performing method without data augmentation (CNN-SP-D) performed similarly to the 2.5D CNN approach, achieving segmentations with slightly better overlap in the putamen and pallidum, and slightly worse overlap in the thalamus and caudate. However, we emphasize that because the proposed method does not depend on expensive non-linear registration, the mean runtime of CNN-SP-D (16.4 ± 2.0 s per subject) is much lower than the runtime of approximately five minutes per subject as reported in Kushibar et al. The method of Dolz et al. performed better than our baseline method CNN-B with respect to both mean Dice (p = 3 × 10⁻⁴) and mean MHD (p = 2 × 10⁻⁴), but was outperformed by each of CNN-SP and CNN-SP-D. Finally, our best performing method using data augmentation, CNN-SP-D + DA, performed best out of all six CNN-based methods. However, we note that the data augmentation scheme used in this work is general and could also be used to boost the performance of the other CNN-based methods under comparison.
Example segmentations comparing the approach of Dolz et al. to our best performing method (CNN-SP-D + DA) are shown in Figure 6. Compared to the former method, our method produced generally smoother segmentations and better avoided the more drastic errors exemplified in the first and second rows of Figure 6.

Multi-structure segmentation in the OASIS scan-rescan dataset
The OASIS scan-rescan dataset contains T1w scans of 20 healthy young adult subjects, each scanned on two separate occasions within a period of 90 days. Expert manually generated labels for both session images are available by subscription to Neuromorphometrics (http://www.neuromorphometrics.com). To estimate the scan-rescan reliability of the different labelling methods, second-session images were registered (by estimating a 6-parameter rigid transformation) to the corresponding first-session images, and the labels were propagated using the corresponding rigid transformations with nearest-neighbour interpolation. Accuracy was assessed using a 5-fold cross-validation on images from the first session only.
We note that the estimated scan-rescan reliability of manual labelling will necessarily be lower than the true reliability of manual labelling (i.e. as estimated by comparing multiple manual labellings performed on the same image), the latter being traditionally used as an upper-bound estimate for the attainable accuracy of automated segmentation methods. This is because estimating scan-rescan reliability has two additional sources of error: errors in rigid registration, and artefacts produced during interpolation. Therefore, we consider the scan-rescan reliability of manual labellings as a lower-bound estimate on the true reliability of manual labellings. While the manual labellings for this dataset were carried out in accordance with a strict protocol, no special effort was made to make the boundaries between regions as smooth as possible. In preliminary studies, we noticed that manual reliability estimates appeared artificially low because of these noisy boundaries. We therefore smoothed the manual labels using median filtering with a small 3 × 3 × 3 kernel, which resulted in higher and more reasonable manual reliability estimates.
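This label-smoothing step can be sketched with `scipy.ndimage.median_filter`; the kernel size matches the 3 × 3 × 3 kernel described above, though the original implementation may differ:

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_labels(label_volume, size=3):
    """Smooth a multi-label segmentation with a small median filter
    (here a 3x3x3 kernel) to suppress voxel-level boundary noise in the
    manual labels before estimating scan-rescan reliability."""
    return median_filter(label_volume, size=size)
```

A median filter of this size removes isolated mislabelled voxels while leaving the interior of each structure unchanged.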
For comparison, we also consider FIRST segmentations, since FIRST has been previously shown to be highly reliable (Morey et al., 2010). For compatibility with FIRST segmentations, we considered the left and right thalamus, caudate, putamen, pallidum, hippocampus and amygdala, for a total of 13 classes (one class being background). All CNN-based methods used pre-processed data, with pre-processing consisting of N3 non-uniformity correction, affine registration to the MNI-ICBM152 template space with 1 × 1 × 1 mm³ resolution, and linear intensity normalization. To obtain a segmentation for each subject, we performed a 5-fold cross-validation experiment, applying the same trained network to both session images for each subject in the respective test fold. All label estimates were then resampled back to native space, using nearest-neighbour interpolation, with the inverse of the corresponding transform estimated during the pre-processing stage.
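The text specifies only "linear intensity normalization" without further detail; one common realization is a percentile-based linear rescaling, sketched here as an assumption rather than the authors' exact procedure:

```python
import numpy as np

def linear_intensity_normalization(volume, low_pct=1.0, high_pct=99.0):
    """Linearly rescale intensities so that the given low/high percentiles
    map to [0, 1], clipping the tails. A common choice for T1w MRI; the
    percentile bounds here are illustrative assumptions."""
    lo, hi = np.percentile(volume, [low_pct, high_pct])
    return np.clip((volume - lo) / (hi - lo + 1e-8), 0.0, 1.0)
```

Percentile bounds (rather than the raw min/max) make the mapping robust to a few extreme voxels, e.g. from scanner artefacts.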
The scan-rescan reliabilities of manual labelling as well as the automated methods under comparison are reported in Table 7. Each of the automated methods under comparison produced more reliable segmentations compared to manual segmentation. In general, FIRST was comparably reliable (mean Dice = 91.7%, mean MHD = 0.24 mm, over all 12 structures) to CNN-SP, and more reliable compared to CNN-B (mean Dice = 90.7%, mean MHD = 0.44 mm). Both CNN-SP-D and CNN-SP-D + DA produced the most reliable segmentations (mean Dice ≥ 92.2%, mean MHD ≤ 0.21 mm), and were the most consistently reliable methods, producing small standard deviations on distributions of Dice coefficients and MHD values.
The accuracy of the automated methods under comparison is reported in Table 8. Despite FIRST being a highly reliable method, it was overall the least accurate method under comparison (mean Dice = 78.0%, mean MHD = 0.77 mm, over all 12 structures). This is likely because FIRST does not learn from user-specified training data, but instead incorporates priors derived from its own training data, which may differ with respect to the anatomical protocol used for manual labelling, as well as with respect to the quality of such manual labels. Example segmentations comparing FIRST to our best performing CNN-based method are shown in Figure 7.
Among the CNN-based methods, CNN-B (mean Dice = 84.6%, mean MHD = 0.50 mm) again performed poorest overall, followed by CNN-SP (mean Dice = 85.0%, mean MHD = 0.48 mm). It is worth noting that in this dataset, augmenting the baseline network with spatial priors provided a relatively minor performance gain compared to the previous segmentation tasks. Using a deeper network (CNN-SP-D), however, proved particularly beneficial for this task (mean Dice = 86.0%, mean MHD = 0.42 mm). Finally, CNN-SP-D was further improved using data augmentation (CNN-SP-D + DA), resulting in the best performance (mean Dice = 86.6%, mean MHD = 0.41 mm). Indeed, a Wilcoxon signed-rank test between the accuracy of CNN-SP-D + DA and the reliability of manual labellings (mean Dice = 89.9%, mean MHD = 0.39 mm) showed no significant difference with respect to mean Dice (p = 0.72); however, the mean MHD of manual labellings was slightly but significantly lower (p = 0.02) compared to CNN-SP-D + DA.
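The paired comparison used here can be reproduced in outline with `scipy.stats.wilcoxon`; the per-subject Dice values below are synthetic placeholders, not the study's data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-subject Dice coefficients for two labelling methods on
# the same 20 test subjects (paired samples); values are illustrative only.
rng = np.random.default_rng(0)
dice_method_a = np.clip(rng.normal(0.866, 0.02, size=20), 0.0, 1.0)
dice_method_b = np.clip(rng.normal(0.899, 0.02, size=20), 0.0, 1.0)

# Paired, non-parametric test, as used for the significance values in the text.
stat, p_value = wilcoxon(dice_method_a, dice_method_b)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4g}")
```

The Wilcoxon signed-rank test is appropriate here because the per-subject Dice differences are paired and not necessarily normally distributed.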

Discussion
Figure 7: Example multi-structure segmentations and respective errors using FIRST and our best performing method CNN-SP-D + DA. The subjects with the worst, second worst, median, second best and best overlaps after applying FIRST are shown for comparison. The hippocampus is overlaid in blue, amygdala in brown, thalamus in yellow, putamen in pink, and caudate in light purple. Errors are overlaid in red in columns three and five.

While many of the other segmentation methods compared in this work also performed reasonably well in general, a segmentation method with higher accuracy and consistency (i.e. fewer outliers) is always preferable, as it increases the separability of populations, enabling controlled trials to be executed with fewer subjects and in turn reducing both cost and duration (Cover et al., 2016; Velasco-Annis et al., 2017). Furthermore, high scan-rescan reliability is particularly important when assessing longitudinal differences in both individuals and groups, and good runtime performance is essential for both clinical and large-scale applications. Finally, unlike many other segmentation methods compared in this work, our proposed method has been demonstrated to be highly versatile and can be easily extended to any segmentation task for which labelled training data are available. In each of three different segmentation experiments, we found our method to be more accurate, robust (marked by fewer outliers) and consistent (marked by lower standard deviations on distributions of Dice coefficients) than other state-of-the-art algorithms. Using a scan-rescan dataset, we further demonstrated that the proposed method is highly reliable and produces segmentations of quality comparable to that of expert manual labelling. Because of the high degree of regularity in the location of many neuroanatomical structures when normalized to a common space, spatial context is a powerful tool to exploit in MRI segmentation.
In Section 4.1.1 it was demonstrated that using spatial priors to assist CNN-based segmentation not only improves performance, but also drastically reduces the computation time required for applying a trained CNN. While a major advantage of deep-learning methods over traditional multi-atlas segmentation is their reduced reliance on extensive pre-processing, our method only requires linear registration to a common space, which is fast (requiring about 20 seconds using a single CPU core), robust (Dadar et al. (2018) report a failure rate of under 0.5% for the linear registration algorithm applied in this work), and in many cases necessary for subsequent processing steps. Further performance gains could possibly be obtained by using non-linear registration to a common template: in ideal circumstances, non-linear registration would produce more restrictive working volumes (further reducing processing time when applying a trained CNN) and increase the predictive power of spatial coordinates. However, the use of non-linear registration introduces several practical complications: traditional non-linear registration is extremely computationally expensive relative to the time required to apply a trained CNN, and study-specific templates are often required for robust non-linear registration. Combining our approach with deep-learning approaches for non-linear image registration, which have potential for much better computational efficiency, may be a promising avenue for future work. We emphasize, however, that any performance gains due to the use of spatial priors are to be expected only in proportion to the spatial regularity of the structure of interest. For example, it would not be helpful to use either a working volume or spatial coordinates for the segmentation of brain tumours, which are highly heterogeneous in shape, size, appearance, and location.
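A working volume of the kind discussed above can be derived from a linearly registered spatial prior (e.g. an average of registered training labels) by thresholding the prior and taking a padded bounding box; the threshold and margin below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def working_volume(prior, threshold=0.0, margin=4):
    """Return slices defining a bounding-box 'working volume' around all
    voxels where the spatial prior exceeds `threshold`, padded by
    `margin` voxels on each side (clipped to the volume bounds).
    Restricting inference to this box is what makes applying a trained
    CNN fast for spatially regular structures."""
    idx = np.argwhere(prior > threshold)
    lo = np.maximum(idx.min(axis=0) - margin, 0)
    hi = np.minimum(idx.max(axis=0) + margin + 1, prior.shape)
    return tuple(slice(a, b) for a, b in zip(lo, hi))
```

Usage would then be `crop = working_volume(prior); sub_image = image[crop]`, running the network only inside the crop and pasting predictions back into the full volume.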
Deep networks like the ones used in this work have a high modelling capacity and are therefore more prone to overfitting, particularly when few training samples are available. Indeed, this is commonly the case for tasks such as neuroanatomical segmentation, where generating large quantities of high-quality training data is a very tedious and time-consuming task. While sub-sampling a volume into smaller sub-volumes (patches) is effectively a form of data augmentation, many of the patches extracted from or near a particular structure of a given subject will overlap to a large extent (particularly for small structures) and will therefore be somewhat redundant. Overfitting is therefore still possible (as observed in Section 4.1.3), making more aggressive data augmentation schemes necessary for training networks with good generalizability. While many other techniques have been proposed to deal with limited training data (e.g. fine-tuning networks pre-trained on automatic segmentations, as done in Roy et al. (2018)), we demonstrated excellent performance using a data augmentation scheme based on random elastic deformations. In the future, more advanced deformation-based techniques could be investigated, e.g. learning a more limited space of plausible deformations using statistical modelling techniques (Onofrey et al., 2015; Hauberg et al., 2016); however, these methods often come with additional computational costs which may not be justified given the already excellent performance of our proposed method. Further contributing to the good performance of the proposed method in cases of very limited training data is the relatively low number of parameters associated with our networks (∼5 × 10⁵ parameters in our deep network), limiting the capacity to overfit; for comparison, the original U-Net architecture (Ronneberger et al., 2015) has ∼2 × 10⁷ parameters.
This is in large part due to our choice of using only 32 learnable convolutional filters per layer, since widening the networks showed no appreciable improvement in performance (see Section 4.1.2). On the other hand, increasing the depth of the network (and correspondingly increasing the size of the input patch) resulted in considerable performance gains, which can be attributed to the increased spatial context available to the network in addition to a much higher modelling capacity. Indeed, it has been demonstrated that making networks deeper, as opposed to wider, is a more parameter-efficient way of increasing the modelling capacity of a network (Eldan and Shamir, 2016). While it is likely that the performance of our network could be further improved by fine-tuning the network architecture for specific segmentation tasks, we opted against such an approach to highlight the versatility of this particular network architecture.
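The quoted parameter count is consistent with a back-of-the-envelope calculation for twenty 3 × 3 × 3 convolutional layers of 32 filters each, assuming a single-channel input and ignoring the small final classification layer (an illustration, not the exact architecture specification):

```python
# Rough parameter count for twenty 3x3x3 conv layers with 32 filters each,
# biases included; single-channel input assumed, final C-class layer ignored.
k = 3 ** 3                       # weights per 3x3x3 kernel, per input channel
first = (k * 1 + 1) * 32         # layer 1: 1 input channel -> 32 filters
middle = (k * 32 + 1) * 32       # layers 2-20: 32 channels -> 32 filters
total = first + 19 * middle
print(total)                     # on the order of 5e5, as stated in the text
```

This is roughly forty times fewer parameters than the ∼2 × 10⁷ of the original U-Net, which helps explain the method's resistance to overfitting on small training sets.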
A related problem concerns generalization across datasets. Since the learned convolutional layers (particularly those deeper in the network (Ghafoorian et al., 2017b; Kamnitsas et al., 2017a)) are highly tailored to the peculiarities of the training data, networks trained on a certain dataset commonly perform poorly when applied to an unseen dataset. Nonetheless, robustness to differences between training and testing images (e.g. due to differences in age, health, scanner type, field strength, and/or acquisition sequence) is a highly desirable quality of any method for MRI segmentation. In each of our experiments, we trained and validated our classifiers on challenging multi-centre datasets. However, still more challenging scenarios are commonly encountered in practice; for example, given the often prohibitively high cost of generating high-quality manual labellings, it may be desirable to apply a trained classifier to images that have no adequate representation whatsoever in the training set. Future work will address this problem by leveraging so-called 'domain adaptation' methods (e.g. Ganin et al., 2016; Hoffman et al., 2016) to learn networks which are robust to differences between the training and target image domains, further increasing the general applicability of our approach.