Direct cortical thickness estimation using deep learning‐based anatomy segmentation and cortex parcellation

Abstract Accurate and reliable measures of cortical thickness from magnetic resonance imaging are an important biomarker to study neurodegenerative and neurological disorders. Diffeomorphic registration‐based cortical thickness (DiReCT) is a known technique to derive such measures from non‐surface‐based volumetric tissue maps. ANTs provides an open‐source method for estimating cortical thickness, derived by applying DiReCT to an atlas‐based segmentation. In this paper, we propose DL+DiReCT, a method using high‐quality deep learning‐based neuroanatomy segmentations followed by DiReCT, yielding accurate and reliable cortical thickness measures in a short time. We evaluate the methods on two independent datasets and compare the results against surface‐based measures from FreeSurfer. Good correlation of DL+DiReCT with FreeSurfer was observed (r = .887) for global mean cortical thickness compared to ANTs versus FreeSurfer (r = .608). Experiments suggest that both DiReCT‐based methods had higher sensitivity to changes in cortical thickness than Freesurfer. However, while ANTs showed low scan‐rescan robustness, DL+DiReCT showed similar robustness to Freesurfer. Effect‐sizes for group‐wise differences of healthy controls compared to individuals with dementia were highest with the deep learning‐based segmentation. DL+DiReCT is a promising combination of a deep learning‐based method with a traditional registration technique to detect subtle changes in cortical thickness.


| INTRODUCTION
The human cerebral cortex, a thin ribbon of gray matter constituting the outer layer of the cerebrum, is on average about 2.5 mm thick (Fischl & Dale, 2000). Cortical thickness decreases with normal aging (Salat et al., 2004), a process that is known to be accelerated in neurodegenerative diseases including dementia (Young et al., 2020).
The pattern of atrophy progression may enable to differentiate the underlying form of dementia, but also to characterize mild cognitive impairment (Karas et al., 2004). In the case of Alzheimer's disease (AD), the onset is usually located in the transentorhinal cortex and extending into the temporal lobe (Braak & Braak, 1991;Kulason et al., 2019) before spreading to other regions in the brain in a welldefined sequence in later stages of the disease (Thompson et al., 2003). Numerous studies have demonstrated that cortical thickness can serve as a surrogate marker for the underlying pathological changes (Frisoni, Fox, Jack, Scheltens, & Thompson, 2010;Singh et al., 2006;Whitwell et al., 2008). Quantitative morphometry and its regional patterns of atrophy are therefore considered a potential biomarker of clinical interest (Dickerson et al., 2009;Young et al., 2020).
The accuracy of these methods has been evaluated and compared to FreeSurfer by others (Clarkson et al., 2011;Tustison et al., 2014).
Registration-based solutions rely on good tissue segmentation of white-matter (WM), gray-matter (GM), and cortical (sulcal) cerebrospinal fluid (CSF). Under the assumption that the interfaces of WM/GM and GM/CSF share a common topology, the WM/GM boundary is deformed toward the GM/CSF boundary using a diffeomorphic registration and a thickness map is derived from the distance between corresponding points (Das et al., 2009). An open-source implementation of this diffeomorphic registration based cortical thickness (DiReCT) algorithm is available in ANTs  as part of the ANTs cortical thickness pipeline (Tustison et al., 2013) which applies DiReCT to segmentations derived from Atropos (Avants, Tustison, Wu, Cook, & Gee, 2011), an atlas-based segmentation method.
Deep-learning (DL) (LeCun, Bengio, & Hinton, 2015) is a promising technique for medical image analysis (Litjens et al., 2017), with image segmentation currently being the most used application of DL in neuroimage analysis (Yao, Cheng, Pan, & Kitamura, 2020). However, the adoption of deep neural networks for advanced tasks like extraction of biomarkers or direct prediction of diagnosis is challenged by the lack of interpretability, especially for clinical applications where trust in the model and traceability of the results is required (Ching et al., 2018).
For neuroanatomy segmentation, numerous recent publications have shown the superiority of DL over traditional methods (Dalca, Balakrishnan, Guttag, & Sabuncu, 2019;Roy et al., 2019;Wachinger, Reuter, & Klein, 2018). We hypothesize that registration-based thickness measures benefit from more accurate and reliable segmentations than usually available from atlas-based methods. Therefore, we propose DL+DiReCT: A Deep learning-based anatomy segmentation including parcellation of the cortex followed by DiReCT to derive cortical thickness measures directly from T1-weighted (T1w) MRI as outlined in Figure 1. We demonstrate its reliability with extended validation experiments on two datasets and show its potential to detect regional patterns of atrophy in dementia patients.  (Huang, Liu, Van Der Maaten, & Weinberger, 2017) in a U-net (Ronneberger, Fischer, & Brox, 2015) like structure. The model was trained with a total of 840 T1w MRI by combining publicly available F I G U R E 1 DL+DiReCT: Deep learning-based neuroanatomy segmentation followed by a diffeomorphic registration to estimate cortical thickness from MRI and internal data. From public datasets, we used data from 160 nine to ten year old children from ABCD (Casey et al., 2018), 160 healthy adults from IXI (brain-development.org/ixi-dataset), and 160 elderly people from ADNI (Jack Jr et al., 2008). Internal data from our institution (Inselspital) from previous studies comprised 160 healthy controls, 128 patients with multiple sclerosis, 48 patients with epilepsy, and 24 cases with Parkinson's disease. The input data was minimally preprocessed (see Section 2.2.1). For the supervised training, the following 96 weak labels from FreeSurfer 6.0 were used: left/right cerebral white matter, cortical gray matter including its Desikan-Killiany parcellations (Desikan et al., 2006), lateral ventricle, cerebellum (WM + GM), accumbens area, amygdala, caudate, hippocampus, pallidum, putamen, thalamus, ventral DC, and the central structures 3 rd ventricle, 4 th ventricle, brainstem, and corpus callosum. As we use a multi-label classification scheme accounting for class imbalances (Cui, Jia, Lin, Song, & Belongie, 2019), a voxel may be assigned more than one label, which allows robust identification of the cortical gray matter and an independent assignment of parcellation labels. The model was trained with focal loss (Lin, Goyal, Girshick, He, & Dollár, 2017) and a cosine annealing learning rate schedule for 100 epochs and a batch size of two.

| FreeSurfer
Results for FreeSurfer were generated using the recon-all pipeline of FreeSurfer 6.0 (Fischl, 2012) running on Linux on a single CPU. No manual corrections were made. Regional mean cortical thickness was extracted from the surface statistics (lh.aparc.stats, rh.aparc.stats).
Using the output from FreeSurfer, minimal preprocessing of the original T1w images was performed: resampling into FreeSurfer's space (mri_vol2vol) followed by an application of the brain mask from FreeSurfer. These 1 mm iso-voxel images with skull-stripped brains serve as input for the two methods below.

| ANTs
The results for ANTs were generated with the default cortical thickness pipeline (antsCorticalThickness.sh) (Tustison et al., 2013) running on Linux restricted to a single CPU. The publicly available OASIS-30_Atropos_template (Klein, 2016;Klein & Tourville, 2012) was used as population level template for Atropos (Avants et al., 2011). The resulting output is a voxel-wise volumetric thickness map.

| DL+DiReCT
Preliminary results by directly using the probability maps from the DL model as input for DiReCT suggested that preprocessing is required such that the input is more alike a hard segmentation. In particular, we actually used a binary image for the WM as suggested by Clarkson et al. (2011), which is reasonable knowing that the WM is the moving image in the registration that does not change topology (Das et al., 2009).
Where P w is the sigmoid of the logit output for the classification of WM labels, and equivalently P g for GM by taking the maximum logit of the cortical GM, Amygdalae, and Hippocampi labels, we calculated the input for DiReCT for every voxel x in the image volume as follows: In the hard segmentation P seg (Equation (1)), 2 corresponds to GM and 3 to WM, and argmax returns the position of the largest element starting at index 0. The probability maps P w 0 (Equation (2)) for WM and P g 0 (Equation (3)) for GM were constructed such that there is a well-defined WM/GM boundary. These preprocessing steps were determined empirically on an independent internal validation set.
From these volumes, the thickness map of DL+DiReCT was calculated by a diffeomorphic registration using DiReCT with convergence settings identical to ANTs.

| Parcellation-wise average cortical thickness
From the volumetric voxel-wise thickness map of ANTs and DL+DiReCT, we calculated average cortical thickness statistics for regions of interest (ROI) according to the Desikan-Killiany (DK) atlas (Desikan et al., 2006), providing 34 ROIs per hemisphere. For ANTs, we used the parcellation labels from FreeSurfer (aparc+aseg) and for DL+DiReCT the labels from the DL model. For DL+DiReCT we additionally calculated complementary results by also using the parcellation labels from FreeSurfer instead of the DL model. All voxels constituting the inner boundary of the gray-matter segmentation were identified and assigned the label of the closest parcellation.
Voxels further away than an Euclidean distance of ffiffiffiffiffiffiffi 3:0 p voxel dimensions were masked in order to exclude deep gray-matter structures.
Within this defined region of interest, the average over all nonzero voxel from the thickness map was calculated.

| Data for evaluation
For evaluation, we used T1-weighted (T1w) MRI from two publicly available datasets: OASIS-3 (LaMontagne et al., 2018) and SIMON (Duchesne et al., 2019), yielding a total of 2,736 images. OASIS-3 contains cross-sectional and longitudinal samples from cognitively normal adults, as well as participants at various stages of dementia, as assessed by the Clinical Dementia Rating (CDR) (Morris, 1991

| Evaluation
We processed the MR images (2, where N is the number of sessions with re-scans, n(i) the number of re-scans in the session i for a subject, m (i, t) the measurement at time- Þ the within-session mean. To regress out the effects of brain size, age, sex, and scanner on cortical thickness, we fit a linear model (lm) to the thickness of the healthy controls with the normalized (zero-mean, unit SD) co-variates estimated total intracranial volume (eTIV; Buckner et al., 2004;from FreeSurfer) and age, and categorical variables sex and scanner model.
In agreement with Im et al. (2008) the co-variate sex was not significantly related to thickness and was subsequently removed. Likewise, the scanner had no significant effect after accounting for multiple comparisons (Mundfrom, Perrett, Schaffer, Piccone, & Roozeboom, 2006), resulting in a lm(thick eTIV + age) that was then applied to all samples. On these thickness measures corrected for brain size and age, we calculated the effect size using Cohen's d (Torchiano, 2019) to quantify group-wise differences between healthy controls (CDR = 0) and subjects with dementia (CDR > = 1).
Additionally, we quantified longitudinal annual cortical GM atrophy rates separately for the three OASIS-3 sub-cohorts (HC, CDR = 0.5, and dementia) by using subjects who had more than one scan at least 1 year apart and who did not change sub-cohort (e.g., from HC to dementia) in that interval. Atrophy rates between methods were compared with a paired t-test and a significance level α = .05. Statistical analyses were performed using R with the stats package version 3.6.2 (R Core Team, 2019).

| RESULTS
The deep learning-based anatomy segmentation, when compared to FreeSurfer, reached median Dice coefficients above .97 for WM and above .95 for cortical GM on both datasets. Detailed performance for the relevant structures is reported in Supplementary Table S1. The average runtimes per image for the three methods were 9.34h ± 2.68h for FreeSurfer, 12.68h ± 0.90h for ANTs, and 1.18h ± 0.17h for DL+DiReCT.

| Correlation with FreeSurfer
On the OASIS-3 dataset (n = 2,643 images), the global mean thickness of DL+DiReCT was Pearson correlated with FreeSurfer with r = .887 T A B L E 1 Demographic information for the two datasets used for the evaluation while the results of the same test for ANTs were r = .608. A visualization of the region-wise correlation coefficients can be seen in

| Robustness
The mean reproducibility errors are listed in Table 2, for the global mean thickness measure and as an average over all 68 ROIs. On both datasets, OASIS-3 (n = 761 sessions) and SIMON (n = 14), we observed similar errors for FreeSurfer and DL+DiReCT and significantly higher error for ANTs as can be seen in Figure 5 for the OASIS-3 and Figures S1 and S2

| Annual atrophy rates
Longitudinal annual atrophy rates for global mean cortical thickness are listed in Table 3 for the three sub-cohorts: healthy controls (n = 368 subjects), CDR 0.5 (n = 31) and Dementia (n = 7). Mean and standard deviation were consistently lowest in FreeSurfer, slightly but statistically not significantly higher with DL+DiReCT and substantially higher with ANTs, which is also visible in Figure S4.
Regional atrophy rates are depicted in Figure 6.  Figure S5 shows these changes in relation to the global atrophy rate to make regional differences better visible. An additional cross-sectional analysis is depicted in Figure S3.

| Group-wise differences
Group-wise differences between HC and dementia for the global mean thicknesses corrected for brain size and age were largest with comparably robust to, surface-based measurement as used by FreeSurfer. In the absence of a gold-standard ground truth for cortical thickness measures, we assessed the accuracy on two independent datasets by indirect means: correlation with FreeSurfer, robustness with a large number of re-scans, cross-sectional and longitudinal graymatter atrophy rates, and sensitivity to detect group-wise differences of healthy controls compared to individuals with dementia. We have used FreeSurfer as the silver-standard ground truth in this study since we consider it the most established tool in the field. However, these surface-based measures are not meant to be replicated as close as possible but serve as a good reference point to complement outcomebased evaluations like robustness, effect size, and plausibility of observations in the light of underlying biological processes.
In agreement with the large-scale evaluation by Tustison et al.
(2014) on OASIS-3, we found almost identical values of the Pearson correlation coefficients with FreeSurfer for ANTs (r = .45). We also observed the higher sensitivity of ANTs to detect age-related graymatter atrophy. However, we conclude this additional sensitivity comes at the cost of lower robustness (mean reproducibility error of 3.1%, that is, more than twice the value of FreeSurfer) due to inferior atlas-based segmentation. By replacing the atlas-based segmentation with a deep learning-based model in DL+DiReCT, correlation with Freesurfer thickness measures can be significantly increased (r = .72), while also achieving robustness (mean reproducibility error 1.3%) comparable to FreeSurfer (mean reproducibility error 1.4%). These observations are in concordance with the study of Clarkson et al.
(2011) which also reports generally lower robustness for ANTs and higher standard deviations for longitudinal atrophy rates compared to FreeSurfer.
The sensitivity and robustness of DL+DiReCT permits the measurement of (cf., Table 3 entorhinal cortex Thompson et al., 2003), supporting the hypothesis of disease onset in these regions (Atiya, Hyman, Albert, & Killiany, 2003;Braak & Braak, 1991;Thompson et al., 2003). The results also suggest relative sparing of the somato-  DL+DiReCT runs in about 1 hour, producing segmentations and a volumetric thickness map (see Figure 1) that allows a visual inspection of the result by humans, partially opening the black-box of fast deep learning-based morphometry methods (Rebsamen, Suter, Wiest, Reyes, & Rummel, 2020).

| Limitations
Minimally preprocessed (skull-stripping and resampling into FreeSurfer space) input data was used to facilitate a direct comparison of results using FreeSurfer parcellations. As the original data was already in 1 mm iso-voxel resolution, we are confident that this does not significantly influence the results. By using the same skullstripping for all methods, we can avoid side effects from different brain extraction techniques.
We have not tried other deep learning-based segmentation methods, since finding the best network architecture was not the focus of this study. Other methods likely yield similar results if the model achieves high accuracy even with a large number of labels required for the cortex parcellation.
The number of longitudinal samples in the confirmed dementia cohort was low (n = 7), limiting the power of statistical tests. However, the plausibility of the results is supported by the observed atrophy patterns suggesting a trajectory of the disease known from literature and the additional cross-sectional analysis (n = 209). For specific analysis of longitudinal changes, the robustness might be further improved with the dedicated longitudinal pipeline available in FreeSurfer (Reuter, Schmansky, Rosas, & Fischl, 2012) and ANTs (Tustison et al., 2019), none of which was used in the current study to facilitate a direct comparison of the methods.

| Outlook
We intend to continue optimizing the proposed method. Namely investigating whether applying DiReCT to segmentations with higher spatial resolution, either from 7T imaging or via super-resolution, increases the sensitivity while preserving the good robustness. While DL+DiReCT is already substantially faster than the frequently used methods FreeSurfer and ANTs, the computationally most expensive step of the current solution is the 3D registration: replacement by a (separate) deep learning-based registration is expected to reduce the total runtime to a few minutes (Dalca et al., 2019;Mok, 2020). Additionally, measures for the cortical curvature and surface area could be derived from the diffeomorphic model. A dedicated longitudinal mode for DL+DiReCT is conceivable by deriving thickness changes from registering the inner and outer surface of two or more time points, which would also allow a direct comparison to FreeSurfer's longitudinal pipeline. The current deep-learning model is trained to predict 96 different labels. Further increasing the number of labels with a more fine-grained atlas like Destrieux (Destrieux, Fischl, Dale, & Halgren, 2010) with its 74 parcellations per hemisphere would be of interest as it is used in many morphometry studies, and might reveal where the model's segmentation performance starts deteriorating.
Future work will also extend the evaluation to further applications where subtle changes of cortical thickness are of high clinical interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available at https://central.xnat.org/ for OASIS-3 and at https://doi.org/10. 15387/fcp_indi.retro.simon for SIMON.