Accuracy of automated amygdala MRI segmentation approaches in Huntington's disease in the IMAGE‐HD cohort

Abstract Smaller manually‐segmented amygdala volumes have been associated with poorer motor and cognitive function in Huntington's disease (HD). Manual segmentation is the gold standard in terms of accuracy; however, automated methods may be necessary in large samples. Automated segmentation accuracy has not been determined for the amygdala in HD. We aimed to determine which of three automated approaches would most accurately segment amygdalae in HD: FreeSurfer, FIRST, and ANTS nonlinear registration followed by FIRST segmentation. T1‐weighted images for the IMAGE‐HD cohort including 35 presymptomatic HD (pre‐HD), 36 symptomatic HD (symp‐HD), and 34 healthy controls were segmented using FreeSurfer and FIRST. For the third approach, images were nonlinearly registered to an MNI template using ANTS, then segmented using FIRST. All automated methods overestimated amygdala volumes compared with manual segmentation. Dice overlap scores, indicating segmentation accuracy, were not significantly different between automated approaches. Manually segmented volumes were most statistically differentiable between groups, followed by those segmented by FreeSurfer, then ANTS/FIRST. FIRST‐segmented volumes did not differ between groups. All automated methods produced a bias where volume overestimation was more severe for smaller amygdalae. This bias was subtle for FreeSurfer, but marked for FIRST, and moderate for ANTS/FIRST. Further, FreeSurfer introduced a hemispheric bias not evident with manual segmentation, producing larger right amygdalae by 8%. To assist choice of segmentation approach, we provide sample size estimation graphs based on sample size and other factors. If automated segmentation is employed in samples of the current size, FreeSurfer may effectively distinguish amygdala volume between controls and HD.


| INTRODUCTION
Changes in emotion processing in Huntington's disease (HD) typically manifest in difficulties in recognizing facial expressions, and are part of a range of cognitive, psychiatric, and motor symptoms observed in this disorder (Bates et al., 2015;Henley et al., 2012;Papoutsi, Labuschagne, Tabrizi, & Stout, 2014;Paulsen, Ready, Hamilton, Mega, & Cummings, 2001;Stout et al., 2011). The atrophy seen in HD occurs in a spatiotemporally specific pattern (Fonteijn et al., 2012;Rosas et al., 2008), with some atrophy detectable during the presymptomatic (e.g., Bates et al., 2015;Ross et al., 2014) phase of the disease, that is, well before diagnosable signs and symptoms are present (Aylward et al., 2004;Paulsen, 2010). Neuroimaging studies in HD often focus on characterizing atrophy across stages of the condition, and clarifying relationships between regional atrophy and other symptoms. The amygdala has recently received increased attention in HD research because of its role in emotion processing deficits (e.g., Kipps, Duggins, McCusker, & Calder, 2007;Mason et al., 2015).
In terms of MRI segmentation accuracy in general, manual tracing is regarded as the "gold standard"; however, manual segmentation is often prohibitively time consuming and is rarely feasible in large cohort MRI studies, which are common in HD research (e.g., Hammers et al., 2003; Heckemann, Hajnal, Aljabar, Rueckert, & Hammers, 2006). Automated segmentation methods are therefore essential.
Amygdala-specific segmentation tools have also been developed (Collins & Pruessner, 2010;Hanson et al., 2012;Saygin et al., 2017) though some are not publicly available (Collins & Pruessner, 2010;Hanson et al., 2012). In HD, many volumetric studies have used FreeSurfer or FIRST, which label multiple parcellated regions throughout the brain. Thus, we have focused on these widely used tools. The accuracy of FreeSurfer and FIRST has been previously compared with reference to gold standard manual segmentation in normal and clinical populations, and in different subcortical brain regions (Doring et al., 2011;Merkel et al., 2015;Morey et al., 2009;Mulder et al., 2014;Pardoe, Pell, Abbott, & Jackson, 2009;Perlaki et al., 2017;Schoemaker et al., 2016). Results have been mixed, and vary based on sample and brain region. With regards to the amygdala specifically, Morey et al. (2009) found that FreeSurfer performed better on some measures of accuracy in healthy adults and in a small sample (n = 9) of individuals with major depressive disorder. Schoemaker et al. (2016) found mixed results in preadolescent children, and suggested that segmentations derived via both FreeSurfer and FIRST may require manual corrections. These results, however, are not generalizable to HD, which has a unique neuropathological basis. The atrophy in amygdala and surrounding structures that occurs during the course of HD (Ahveninen, Stout, Georgiou-Karistianis, Lorenzetti, & Glikmann-Johnston, 2018), may influence the accuracy of amygdala segmentation. It is thus imperative to determine which of these pipelines is most appropriate for this clinical cohort.
Both FreeSurfer and FIRST pipelines implement registration and segmentation routines, and utilize Bayesian approaches to fit models that draw upon manually labeled training sets. There are many aspects of the processing pipelines that differ between the two tools, including the type of model used. Another point of difference is the registration approach used, and we focused on this aspect in the current article.
FreeSurfer's subcortical pipeline performs initial affine registration to the MNI 305 template (Evans, 1992), initial labeling, bias correction, then nonlinear registration to the MNI 305 template, which deforms the target image so it can match the template as closely as possible (Fischl et al., 2004). FreeSurfer uses a model that incorporates anisotropic nonstationary Markov Random Fields to fit labels based on intensity as well as spatial location relative to neighboring structures. In comparison, FIRST performs an affine-only registration to the MNI152 nonlinear 1 mm template (Fonov et al., 2011) using FLIRT, and transforms the model to native space in order to capitalize on intensity information in the noninterpolated image. The model employed in FIRST is a Bayesian Appearance Model, which fits deformable shape meshes based on conditional probability of shape and intensity information (Patenaude et al., 2011). FIRST's use of linear transformations rather than nonlinear warping restricts how closely structures in a training set can be mapped onto those in a target image during the registration step. However, this is overcome by the Bayesian framework allowing shape meshes to deform beyond the shapes existing in the training set in order to match the target more closely (Patenaude et al., 2011). Considering the abnormal amygdala size seen in HD (Ahveninen et al., 2018), we were interested to determine whether segmentations performed by FIRST may be improved by performing initial nonlinear warping of the data to template space.
In the current article, we utilized the Australian-based IMAGE-HD cohort (including 35 pre-HD, 36 symp-HD, and 35 healthy controls), for which manual amygdala segmentation had been performed by Ahveninen et al. (2018). We aimed to identify the accuracy with which three automated segmentation approaches would segment the amygdala for this sample, by comparing the output of each pipeline with the manual segmentation, thereby identifying which is most appropriate for use in HD. We also aimed to provide estimates of sample sizes required to produce amygdala volumes that are statistically differentiable between HD and controls, and between pre-HD and symp-HD, for each automated approach. The automated approaches tested were FreeSurfer, the complete default FIRST pipeline, and FIRST's segmentation algorithm applied to whole-head images bias corrected and nonlinearly transformed into MNI space using ANTS.

| Participants
The sample comprised 106 participants aged 23 to 72 years from the IMAGE-HD study (Domínguez et al., 2013, 2016; Georgiou-Karistianis et al., 2013). These included 35 healthy controls, 35 presymptomatic huntingtin gene expansion carriers who had not developed motor symptoms at the time of scanning (termed 'pre-HD'), and 36 individuals with early stage symptomatic HD ('symp-HD'). One control participant was excluded due to failed MRI labeling via FIRST (described further in Section 2.3.2), resulting in 34 controls. HD participants were genetically confirmed to have the huntingtin gene expansion (≥38 CAG repeats), and were between 23 and 70 years of age, with no history of major neurological illness (except HD), significant head injury, or non-HD-related psychiatric disturbances. Participants with a Unified Huntington's Disease Rating Scale (UHDRS) total motor score (TMS) ≤ 5 were included in the pre-HD group, and those with a UHDRS TMS above 5 were included in the symp-HD group (Domínguez et al., 2013). Participants with pre-HD had UHDRS diagnostic confidence levels of less than four, indicating that they had not received the HD diagnosis (Huntington Study Group, 1996). Participants with symp-HD had Stage 1 or Stage 2 HD.
Groups significantly differed in terms of age (F[2,102] = 8.701, p < .001), with the symp-HD group being older than the pre-HD group (p = .001), as is typically observed given the progressive nature of HD. The symp-HD group was also older than the control group (p = .003). We chose to retain all participants rather than using subsets of closer age in order to account for the progressive brain atrophy that is a fundamental characteristic of HD, and becomes more severe with older age. In doing so, we accept that there will be some proportion of atrophy in the symp-HD group attributable to normal ageing. Pre-HD and control groups did not differ in terms of age (p = .890). See Table 1

| Automatic segmentation
FreeSurfer
T1-weighted images were input into the default pipeline of FreeSurfer 6.0, using the 'recon-all' command. The amygdalae were isolated from the resulting 'aseg' image.

FIRST
FIRST (Patenaude et al., 2011) was run using the 'run_first_all' script, which implements automatic registration to the MNI nonlinear 1 mm template (Fonov et al., 2011), and segmentation. We considered segmentation to have failed for one participant in the control group, due to poor registration resulting in the amygdala label being placed too dorsally, partially overlying basal ganglia structures. Rerunning registration using the 'first_flirt' command with different parameters did not improve the registration. Thus, we excluded segmentation for this subject from the dataset. The 34 control individuals listed as participants are those for whom FIRST segmentation was completed successfully, from an initial group of 35 controls.

TABLE 1 Demographic information for participants in HD groups (reproduced from Ahveninen et al., 2018), and for the subset of controls with successful segmentation for all methods

ANTS/FIRST
T1-weighted images were bias corrected using the 'N4BiasFieldCorrection' (Tustison et al., 2010) tool in ANTS. Images were then registered to the MNI 1 mm nonlinear template using ANTS, with the 'antsRegistration' and 'antsApplyTransforms' commands. An affine transformation was first performed, followed by a nonlinear transformation using symmetric diffeomorphic normalisation (SyN) with cross-correlation as the similarity metric. Segmentation was then run on the bias corrected, nonlinearly registered T1 images in MNI space using FIRST's 'run_first' script, using an identity matrix as the input transformation matrix.
FIRST's pipeline also includes bias correction, and we acknowledge that the images thus underwent bias correction multiple times for this

| Statistical analysis
We evaluated the accuracy of the automated amygdala segmentation approaches by: (a) computing Dice overlap scores between automated and manual segmentations as a measure of automated segmentation accuracy; (b) determining whether amygdala volume differences between groups detected for manually segmented amygdala, could be detected using automated methods; (c) producing Bland-Altman plots to indicate estimation bias based on amygdala size; (d) comparing volumes of left and right amygdala to indicate potential hemispheric bias in volumes produced by automated methods; and (e) producing sample size estimation graphs, to provide an indication of relative sample sizes required to detect group differences in amygdala volume for each method. Statistical analyses were performed using R (R Core Team, 2018).

| Deviations from normality
All amygdala volumes were compared in native space. We tested amygdala volumes and Dice scores for normality, skew, and kurtosis, and found that roughly one-sixth of the data were not normally distributed. Outliers were also present. We did not transform the data or correct outliers because we wanted to depict the observed distributions of volumes provided by each method as accurately as possible.
Due to these violations of normality, we employed nonparametric statistics in all statistical comparisons.

| Dice overlap scores
Dice scores (Dice, 1945) are used to indicate the accuracy of segmentation with reference to a "true" segmentation (in this case, manual segmentation), by measuring the proportion of overlap between segmentations. Dice scores range between 0 (no overlap) and 1 (complete overlap). We obtained Dice scores using the 'overlap' function in Convert3D (www.itksnap.org/c3d/).
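As an illustration of the overlap measure described above, the following minimal sketch computes a Dice score as twice the number of shared voxels divided by the total number of labeled voxels in the two masks. It is not the Convert3D implementation used in this study; the function name `dice_score` and the flattened 0/1 mask representation are assumptions made for the example.

```python
def dice_score(manual, automated):
    """Dice overlap between two binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration only; real label images would
    first be flattened from 3D volumes on the same voxel grid.
    """
    if len(manual) != len(automated):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled in both masks
    overlap = sum(1 for m, a in zip(manual, automated) if m and a)
    # Total labeled voxels across the two masks
    total = sum(1 for m in manual if m) + sum(1 for a in automated if a)
    if total == 0:
        raise ValueError("both masks are empty; Dice is undefined")
    return 2.0 * overlap / total


# Toy example: the automated label fully contains the smaller manual label
manual_mask = [0, 1, 1, 1, 0, 0, 0, 0]
automated_mask = [0, 1, 1, 1, 1, 1, 1, 0]
print(dice_score(manual_mask, automated_mask))
```

In the toy example the automated label covers the whole manual label yet the score falls well below 1, which shows how volume overestimation alone lowers Dice overlap.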

| Statistical tests of differences in volumes and Dice scores between and within groups
Differences between pairs of measurements within subjects, such as comparisons of left and right hemisphere volumes, were tested using Wilcoxon signed-rank tests (using the 'wilcox.test' function in R). We calculated standardized effect sizes (denoted by r) for within subject differences using r = z/sqrt(N), where N is the number of observations (Field, 2013).
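The effect size calculation can be sketched as follows. This is an illustrative approximation rather than the R code used in the study: the z statistic is obtained from the normal approximation to the signed-rank statistic without tie or continuity corrections, so it may differ slightly from the value underlying 'wilcox.test'. The function names, the toy volumes, and the choice of N as the total number of measurements are assumptions.

```python
import math


def signed_rank_z(x, y):
    """z statistic for a Wilcoxon signed-rank test (normal approximation).

    Illustrative only: no continuity correction and no tie correction
    to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w


def effect_size_r(z, n_observations):
    """Standardized effect size r = z / sqrt(N)."""
    return z / math.sqrt(n_observations)


# Hypothetical left and right volumes (mm^3) for six participants
left = [1500, 1600, 1550, 1700, 1650, 1580]
right = [1620, 1690, 1600, 1850, 1700, 1700]
z = signed_rank_z(left, right)
# Assumption: N counts every measurement, i.e. two per participant
r = effect_size_r(z, n_observations=len(left) + len(right))
print(z, r)
```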

| Plots of amygdala volumes
We generated scatterplots of volumes between segmentation techniques. Bland-Altman plots (Bland & Altman, 1986), which indicate the difference in estimation between two methods (i.e., between manual and automated segmentation here), were generated to provide an indication of possible bias in volume estimation. Similarly to the approach of Schoemaker et al. (2016), we used the manually segmented amygdala volumes on the X-axis (see also Krouwer, 2008, for justification of this method), and included regression lines to assist with interpretation.
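The quantities underlying such a plot can be sketched as below: the difference between automated and manual volume is computed per participant and regressed on the manual volume, so that a negative slope indicates more severe overestimation for smaller amygdalae. This is a minimal sketch under stated assumptions (ordinary least squares, hypothetical function name and toy volumes), not the plotting code used for the figures.

```python
def bland_altman_bias(manual, automated):
    """Return (mean difference, slope, intercept) for a Bland-Altman style
    analysis with the manual (reference) volume on the X-axis.

    Differences are automated minus manual; the slope and intercept come
    from an ordinary least-squares fit of difference on manual volume.
    """
    n = len(manual)
    if n != len(automated) or n < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - m for m, a in zip(manual, automated)]
    mean_x = sum(manual) / n
    mean_d = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in manual)
    if sxx == 0:
        raise ValueError("manual volumes are all identical")
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(manual, diffs))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_x
    return mean_d, slope, intercept


# Toy volumes in mm^3: overestimation shrinks as the manual volume grows
manual_volumes = [1000, 1200, 1400, 1600]
automated_volumes = [1900, 2000, 2100, 2200]
mean_diff, slope, intercept = bland_altman_bias(manual_volumes, automated_volumes)
print(mean_diff, slope, intercept)
```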

| Sample size estimation
Sample size estimation was performed using the R package 'pwr' and was based on two-tailed, two-sample t-tests. We computed estimates for the size per group of the sample for a range of effect sizes, expressed as amygdala volume difference in mm³, for each segmentation method.

| RESULTS

| Amygdala volumes and volume differences between segmentation methods
All automated methods overestimated amygdala volumes when compared to manually segmented volumes. Visual inspection indicated that amygdala segmentations extended further anteriorly for all automated methods compared with manual segmentations (Figure 1). This difference was most marked for ANTS/FIRST, followed by FreeSurfer, then FIRST (Figure 2 and Table 2). Density plots of amygdala volumes are provided in Figure S1.

| Dice coefficients based on segmentation type
We sought to examine the extent of overlap between amygdala segmentations produced manually and those produced automatically, by calculating Dice scores (Dice, 1945).

Abbreviations: '% Diff' = percentage difference in amygdala volumes between manual segmentations and automated segmentations (SD in parentheses); 'All' = data for all groups combined; 'Con' = controls; 'Hem' = hemisphere; 'Pre' = presymptomatic HD; 'Symp' = symptomatic HD.

| Differences in amygdala volume between groups detected for volumes derived via each segmentation method
Next, we examined the extent to which group differences in amygdala volumes (controls vs. pre-HD, controls vs. symp-HD, and pre-HD vs. symp-HD) were found for different segmentation methods, using Wilcoxon rank sum tests (FDR-corrected for multiple comparisons).
Manual segmentation provided volumes that allowed groups to be most successfully differentiated. Significant differences in manually segmented volumes were found between controls and pre-HD, and controls and symp-HD, for both the right and left amygdala. See Table S1.
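The text does not state which FDR procedure was applied to the group comparisons. As a sketch only, and assuming the widely used Benjamini-Hochberg step-up procedure, adjusted p values for a family of comparisons could be obtained as follows; the function name and the example p values are hypothetical and do not correspond to results from this study.

```python
def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    Assumption: this procedure stands in for the unspecified FDR
    correction mentioned in the text.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p values for three pairwise group comparisons
example_p = [0.01, 0.04, 0.03]
adjusted_p = fdr_bh(example_p)
print(adjusted_p)
```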

| Associations between manually and automatically segmented volumes, and assessment of estimation bias
Intraclass correlation coefficients (i.e., the appropriate measure of correlation for measurements within subjects) could not be appropriately computed because the data were not normally distributed. We therefore plotted the data to illustrate the associations between automatically and manually segmented amygdala volumes for each segmentation method (Figure 4). From visual inspection, these associations appear strongest for FreeSurfer, then ANTS/FIRST, and weakest for FIRST. For FIRST, the regression line between automated and manually segmented volumes appears to differ between groups, particularly in the right hemisphere where the intercept of the regression line for controls appears lower than that for symp-HD. The relationship between manually and automatically segmented volumes for each automated method was explored further using Bland-Altman plots (Figure 5), which are used to indicate estimation bias.
The negative slopes of the regression lines in each panel indicate that all automated segmentation approaches produced an estimation bias: overestimation of volumes was more severe for smaller amygdalae, and less severe for larger amygdalae. This bias was relatively small for FreeSurfer, though quite marked for FIRST, and somewhat reduced for ANTS/FIRST compared to that for FIRST.
FIGURE 3 Amygdala volumes in mm³ based on segmentation type in controls (red plots), pre-HD (green plots), and symp-HD (blue plots). Left panel: left amygdala. Right panel: right amygdala. Boxplot center hinge indicates median, and bottom and top hinges indicate 25th and 75th percentiles. Whiskers extend to the furthest value within ±1.5 × the interquartile range. Outliers (outside of ±1.5 × the interquartile range) are indicated by red asterisks. *Wilcoxon rank sum test indicated a significant difference in volume between groups with p < .05 (FDR corrected), **p < .01, ***p < .001

| Right versus left amygdala volume comparisons within segmentation techniques
We compared left amygdala volumes with right volumes within each segmentation type for all data, and then for each HD group. Wilcoxon signed rank tests indicated amygdala volumes segmented using FreeSurfer were statistically significantly larger in the right hemisphere than the left hemisphere, for all groups combined (p < .001, FDR corrected) and for each group separately (all p < .01). Complete statistics for these comparisons are listed in Table S2. Right amygdala volumes were on average 7.6% larger than left (5.5% for controls, 6.4% for pre-HD, and 9.4% for symp-HD). No hemispheric volume differences were found for FIRST, ANTS/FIRST, or manual segmentations.

| Sample size estimation
Sample size estimates for detection of amygdala volume differences between groups, for a range of effect sizes (indicated by difference in amygdala volume between groups in mm³), are shown in Figure 6. Graphs in Figure 6 are based on statistics for left hemisphere amygdala volumes, as described in Section 2.4.5. Graphs based on statistical comparisons of right hemisphere volumes are provided in Figure S2.

FIGURE 4 Scatterplots of automated versus manual amygdala volume in mm³ for each segmentation approach, with regression lines based on linear models for controls (red plots), pre-HD (green plots) and symp-HD (blue plots). Shaded areas are 95% confidence intervals. Top row: left amygdala. Bottom row: right amygdala
Sample size estimation indicated that for all comparisons, the sample size required to detect amygdala volume differences between groups was smallest for manual segmentation, and largest for FIRST.

FIGURE 6 Sample size estimation graphs for left hemisphere amygdala volumes (right hemisphere: Figure S2). Vertical black lines indicate observed mean amygdala volume difference between groups in mm³. The Y-axis limits are sample sizes comprising between n = 0 and n = 125 per group. This upper limit was chosen for clarity of visualization

… more anteriorly than the boundary specified by Velakoulis and colleagues, resulting in amygdala segmentations of the current data being more extensive anteriorly. However, this cannot be confirmed on the basis of the provided protocols (see Supporting Information for complete protocol descriptions).
A potential caveat relating to manual segmentation in HD is that atrophy in the amygdala, surrounding or widespread regions may be visibly noted on the scans. This may compromise the blinding of those performing manual segmentation to participants' group membership, which could potentially lead to a systematic bias in amygdala volume between groups. Although this possibility cannot be eliminated for the current data, the tracing protocol provides clear anatomical boundaries, and inter-rater reliability was found to be high, so we are reasonably confident in the accuracy of the segmentations in the presence of atrophy.
Quantitatively, we indicated accuracy of automated segmentations by calculating Dice scores, which represent the proportion of overlap between label images produced by manual and automated segmentation. Dice scores ranged between 0.6 and 0.65, and did not statistically differ between automated segmentation approaches.
These unimpressive scores are not surprising considering that: (a) the amygdala is a challenging structure to segment, so high automated labeling accuracy would not be expected; and (b) the automated methods greatly overestimated amygdala volume, thus the proportion of overlap between a given automated (large) label, and the corresponding manual (small) label, would be small because much of the automated label extends outside of the manual label. Accordingly, the more extensive volume overestimation produced by FreeSurfer and ANTS/FIRST, compared to that for FIRST, may have also reduced the average Dice scores for these approaches compared to that for FIRST. Since no automated technique produced segmentations that markedly altered Dice scores, this metric may not be the most useful indicator of segmentation accuracy for these data.
In the context of clinical studies, the ability to accurately detect existing volume differences between HD groups and controls, and between pre-and symp-HD, may be the most useful criterion for assessing which segmentation approach to employ. Here, manual segmentation produced volumes that were most easily differentiated between groups, with controls readily differentiable from both symp-HD and pre-HD in both the left and right amygdala. Manual segmentation also produced right amygdala volumes that were statistically different between pre-HD and symp-HD when uncorrected for multiple comparisons, but not with FDR correction. Other methods did not differentiate amygdala volume in pre-HD and symp-HD. Therefore, in studies where this distinction is important, manual segmentation and a slightly larger sample size may be necessary. Furthermore, in order to more closely characterize where in the amygdala volume differences occurred between groups or time points, the use of shape analysis may be beneficial. FreeSurfer was second most effective at differentiating amygdala volumes between groups, producing segmentations that could distinguish controls from either of the HD groups in left amygdala, and could distinguish controls from symp-HD in the right amygdala. Where manual segmentation is not feasible, our findings indicate that FreeSurfer is the next most effective method at producing amygdala volumes that preserve differences between groups. FIRST produced volumes that were not statistically different between groups, so we do not recommend using FIRST for segmentation of amygdala in HD in samples of the current size. Incorporating ANTS nonlinear registration with FIRST segmentation only slightly improved the ability to detect differences in amygdala volumes between groups, resulting in a volume difference between controls and symp-HD only in left amygdala.
The unfavorable results for FIRST may be relevant for the inter- where the average differences between manual and FIRST-derived volumes (as per Table 2) for this group were 96% (left amygdala) and 93% (right amygdala). By contrast for pre-HD differences were 89% (left) and 88% (right), and for controls 61% (left) and 60% (right). This bias appears to be a major factor contributing to the inability to detect differences in volumes between groups for segmentations produced with FIRST.
In terms of the methodological mechanism of this bias, speculatively, it is possible that FIRST's model could not accurately conform amygdala meshes to amygdalae that were abnormally small due to atrophy. The Bayesian modeling approach employed in the default FIRST pipeline allows shape meshes of each structure to deform further than the boundaries of the structures in the training set, to fit the observed anatomy (Patenaude et al., 2011). Feng et al. (2017) suggested that, particularly in cases of brain abnormality, the use of linear rather than nonlinear registration in the initial steps of FIRST's pipeline could result in a structure in the model being inaccurately aligned with the same structure in the target data, in ways that the mesh deformation cannot fully correct for. Feng and colleagues improved segmentation accuracy with FIRST by incorporating initial nonlinear transformations and additional quantitative susceptibility mapping data. In this study, we performed initial ANTS nonlinear registration to the MNI template in an attempt to reduce the distance between the model and the underlying (albeit bias corrected, nonlinearly transformed, and resampled) anatomy. This step appears to have reduced some of the differences apparent between manually segmented volumes and those produced by FIRST, as can be seen in Figure 4. However, any improvement in mesh fitting conferred by this nonlinear registration did not prevent significant overall volume overestimation. The bias toward more severe overestimation of smaller amygdala also remained, but was somewhat reduced. This reduction in bias, in turn, may have led to slight improvement in ability of ANTS/FIRST to differentiate between groups. As mentioned by Perlaki et al. (2017), continued evaluation of FreeSurfer and FIRST in future will be useful, as they are actively developed. Amygdala-specific segmentation techniques such as those by Collins and Pruessner (2010), Hanson et al. (2012), and Saygin et al. (2017) should also be evaluated in HD, and may provide promising alternatives in HD studies investigating amygdala structure.
FreeSurfer segmentations contained an additional bias where right amygdala segmentations were larger than left amygdala segmentations. This bias was seen in control participants as well as in the HD groups, indicating that this result was not indicative of lateralized atrophy. We did not find a lateralized volume bias for any of the other segmentation methods, including manual segmentation. This suggests that it is unlikely to be due to a genuine volume difference, which manual segmentation should have detected. With respect to lateralization in HD, although there are isolated reports of left lateralized atrophy in the striatum (Minkova et al., 2017; Mühlau et al., 2007) and cortex (Rosas et al., 2002), HD is not considered a lateralized disorder and there is no strong evidence for a hemispheric bias in HD neurodegeneration. The bias is also unlikely to be due to image artifact, unless a subtle artifact was present that solely affected segmentation by FreeSurfer.
Another factor to consider in choosing the most appropriate segmentation method is sample size, which affects statistical power. For the current data, sample size estimation indicated that in order to reproduce the statistical differences that were observed in manually segmented amygdala volume between groups using automated methods, group sizes would need to be substantially larger. These differences were less pronounced for large effect sizes, such as the notable volume differences between symp-HD and controls, which reflects the more advanced atrophy in symp-HD. By contrast, the difference in sample size required to differentiate amygdala volumes between groups was particularly marked for comparisons between pre-HD and symp-HD. If p = .05 and power = 0.8 were assumed, sample sizes over 10 times larger would be required for automated methods.
The substantially larger sample size required for comparison between different stages of HD reflects the subtle changes in amygdala atrophy that occur as the disease progresses from presymptomatic into symptomatic stage (Ahveninen et al., 2018).
Sample sizes required to delineate between controls and symp-HD were smaller for ANTS/FIRST than for FreeSurfer. This appears unintuitive considering that FreeSurfer was able to statistically differentiate amygdala volumes between these groups more effectively than ANTS/FIRST. Interpretation of this discrepancy may be assisted by considering the variances of each subset of the data presented in Figure S1. This figure illustrates that automated methods incur greater variance and irregular distributions due to labeling errors, which may then affect statistical comparisons. We calculated sample sizes using parametric statistics. We had established that although the majority of subsets of volume measurements in the current data were normally distributed, one sixth were not. Therefore, sample sizes we provide here should be interpreted as approximate indications rather than prescriptive.
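To make the sample size reasoning concrete, the sketch below converts a volume difference in mm³ into a standardized effect size and estimates the per-group sample size for a two-tailed, two-sample comparison. It uses the normal approximation rather than the noncentral t calculation performed by the 'pwr' package, so it tends to give slightly smaller values; the function name, the pooled standard deviation input, and the example numbers are assumptions for illustration and are not taken from the study data.

```python
import math
from statistics import NormalDist


def n_per_group(volume_diff_mm3, pooled_sd_mm3, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-tailed two-sample test.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the volume difference divided by the pooled SD.
    """
    if pooled_sd_mm3 <= 0:
        raise ValueError("pooled SD must be positive")
    d = abs(volume_diff_mm3) / pooled_sd_mm3
    if d == 0:
        raise ValueError("volume difference must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical example: the same SD with a large and a small group difference
print(n_per_group(volume_diff_mm3=200, pooled_sd_mm3=200))
print(n_per_group(volume_diff_mm3=100, pooled_sd_mm3=200))
```

Because the required n scales with the inverse square of the effect size, a method that halves the detectable standardized difference roughly quadruples the sample needed, and greater variance in automated volumes inflates the requirement through the pooled SD in the same way.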

| Summary and conclusions
The current study utilized a large and balanced sample of individuals with HD and controls, for which manual segmentation of amygdala was performed. Manual segmentation provided gold standard volumetric data against which to assess existing automated segmentation protocols, and one experimental method. We found that manual segmentation is the optimal method of amygdala segmentation in HD, producing volumes that were most easily differentiated between groups. Manual segmentation may be necessary in studies aimed at detecting amygdala volume differences between individuals with pre-HD and symp-HD, though a slightly larger sample size may be needed.
FreeSurfer performed better than other automated methods on some measures and may constitute a favorable automated alternative. However, the introduction of a potential hemispheric bias in volume estimation may be problematic in studies investigating lateralization of amygdala volume change in HD. FIRST produced volumes that were closer in absolute volume to manual segmentations, but more strongly overestimated the volume of smaller amygdalae, and performed poorly in terms of differentiating amygdala volume between groups.
Performing initial ANTS nonlinear registration with FIRST only somewhat improved accuracy compared to FIRST alone. When choosing segmentation methods for the amygdala in HD, options should be considered in context of the aim of the analysis, and the sample size available. The current data provide information to this end, and may also be informative in interpreting existing volumetric findings regarding the amygdala in HD.

ACKNOWLEDGMENTS
We acknowledge the contribution of all the participants who took part in the IMAGE-HD study. We thank the Royal Children's Hospital

DATA AVAILABILITY STATEMENT
The derived data that support the findings of this study are available on request from the corresponding author. The MRI data are not publicly available due to ethical restrictions.