Erratum: Exploring the impact of analysis software on task fMRI results

For each study we reproduced the main group-level effect from each publication using each of the three main fMRI software packages: AFNI, FSL, and SPM. We then applied a variety of quantitative and qualitative comparison methods to assess the similarity of the statistical maps between the three software packages.
From further work we are currently carrying out on this topic, we have recently become aware that five out of the 14 analysis results used in the article contained errors.
The first two of these results were our AFNI nonparametric reanalyses of the Schonberg et al., 2012 (ds000001) and Moran et al., 2012 (ds000109) datasets. In both cases, the wrong sub-bricks of the 4D subject-level results files had been specified in the group-level permutation test model, meaning that permutation tests were carried out on subject-level statistic images rather than on subject-level beta images, as was intended.
A similar problem was also found for our AFNI parametric analysis of the Padmanabhan et al., 2011 (ds000120) dataset, where again the subject-level statistic images were wrongly entered into the group-level mixed-effects analysis rather than the intended beta images.
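The sub-brick mix-up described above can be illustrated with a minimal sketch. It assumes a hypothetical layout in which a subject-level results file is represented as a list of labelled volumes, one holding the beta (effect estimate) image and one holding the statistic image. The labels, the helper name `select_subbrick`, and the toy voxel values are illustrative assumptions, not the contents of the actual results files.

```python
# Hypothetical sketch: choosing a labelled volume ("sub-brick") from a
# subject-level results file, modelled here as (label, volume) pairs.
# Labels and voxel values are made up for illustration only.

def select_subbrick(subbricks, label):
    """Return the single volume whose label matches exactly, else raise KeyError."""
    matches = [volume for name, volume in subbricks if name == label]
    if len(matches) != 1:
        raise KeyError(f"expected one sub-brick labelled {label!r}, found {len(matches)}")
    return matches[0]

subject_results = [
    ("effect#0_Coef", [0.8, -0.2, 1.5]),   # beta image (one value per voxel)
    ("effect#0_Tstat", [2.9, -0.7, 4.1]),  # statistic image for the same effect
]

# The group-level model is meant to receive the beta image, not the statistic image.
beta_image = select_subbrick(subject_results, "effect#0_Coef")
stat_image = select_subbrick(subject_results, "effect#0_Tstat")
```

Selecting volumes by an explicit label rather than by a bare numeric index is one way to make this kind of mismatch easier to spot, since the label states which quantity is being passed to the group-level model.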
The final two results were our FSL parametric and nonparametric reanalyses of the Moran et al., 2012 (ds000109) dataset. In both cases, the linear model contrast had been incorrectly specified; the FSL parametric and permutation results in the article were for the False Photo Question vs False Belief Question contrast rather than the intended False Photo Story vs False Belief Story contrast used in the original publication.
We have now corrected and reanalysed these five sets of results, and updated all of the comparison figures in the article accordingly.
For the nonparametric AFNI analyses, the corrected results have led to only minor changes in the quantitative comparisons that were originally reported. Perhaps most notably, the within-software Dice coefficients comparing the thresholded parametric and nonparametric results obtained within AFNI for both the ds000001 and ds000109 studies are now slightly worse in light of these new results. This is likely to be because the use of statistic images (instead of betas or contrasts) in the permutation group model mimicked a type of mixed-effects inference more similar to the parametric mixed-effects analysis. For ds000001, the AFNI parametric/nonparametric Dice coefficient has decreased from 0.833 to 0.700, and for ds000109, the corresponding Dice coefficient has decreased from 0.899 to 0.819.
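For readers less familiar with the Dice coefficient, the following is a minimal sketch of how it can be computed for two thresholded (binary) maps. Following the description in the Figure 4 caption, the overlap is restricted to the intersection of the two analysis masks, and the "spill over" is the percentage of one map's activation that falls outside the other's mask. The function names, the flat-list representation of images, and the convention of returning zero when neither map has any activation are assumptions made for illustration, not the code used in the article.

```python
# Minimal sketch, assuming each image is a flat list of 0/1 voxel values.

def dice(map_a, map_b, mask_a, mask_b):
    """Dice coefficient of two binary maps within the intersection of their masks."""
    shared = [bool(ma) and bool(mb) for ma, mb in zip(mask_a, mask_b)]
    a_in = [bool(a) and s for a, s in zip(map_a, shared)]
    b_in = [bool(b) and s for b, s in zip(map_b, shared)]
    overlap = sum(a and b for a, b in zip(a_in, b_in))
    total = sum(a_in) + sum(b_in)
    # Assumed convention: no activation in either map gives a Dice of zero.
    return 2 * overlap / total if total else 0.0

def spill_over(map_self, mask_other):
    """Percentage of activated voxels in map_self lying outside mask_other."""
    active = sum(bool(v) for v in map_self)
    outside = sum(bool(v) and not bool(m) for v, m in zip(map_self, mask_other))
    return 100.0 * outside / active if active else 0.0

# Toy example with six voxels.
map_a = [1, 1, 1, 0, 0, 1]
map_b = [0, 1, 1, 1, 0, 0]
mask_a = [1, 1, 1, 1, 1, 1]
mask_b = [1, 1, 1, 1, 1, 0]

dice_ab = dice(map_a, map_b, mask_a, mask_b)
spill_a = spill_over(map_a, mask_b)
spill_b = spill_over(map_b, mask_a)
```

In this toy example two of the three activated voxels in each map coincide inside the shared mask, and one of the four activated voxels in `map_a` lies outside `mask_b`.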
A similar conclusion also holds for our corrected parametric ds000120 AFNI analysis results, which are now less similar to the corresponding set of results obtained in SPM compared to our original findings. Here, the AFNI-SPM Dice coefficient has dropped from 0.684 to 0.545, and the AFNI-SPM correlation coefficient for the threshold maps has dropped from 0.748 to 0.650.
Camille Maumet and Thomas E. Nichols contributed equally to this study.
For the parametric and nonparametric FSL ds000109 analyses, our corrected analyses have led to notable improvements in the ds000109 AFNI-FSL and FSL-SPM inter-software comparisons, in the form of higher correlations between the unthresholded statistical maps and larger Dice coefficients for comparisons of the thresholded statistical maps. Specifically, correlations now range from 0.429 to 0.870 (was 0.429 to 0.747), and Dice coefficients range from 0 to 0.769 (was 0 to 0.684) for between-software comparisons.
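The correlations quoted above compare unthresholded statistic maps between packages. As a minimal sketch, a Pearson correlation over voxels can be computed as below. The flat-list representation, the function name `pearson`, and the toy values are assumptions for illustration, and the sketch presumes both maps have already been restricted to the same set of voxels.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of voxel values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant map")
    return sxy / math.sqrt(sxx * syy)

# Toy T-statistic values for the same four voxels in two hypothetical packages.
t_package_1 = [1.0, 2.0, 3.0, 4.0]
t_package_2 = [2.0, 4.0, 6.0, 8.0]
r_same = pearson(t_package_1, t_package_2)
r_flipped = pearson(t_package_1, list(reversed(t_package_2)))
```

Because the correlation is invariant to linear rescaling, two maps can correlate highly while still differing in the magnitude of their statistics, which is why the article pairs this measure with comparisons of thresholded maps.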
Importantly, the main conclusion from the article, that is, that weak effects may not generalise across fMRI software, has not changed in light of these corrections. Since finding these errors, we have investigated and verified all other models as correct and consistent across the three studies and software packages.

Abstract p.3362 "However, we also discover marked differences, such as Dice similarity coefficients ranging from 0.000 to 0.684 in comparisons of thresholded statistic maps between software."

Should instead appear as:
"However, we also discover marked differences, such as Dice similarity coefficients ranging from 0.000 to 0.769 in comparisons of thresholded statistic maps between software."

"Pairwise correlations ranged from 0.429 to 0.747 for intersoftware comparisons (Table 3)."

Should instead appear as:
"Pairwise correlations ranged from 0.429 to 0.870 for intersoftware comparisons (Table 3)."

p.3376 "These values improve for ds000109, where the mean Dice coefficient for positive activations is 0.512. Here, AFNI and FSL were the only software packages to report significant negative clusters for the ds000109 study. Strikingly, these activations were found in completely different anatomical regions for each package, witnessed by the negative activation AFNI/FSL dice coefficient of 0. Finally, the AFNI/SPM Dice coefficient for the thresholded F-statistic images obtained for ds000120 is 0.684; it is notable that across all studies, the AFNI/SPM dice coefficients are consistently the largest."

Should instead appear as:
"These values improve considerably for ds000109, where the mean Dice coefficient for positive activations is 0.707. Since AFNI was the only software package to report any significant negative clusters for the ds000109 study, the AFNI/FSL and AFNI/SPM dice coefficient for negative activations is effectively 0 here. Finally, the AFNI/SPM Dice coefficient for the thresholded F-statistic images obtained for ds000120 is 0.684."

3.2 Cross-software variability for nonparametric inference
p.3378 'Quantitative assessment with Dice coefficients are shown in

p.3381 "While far from perfect, the ds000120 AFNI and SPM thresholded results have the best Dice similarity score, likely due to the use of a very strong main effect as an outcome of interest".

Should be removed.
p.3381 "It is only when analysing these results over the whole brain, that we discover broad differences in these activation patterns, for example, positive activation identified in the auditory cortex in FSL that was not reported by AFNI and SPM, and significant deactivation determined only by AFNI and FSL."

Should instead appear as:
"It is only when analysing these results over the whole brain, that we discover broad differences in these activation patterns, for example, positive activation identified in the cerebellum in AFNI and SPM that was not reported by FSL, and significant deactivation determined only by AFNI."

FIGURE 4 Dice coefficients comparing the thresholded positive and negative T-statistic maps computed for each pair of software package and inference method for each of the three reproduced studies. Dice coefficients were computed over the intersection of the pair of analysis masks, to assess only regions where activation could occur in both packages. Percentage of "spill over" activation, that is, the percentage of activation in one software's thresholded statistic map that fell outside of the analysis mask of the other software is displayed in grey; left value for row software, right value for column software. For ds000001 increases, FSL permutation obtained no significant results, thus generating Dice coefficients of zero; for ds000109 decreases, no comparisons are shown as only AFNI parametric obtained a result. However, this effectively means the AFNI/FSL and AFNI/SPM Dice coefficient is also zero here. Dice coefficients range between 0 and 0.75, and are commonly below 0.5 for comparisons between software packages. Parametric-nonparametric intrasoftware comparisons are higher, with most of the Dice coefficients above 0.8.
FIGURE 5 (a) Euler characteristic (EC) plots for ds000001 and ds000109. On top, comparisons of the Euler characteristic computed for each software's T-statistic map from our reanalyses using a range of T-value thresholds between −6 and 6. Below, comparisons of the ECs calculated using the same thresholds on the corresponding T-statistic images for permutation inference within each package. For each T-value the EC summarises the topology of the thresholded image, and the curves provide a signature of the structure of the entire image. For extreme thresholds the EC approximates the number of clusters, allowing a simple interpretation of the curves: for example, for ds000001 parametric analyses, FSL clearly has the fewest clusters for positive thresholds. (b) Cluster count plots for ds000001 and ds000109. On top, comparisons of the number of clusters found in each software's T-statistic map from our reanalyses using a range of T-value thresholds between −6 and 6. Below, comparisons of the cluster counts calculated using the same thresholds on the corresponding T-statistic images for permutation inference within each package.
FIGURE 6 Cross-software Bland-Altman 2D histograms for the ds000001 and ds000109 studies comparing the unthresholded group-level T-statistic maps computed using permutation inference methods within AFNI, FSL, and SPM. Similar to the results obtained using parametric inferences in Figure 3, all of the densities indicate large differences in the size of activations determined within each package.
FIGURE 7 Intrasoftware Bland-Altman 2D histograms for the ds000001 and ds000109 studies comparing the unthresholded group-level T-statistic maps computed for parametric and nonparametric inference methods in AFNI, FSL and SPM. Each comparison here uses the same preprocessed data, varying only the second level statistical model. SPM's parametric and nonparametric both use the same (unweighted) one-sample T-test, and thus show no differences. AFNI and FSL's parametric models use iterative estimation of between-subject variance and weighted least squares and thus show some differences, but still smaller than between-software comparisons.