Reliability of a functional magnetic resonance imaging task of emotional conflict in healthy participants

Abstract Task‐based functional neuroimaging methods are increasingly being used to identify biomarkers of treatment response in psychiatric disorders. To facilitate meaningful interpretation of neural correlates of tasks and their potential changes with treatment over time, understanding the reliability of the blood‐oxygen‐level dependent (BOLD) signal of such tasks is essential. We assessed test–retest reliability of an emotional conflict task in healthy participants collected as part of the Canadian Biomarker Integration Network in Depression. Data for 36 participants, scanned at three time points (weeks 0, 2, and 8) were analyzed, and intra‐class correlation coefficients (ICC) were used to quantify reliability. We observed moderate reliability (median ICC values between 0.5 and 0.6), within occipital, parietal, and temporal regions, specifically for conditions of lower cognitive complexity, that is, face, congruent or incongruent trials. For these conditions, activation was also observed within frontal and sub‐cortical regions, however, their reliability was poor (median ICC < 0.2). Clinically relevant prognostic markers based on task‐based fMRI require high predictive accuracy at an individual level. For this to be achieved, reliability of BOLD responses needs to be high. We have shown that reliability of the BOLD response to an emotional conflict task in healthy individuals is moderate. Implications of these findings to further inform studies of treatment effects and biomarker discovery are discussed.

Studies of neural activation during emotion-cognition interference tasks in MDD may be useful to identify biomarkers. One such emotion-cognition interference task is the emotional conflict task (Egner, Etkin, Gale, & Hirsch, 2008; Etkin, Egner, Peraza, Kandel, & Hirsch, 2006), which includes an emotional Stroop-like condition.
If the biomarker of interest is one that corresponds to treatment response, it is likely that longitudinal investigations will be required.
Longitudinal studies assume that blood-oxygen-level dependent (BOLD) responses to a task are relatively stable within individuals and over time (Fournier, Chase, Almeida, & Phillips, 2014; Nord, Gray, Charpentier, Robinson, & Roiser, 2017). If that is the case, and if treatment is introduced between time-points, changes in BOLD in response to a task could be interpreted as resulting from the intervention. Thus, to facilitate meaningful interpretation of the functional neural circuitry of cognitive and emotional processing over time, it is essential to first understand the test-retest reliability of the BOLD signal.
Reliability is generally accepted to be the consistency of a measure across repeated tests (Noble et al., 2017, p. 5415). Reliability can be assessed by various methods, including the intraclass correlation coefficient (ICC), Pearson correlation, coefficient of variation, cluster overlap, or voxel counts (Aurich, Alves Filho, Marques da Silva, & Franco, 2015). In fMRI research, test-retest reliability has primarily been assessed using ICC (e.g., Caceres, Hall, Zelaya, Williams, & Mehta, 2009; Elliott et al., 2019; Fournier et al., 2014). Studies assessing test-retest reliability of task-based fMRI signals have yielded varied findings. For example, relatively consistent activations over time and between healthy participants have been reported for cognitive paradigms such as a probabilistic classification learning task (Aron, Gluck, & Poldrack, 2006), whereas other studies employing either a reward-related guessing task (e.g., Chase et al., 2015), an emotion provocation task using neutral or fearful faces (e.g., Lipp, Murphy, Wise, & Caseras, 2014), or emotional face processing tasks (e.g., Nord et al., 2017) describe low test-retest reliability.
A recent meta-analysis and independent analysis of task-based fMRI data concludes that frequently used fMRI tasks do not show the test-retest reliability necessary for biomarker discovery (Elliott et al., 2019). They report an average ICC value of 0.397 across 90 studies in their meta-analysis, which reflects poor reliability. Independent analyses of 11 commonly used tasks revealed ICCs of <0.3, again indicating rather poor test-retest reliability (Elliott et al., 2019). However, poor reliability may not necessarily render a measure unusable as a biomarker, because low ICCs may reflect stable individual differences with little between-subject variability rather than unstable task performance (Hedge, Powell, & Sumner, 2018).
Emotional conflict tasks activate fronto-limbic circuitry, associating amygdala, cingulate, and prefrontal cortices with the generation, monitoring, and resolution of emotional conflict (Egner et al., 2008; Etkin et al., 2006). Reliability studies of tasks assessing emotional processing have shown poor reliability in the amygdala, ventral striatum, and cingulate cortices (e.g., Chase et al., 2015; Nord et al., 2017). If activation within this task is unstable with repeated testing in healthy participants, the analysis of such data for the purpose of defining biomarkers of treatment response may be problematic (Chase et al., 2015; Nord et al., 2017).
In this study, we assessed test-retest reliability of an emotional conflict task in healthy comparison participants collected within a Canadian Biomarker Integration Network in Depression (CAN-BIND) protocol (Lam et al., 2016). The CAN-BIND-1 Program aims to identify biomarkers of antidepressant treatment response in patients with major depressive disorder (MDD). The clinical protocol involves 8 weeks of open-label treatment with the antidepressant escitalopram, followed by 8 weeks of augmentation with the atypical antipsychotic aripiprazole in escitalopram non-responders (see Kennedy et al., 2019; Lam et al., 2016). Participants were scanned three times (weeks 0, 2, and 8), and changes in activation within fronto-limbic neural circuitry over time were assessed. We used intraclass correlation coefficients (ICC) to quantify reliability. Specifically, we used ICC(3,1) because it treats systematic differences between repeat scans as fixed effects and is thus better suited to characterize biomarkers (Raemaekers et al., 2007). ICCs have been employed in previous fMRI studies assessing test-retest reliability of various experimental tasks, reporting fair reliability at best, with the majority of studies reporting ICC values between 0.33 and 0.66 (Bennett & Miller, 2010; Fournier et al., 2014). Low ICC values reflect low test-retest reliability of neural patterns over time. This is concerning because, with low reliability, changes in activation cannot confidently be attributed to therapeutic effects, and misleading conclusions about the potential utility of such tasks for identifying biomarkers of treatment response may ensue (Fournier et al., 2014; Shadish, Cook, & Campbell, 2002).

| Participants
Fifty-nine healthy control participants were recruited at academic healthcare centers across Canada as a subset of participants in the first Canadian Biomarker Integration Network in Depression study (CAN-BIND-1; Lam et al., 2016). Participants were aged between 18 and 60 years, had no psychiatric or unstable medical diagnoses, and had sufficient fluency in English to complete all study procedures. Demographic information for participants is listed in Table 1. Ethical approval was obtained from institutional ethics boards at each site, and informed consent was obtained from all participants.
Useable neuroimaging data were available for 43 participants. Neuroimaging data were deemed unusable if they did not pass manual quality control (see MacQueen et al., 2019). Reasons for excluding scans were: excessive motion (n = 4), incomplete scan sequence (n = 1), and severe ghosting or other data quality issues (n = 6). For n = 5, task-based fMRI data were missing at one of the three time-points or participants withdrew. In addition, for six participants the behavioral data were either not useable (e.g., reaction times were not recorded, or no responses were made or recorded; n = 4) or did not meet the accuracy threshold (see below; n = 2). This resulted in one site having only one participant contribute to the sample, so this participant was removed from further analysis. Hence, data were analyzed for 36 participants at three time points: weeks 0, 2, and 8. Mean time elapsed between week 0 and week 2 testing was 14.2 (±1.7) days; between week 2 and week 8 it was 42.5 (±3.6) days.

| Design
The emotional conflict task (Egner et al., 2008; Etkin et al., 2006) assesses the cognitive cost that occurs when suppressing task-irrelevant information to attend to task-relevant information. The task comprises 148 black and white images (Ekman & Friesen, 1976) of either happy or fearful faces with the words "HAPPY" or "FEAR" superimposed on the images in bold red uppercase lettering. Stimuli were presented using E-Prime software version 2 (https://pstnet.com/products/e-prime-legacy-versions/) and displayed on a projection screen. Participants viewed the screen via a mirror attached to a head-coil. Stimuli were presented for 1 s and participants were instructed to identify the facial emotional expression via button press, as quickly and accurately as possible. Inter-stimulus intervals, during which participants were instructed to look at a projected fixation cross, varied between 3 and 5 s and were jittered. Images were counterbalanced for equal numbers of congruent and incongruent presentations, in two consecutive "runs." Participants completed the "runs" in the same order in each of their scan sessions to avoid between-session variance associated with order. Each "run" comprised 74 stimuli and lasted 6 min 35 s. Prior to scanning, participants practiced the task outside of the scanner, to demonstrate understanding of task requirements.
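The structure of a run described above can be sketched as follows (an illustrative reconstruction only, not the actual E-Prime script; all function and field names are our own):

```python
import random

def make_run(n_trials=74, seed=0):
    """Sketch of one task run: equal numbers of congruent and
    incongruent face/word pairings in randomized order, 1 s stimulus
    duration, and inter-stimulus fixation intervals jittered
    between 3 and 5 s."""
    rng = random.Random(seed)
    word_for = {"happy": "HAPPY", "fearful": "FEAR"}
    # counterbalance congruency across the run, then shuffle the order
    congruency = [True] * (n_trials // 2) + [False] * (n_trials // 2)
    rng.shuffle(congruency)
    run = []
    for congruent in congruency:
        face = rng.choice(["happy", "fearful"])
        other = "fearful" if face == "happy" else "happy"
        word = word_for[face] if congruent else word_for[other]
        isi = rng.uniform(3.0, 5.0)  # jittered fixation period (s)
        run.append({"face": face, "word": word, "congruent": congruent,
                    "duration": 1.0, "isi": isi})
    return run
```

In this sketch, an incongruent trial simply pairs a face with the emotion word of the opposite valence, as in the emotional Stroop-like condition described above.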
Dependent variables were reaction time (RT) and accuracy. For the analysis of RT data, error trials, post-error trials (i.e., the trial following an error trial), and trials where RT exceeded two standard deviations above or below the trial-type mean were not included.
The commission error threshold was set at 25% per run, and the threshold of total allowable errors (combined omission and commission errors) was set at 30% per run. For the analysis of accuracy, trials where RT exceeded two standard deviations from the trial-type mean, as well as post-error trials, were included. For the neuroimaging analysis of all trials (e.g., all faces), all trials were included, regardless of RT or accuracy.
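The RT exclusion rules above can be sketched as follows (a minimal illustration; the column names "rt", "correct", and "trial_type" are assumptions, not the actual CAN-BIND data format):

```python
import pandas as pd

def trim_rt(df):
    """Drop error trials, post-error trials, and trials with RT more
    than 2 SD above or below the trial-type mean (RT analysis only)."""
    # post-error: the trial immediately following an error trial
    post_error = ~df["correct"].shift(1, fill_value=True)
    df = df[df["correct"] & ~post_error]
    # per-trial-type 2 SD trimming, computed after error removal
    stats = df.groupby("trial_type")["rt"].agg(["mean", "std"])
    m = df["trial_type"].map(stats["mean"])
    s = df["trial_type"].map(stats["std"])
    return df[(df["rt"] - m).abs() <= 2 * s]
```

Accuracy analyses, by contrast, would keep the post-error and slow/fast trials, as described above.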
Trial types analyzed included all face trials (which included error and post-error trials), incongruent trials, congruent trials, incongruent minus congruent trials, fear minus happy face trials, fear minus happy word trials, and iI minus cI trials.

(c) Identification of outliers using censoring (CENSOR), with outlier volumes being discarded and replaced with interpolated values from neighboring volumes: significant outlier volumes in the fMRI time series are identified using the algorithm described by Campbell, Grigg, Saverino, Churchill, and Grady (2013), first validated for its impact on pipeline optimization by Churchill et al. (2015). It uses a robust sliding time-window approach to identify outlier scans and replaces them with values interpolated from neighboring scans via cubic splines (stand-alone software is available at nitrc.org/projects/spikecor_fmri). (d) Slice-timing correction (TIMECOR) with Fourier interpolation via AFNI's 3dTshift. (e) Spatial smoothing using AFNI's 3dBlurToFWHM to smooth fMRI images to FWHM = 6 mm in the x, y, and z directions, allowing for different intrinsic reconstructed smoothing levels at each site: to match spatial smoothing across MRI scanners at different sites, we used the 3dBlurToFWHM module in AFNI. Since the FWHM should reflect the spatial structure of the noise, we first regressed out the BOLD response modeled using the canonical hemodynamic response and used the resultant residual image as the "blur master," which controls the smoothing of the original image. Blurring is applied to both the original image and the residual until the smoothness of the residual reaches the desired FWHM = 6 mm.
Using a smoothing kernel with a FWHM of approximately twice the in-slice voxel width has previously been shown to provide major improvements in prediction and reproducibility for group analysis of an fMRI motor task (S. Strother et al., 2004). (f) Obtaining a binary mask that excludes non-brain voxels using AFNI's 3dAutomask algorithm and applying the resultant mask to all EPI volumes. (g) Neuronal tissue masking: a probabilistic mask was estimated to reduce the variance contribution of non-neuronal tissues in the brain (e.g., macro-vasculature, ventricles). This step uses the first part of the PHYCAA+ algorithm developed by Churchill and Strother (2013) to estimate task-run- and participant-specific neural tissue masks (software available at nitrc.org/projects/phycaa_plus). (h) Calculation of nuisance regressors to be regressed out from the data concurrently via multiple linear regression: temporal trends were modeled using a second-order Legendre polynomial basis set, and head-motion effects on the time series were modeled using participant motion parameter estimates (MPEs) obtained from MOTCOR (Step 3).
To obtain motion parameter regressors (MOTREG) per fMRI session, we performed PCA on the six MPE time-courses and used the largest-variance principal components, which preserved 85% of the variance, as motion regressors. This allowed us to maximize the amount of head-motion variance accounted for, while minimizing loss of power and collinearity effects due to unnecessary parameterization.
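The MOTREG step can be sketched as follows (a minimal illustration of PCA with an 85% variance criterion, not the actual pipeline code; function and argument names are our own):

```python
import numpy as np

def motion_regressors(mpe, var_kept=0.85):
    """PCA on the six motion-parameter time courses.

    mpe: array of shape (T, 6), one column per motion parameter.
    Returns the smallest set of principal-component time courses
    that preserves `var_kept` of the total variance."""
    X = mpe - mpe.mean(axis=0)               # center each parameter
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var = S ** 2 / np.sum(S ** 2)            # variance explained per component
    n_comp = int(np.searchsorted(np.cumsum(var), var_kept) + 1)
    return U[:, :n_comp] * S[:n_comp]        # component time courses
```

These component time courses would then enter the nuisance regression alongside the Legendre polynomial trend terms described above.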
Preprocessed fMRI output files generated by these preprocessing steps were in the original BOLD scan's brain space (Native_processed-fMRI) and were then transformed into normalized space (sNORM_processed_fMRI) to allow participants' results to be compared.

| Analysis of behavioral data
Statistical analyses of the behavioral data were completed using SPSS 25 (IBM Corporation, 2017). Demographic data were analyzed using one-way analyses of variance (ANOVA) for age and education, and with chi-square tests for sex and handedness. Behavioral data were assessed for normality using the Shapiro-Wilk test.

First-level contrasts, including iI minus cI trials, were generated. It is thought that the comparison of "high conflict resolution > low conflict resolution" identifies regions associated with "conflict resolution," whereas the contrast "low conflict resolution > high conflict resolution" would identify regions implicated either in the "generation of conflict" or the "monitoring of conflict." Second-level analyses across participants (but within each time point, that is, weeks 0, 2, and 8) were conducted in SPM12, using random-effects analyses with the above contrasts. For all conditions and contrasts tested, the cluster-level threshold to control for multiple comparisons was set at p < .05, family-wise-error (FWE) corrected, with a cluster size of 10 or more voxels.

| Reliability analyses
Test-retest reliability of brain activation was assessed using intraclass correlation coefficients (ICC) (Shrout & Fleiss, 1979), a standard method to quantify the reliability of measurements between multiple test sessions (Bennett & Miller, 2010). ICCs describe the stability of inter-individual differences in brain activation over time, assessing within-subject variance (σ²within) relative to between-subject variance (σ²between):

ICC(3,1) = (σ²between − σ²within) / (σ²between + σ²within)

Variance components were calculated from the individual contrast values separately for each trial type and for each time point, that is, weeks 0, 2, and 8. Participants were treated as random effects and sessions (time-points) were treated as fixed effects. ICCs can be interpreted as a ratio of variances (Bartko, 1966). ICCs approaching 1.0 suggest near-perfect agreement between test and retest measurements, that is, relative neural activation is consistent across time-points, whereas ICCs approaching 0 suggest no reliability. A negative ICC reflects a reliability of zero (Bartko, 1966) and can occur when the within-group variance exceeds the between-group variance (Lahey, Downey, & Saal, 1983). We assessed reliability using ICC(3,1), a measure of relative reliability, as is appropriate for multi-site fMRI data (Forsyth et al., 2014). ICC(3,1) measures the consistency between the repeated measurements, not the absolute agreement between them.
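As a sketch, the ICC(3,1) consistency estimator can be computed from an n × k matrix of contrast values (participants × sessions) via the two-way ANOVA mean squares of Shrout and Fleiss (1979). This is a simplified stand-in for the MATLAB ICC toolbox used here, with our own function name:

```python
import numpy as np

def icc_3_1(Y):
    """ICC(3,1): two-way mixed model, single measurement, consistency.
    Y: array of shape (n_subjects, k_sessions)."""
    n, k = Y.shape
    grand = Y.mean()
    subj_means = Y.mean(axis=1)
    sess_means = Y.mean(axis=0)
    # between-subjects mean square
    BMS = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    # residual mean square after removing subject and session effects
    resid = Y - subj_means[:, None] - sess_means[None, :] + grand
    EMS = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (BMS - EMS) / (BMS + (k - 1) * EMS)
```

Because session effects are treated as fixed and removed, a constant shift between scans (e.g., every participant activating slightly more at week 2) does not lower ICC(3,1), consistent with the "consistency, not absolute agreement" distinction above.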
ICCs were calculated for each voxel using the MATLAB-based ICC toolbox (Caceres et al., 2009). Following Caceres et al., the median ICC for each cluster was considered the primary reliability statistic of interest for that region. Median ICC was extracted for each significantly activated cluster, that is, based on the activity contrasts rather than ICC maps. Reliability was classified as "poor" (ICC < 0.4), "moderate to good" (ICC = 0.4 to 0.75), or "excellent" (ICC > 0.75) (Nord et al., 2017).
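Extracting the cluster-wise median ICC can be sketched as follows (an illustration of the summary step only; the toolbox's own implementation may differ):

```python
import numpy as np

def cluster_median_icc(icc_map, cluster_labels):
    """Median ICC per significantly activated cluster.

    icc_map: array of voxel-wise ICC values.
    cluster_labels: integer labels from the activation analysis,
    with 0 marking background (non-significant) voxels."""
    return {lab: float(np.median(icc_map[cluster_labels == lab]))
            for lab in np.unique(cluster_labels) if lab != 0}
```

Note that the clusters come from the activation contrasts, so the median summarizes ICC within task-defined regions rather than within peaks of the ICC map itself.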
We conducted whole-brain analyses using the neuromorphometric atlas as implemented in SPM, to explore whether additional regions emerged that may not have survived the significance thresholds set in our analysis of contrasts (see Table 3).

Although there were no differences in RT between happy and fear faces at week 0, participants were significantly faster in responding to happy relative to fear faces at weeks 2 and 8 (see Table 4). Accuracy rates did not differ for happy or fear faces at any of the time points. Sex differences were not assessed due to small sub-group numbers.
The ICCs of the behavioral measures are listed in Table 5. None of these demonstrated excellent reliability (i.e., ICCs of 0.8 or above), but several of these measures demonstrated good reliability, with ICCs between 0.5 and 0.6.

| Neuroimaging data

| All faces trials
The voxel-wise ICC estimates across the three time-points were moderate, at best, for regions showing significant BOLD activation (p < .05, FWE-corrected) for the "all faces trials" contrast (see Table 5 and Figure 1). We observed moderate reliability within these regions.

| Congruent face trials
Significant activation (p < .05, FWE-corrected) was observed in the fusiform gyrus and the middle temporal gyrus, as well as the right precuneus and the left post-central gyrus for congruent faces (see Table 5 and Figure 2).
Neural activation within these regions showed moderate reliability, with median ICCs = 0.5. Additionally, significant activation (p < .05, FWE-corrected) was observed within frontal and sub-cortical regions such as the right precentral gyrus, the left cingulate gyrus, and the left insula. In these areas, ICC values were 0.4. Poor reliability was observed for significant activation within the right thalamus, pyramis, insula, the right postcentral gyrus and the right superior frontal and right superior temporal gyri, as well as the left insula and the left parahippocampal gyrus.

| Incongruent face trials
The left middle temporal gyrus, right precuneus and right superior parietal lobule had significant activation (p < .05, FWE-corrected) for incongruent faces (Table 5, Figure 3). Reliability was moderate in these regions, with median ICCs = 0.5. Significant activation was also apparent in the right inferior frontal gyrus, the superior temporal gyrus and within the thalamus, but ICC values for these regions were modest at 0.4. Poor reliability was observed within the left insula and the right postcentral gyrus, median ICC = 0.3.

| Fear minus happy face trials
The voxel-wise ICC estimates for the more cognitively demanding comparisons, specifically the fear minus happy contrasts (both the word fear minus the word happy and fear faces minus happy faces), were very poor (median ICC = 0.1; Table 5).
For the fear word minus happy word contrast, significant activation (p < .05, FWE-corrected) was observed within the right supramarginal gyrus and the right inferior parietal lobule.
For the fear face minus happy face contrasts, significant activation (p < .05, FWE-corrected) was observed within the right superior and middle temporal gyri.

| Incongruent minus congruent trials
For the incongruent minus congruent contrast, significant activation (p < .05, FWE-corrected) was observed for the right inferior parietal lobule.

| Exploratory analyses: Whole-brain analyses using the neuromorphometric atlas
The voxel-wise ICC estimates across the three time-points for the "all faces" contrasts were most reliable within visual regions, including the calcarine cortex, the lingual gyri, the cuneus, and the inferior occipital gyri, all bilaterally and all with ICC = 0.6, and within the occipital fusiform gyri.

| DISCUSSION
The aim of this study was to examine the reliability of the BOLD signal for an emotional conflict task as a prerequisite to assessing the task's suitability to establish biomarkers of treatment response in clinical populations. Comparing across three time-points, weeks 0, 2 and 8, we observed moderate reliability (median ICC values between 0.5 and 0.6) within occipital, parietal and temporal regions, specifically for conditions of lower cognitive complexity, such as all faces, and congruent or incongruent trials relative to baseline. Activation was also observed within frontal and sub-cortical regions for the same conditions, but the median ICC values were poor. We did not observe "good" or "excellent" reliability for any regions. Median ICC values of 0.5 and 0.6 were also calculated for the lingual gyri, cuneus and occipital fusiform gyri for less cognitively demanding conditions, whereas poor reliability was observed for contrasts demanding more cognitive processing when using the neuromorphometrics template in exploratory analyses.
Our findings are consistent with previous reports (Chase et al., 2015; Fournier et al., 2014; Lipp et al., 2014; Nord et al., 2017) as well as a recent meta-analysis (Elliott et al., 2019), which reported an average ICC of 0.397 for unthresholded ICC estimates across 90 studies.

Word and letter processing has been associated with occipital regions such as the lingual gyrus (Mechelli, Humphreys, Mayall, Olson, & Price, 2000). The nature of trials in our task, congruent but also incongruent, may have contributed to activation of this region.
The cuneus is a primary visual area involved in response inhibition by contributing to motor responses rather than error monitoring (Booth et al., 2005; Haldane, Cunningham, Androutsos, & Frangou, 2008; Matthews, Simmons, Arce, & Paulus, 2005). The precuneus showed moderate reliability, primarily in response to the all faces condition or the congruent or incongruent conditions. The precuneus has a role in visuo-spatial imagery, episodic memory retrieval and self-processing operations (Cavanna & Trimble, 2006) and has been implicated in this task during the contrasting of congruent and incongruent conditions (Fournier et al., 2017). Fournier and colleagues explained that activation within this area is linked to switching from easier, congruent trials, which may reflect default processing (given the precuneus's role as a node of the default mode network), to more complex, incongruent trials (Fournier et al., 2017).
The superior parietal lobule is involved with processing visual information as it relates to spatial orientation (Corbetta, Kincade, Ollinger, McAvoy, & Shulman, 2000). Temporal regions such as the middle temporal gyrus showed significant activation in response to all faces but also congruent and incongruent conditions. This region is involved in the recognition of known faces but has also been linked to accessing word meaning while reading (Acheson & Hagoort, 2013).
Again, given that the stimuli used in this study comprise both faces and words, it is not surprising that activation in these areas was observed.
We detected activation in brain areas expected to be implicated in this task, for example, inferior and orbital prefrontal cortical (PFC) regions such as Brodmann area (BA) 44 and BA47, dorso-lateral and anterior PFC regions such as BA46 and BA10, in addition to subcortical structures, such as the cingulate gyri, the insula and the parahippocampus/amygdala (see Table 6). Within these regions, and in particular in response to cognitively more complex contrasts, such as "iI < cI" or "incongruent < congruent" trials, reliability was poor (median ICC ≤ 0.1). Activation in BA44 has previously been linked to selective response suppression in response-inhibition tasks, such as a go/no-go task (Forstmann, van den Wildenberg, & Ridderinkhof, 2008) as well as to hand-movements (Rizzolatti, Fogassi, & Gallese, 2002). The dorso-lateral and anterior prefrontal regions are implicated in task-aspects such as sustained attention and executive processing.
Both the insula and the cingulate subserve the task employed here. Egner and colleagues reported activation of cingulate regions for the iI minus cI contrast, which is also referred to as conflict monitoring. Here we also observed significant activation in both the dorsal anterior and posterior cingulate in response to "iI versus cI" trials, that is, high-conflict versus low-conflict trials (Egner et al., 2008). However, ICC values for these regions were poor.
Insula activation was observed in response to both incongruent and congruent trials, as well as in the all faces versus baseline condition. The insula is also an important node in the salience network, in relation to response selection and selective attention. Insula activation was previously reported in response to incongruent minus congruent trials in an MDD group (Fournier et al., 2017).
The lack of convincing reliability observed here in a substantial number of conditions and contrasts has significant implications for studies using task-based fMRI to identify biomarkers of treatment response for any psychiatric disorder, not just depression. For the most part, we observed activation in regions that are both understandable given the nature of the task and consistent with previous reports using this task. To that extent, the task met expectations in healthy comparison participants. Nonetheless, reliable activation of key regions across repeated testing in healthy participants was not apparent. Our results are in line with the conclusions of a recent meta-analysis and subsequent confirmatory findings of poor test-retest reliability in a variety of fMRI-based tasks (Elliott et al., 2019).
This suggests that the suitability of such a task for uncovering biomarkers of treatment response in any patient population using repeated measures is questionable, as any associations with treatment would have to be distinguishable from fluctuations in activation that appear to be inherent to the task.
These results are consistent with test-retest reliability studies of other task-based and resting-state fMRI, which have reported reliability in the poor to good range (Chase et al., 2015; Fournier et al., 2014; Lipp et al., 2014; Nord et al., 2017; Plichta et al., 2012; Shah, Cramer, Ferguson, Birn, & Anderson, 2016; Shehzad et al., 2009; Shou et al., 2013). In our cohort, reliability was better (i.e., moderate) in cortical regions, but typically poor in sub-cortical structures, corresponding to previous reports (Fournier et al., 2014). Non-cortical regions may overall be less reliable (Shah et al., 2016) because of the smaller sizes of sub-cortical structures (Noble, Spann, et al., 2017). Furthermore, it has been reported that the magnitude of ICC is influenced by the complexity of the functional contrasts investigated (e.g., Brown et al., 2011).
We observed that reliability was higher for conditions that were less cognitively demanding (e.g., all faces versus baseline, congruent or incongruent trials relative to baseline) than for contrasts that were related to higher cognitive demand, for example, conflict monitoring or the emotional Stroop effect. Trials of less cognitive complexity are thought to retain more of the BOLD signal relative to higher cognitive complexity contrasts, for which potentially larger subtractions of neural activity result in less BOLD signal, subsequently reducing ICC (Brown et al., 2011). Thus, considering a trade-off between the complexity of the model, or contrast, and its interpretability is important when assessing test-retest reliability, especially when an intervention is introduced between scans. Direct comparison of active contrasts (e.g., incongruent minus congruent) has also shown poor reliability in other neuroimaging studies (e.g., Infantolino, Luking, Sauder, Curtin, & Hajcak, 2018); however, main effects, or conditions, such as congruent or incongruent, showed comparatively better reliability. Furthermore, observed differences in reliability may also be related to the nature of the task, and more so to the similarity of its trials. For example, the fear versus happy face contrasts may involve measures of similar variance or high correlation, which can appear less reliable; see Hedge et al. (2018), who state that measures may be less reliable when highly correlated or of similar variance. Indeed, assessments of ICCs for the behavioral equivalents of the neuroimaging contrasts (e.g., accuracy for fear vs. happy; incongruent vs. congruent) showed moderate reliability (0.45 and 0.58, respectively). This is comparable to the ICCs of the behavioral data for the overall conditions. However, for reaction time, the ICC for the incongruent versus congruent contrast was reduced relative to the overall conditions, further supporting the findings of Infantolino et al. (2018) and Hedge et al.
(2018), that in addition to examining the reliability of neural measures, the behavioral measures should be assessed as well. Additionally, the fact that some conditions (e.g., all faces, congruent or incongruent) had a larger number of trials than other contrasts, such as conflict monitoring, needs to be considered; poor reliability in those trials may, at least in part, relate to low statistical power (Brown et al., 2011).
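The subtraction logic above can be illustrated with a toy simulation, using the test-retest Pearson correlation as a simple stand-in for ICC (all variance parameters here are illustrative assumptions, not estimates from our data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # simulated participants

# A stable trait drives both conditions; the condition difference is small.
trait = rng.normal(size=n)
effect = 0.1 * rng.normal(size=n)  # small, subject-specific contrast effect

def session():
    """One scan session: two condition scores per participant."""
    cond_a = trait + 0.3 * rng.normal(size=n)
    cond_b = trait + effect + 0.3 * rng.normal(size=n)
    return cond_a, cond_b

a1, b1 = session()
a2, b2 = session()

r_condition = np.corrcoef(a1, a2)[0, 1]           # reliability of a condition
r_contrast = np.corrcoef(a1 - b1, a2 - b2)[0, 1]  # reliability of the contrast
# The shared trait variance cancels in the difference score, so the
# contrast's test-retest correlation is far lower than the condition's.
```

This mirrors the pattern reported above: individual conditions remain reliable because they carry the stable between-subject variance, while the subtraction leaves mostly the small, noisy effect.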
Whole-brain neuromorphometric analyses suggested that the most reliable regions (e.g., visual/occipital regions) may not be regions related to task-relevant activation (e.g., cingulate cortex). Noble and colleagues, using resting-state fMRI, reported similar observations, stating that the most reliable edges are not necessarily the most informative ones, and vice versa. Likewise, Plichta et al. (2012) reported that in response to three different tasks (a reward task, a faces task, and an n-back task), the voxels responding most strongly were not necessarily the ones showing the most reliable pattern of activation, and that voxels showing high ICC values were observed in regions not necessarily engaged by the task. Therefore, it may be possible that meaningful information unique to each individual could be captured by data with relatively low test-retest reliability.
This may, however, hinder the development of fMRI predictive biomarkers of treatment response.

| Limitations
The methods employed to assess reliability need to be taken into consideration when interpreting our findings in the wider context. For one, results are reported for the reliability of clusters obtained from univariate GLM statistics. It has previously been reported that both the modeling approaches chosen for analyses (Fournier et al., 2014) and the selection of preprocessing steps (Churchill et al., 2015) affect signal detection and, subsequently, test-retest reliability. Studies, primarily using resting-state fMRI data, have shown that preprocessing parameters, such as censoring based on outliers within functional time series, affect reliability estimates of connectivity measures (e.g., Aurich et al., 2015). Evaluating the effect of preprocessing pipelines on reliability measures of task-based fMRI data could thus be of future interest.
We employed an FWE-corrected threshold of p < .05, which has previously been regarded as too lenient (Eklund, Nichols, & Knutsson, 2016). Most clusters, however, were significant at p < .001, as evident in Table 6. Multivariate assessments of test-retest reliability have also previously been shown to improve reliability estimates over univariate methods and should thus be explored further in future work.
Secondly, the measure employed here to assess test-retest reliability, the ICC, is a statistical estimate of reliability rather than a direct marker of test-retest stability, and it can be affected by factors other than the underlying stability of the BOLD signal (Fournier et al., 2014). Furthermore, we have not analyzed the breakdown of the ICC variance components and therefore cannot completely rule out additional effects of time on the ICC estimates.
In addition, homogeneous samples, such as our group of healthy adult participants, may have reduced ICC estimates (Bennett & Miller, 2010), because the ICC depends on the between-subject variance relative to the total variance. Assessing reliability in patient populations might provide additional tests of the utility of fMRI biomarkers in treatment response research, but because the CAN-BIND study included an intervention for all patient participants, this could not be assessed here.
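The effect of sample homogeneity can be made concrete with a brief sketch (our illustration, not code from the study pipeline) of how ICC(3,1) (Shrout & Fleiss, 1979) is computed from a subjects-by-sessions matrix: shrinking the between-subject spread alone lowers the estimate even when the session-to-session noise is unchanged.

```python
import numpy as np

def icc_3_1(data):
    """ICC(3,1): two-way mixed effects, consistency, single measurement
    (Shrout & Fleiss, 1979). `data` is a subjects x sessions array."""
    n, k = data.shape
    grand = data.mean()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_sess = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_subj - ss_sess
    ms_subj = ss_subj / (n - 1)             # between-subject mean square
    ms_err = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(36, 3))             # identical session noise
hetero = rng.normal(0.0, 2.0, size=(36, 1)) + noise    # wide between-subject spread
homog = rng.normal(0.0, 0.5, size=(36, 1)) + noise     # homogeneous sample
# The homogeneous sample yields the lower ICC despite identical noise:
# icc_3_1(hetero) > icc_3_1(homog)
```

The simulated group sizes (36 subjects, 3 sessions) mirror the design of this study, but the variance parameters are arbitrary and chosen only to illustrate the dependence on between-subject spread.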
We used whole-brain approaches, whereas previous studies assessing test-retest reliability mostly used region-of-interest analyses (e.g., Chase et al., 2015; Nord et al., 2017). Restricting our analyses to a priori defined regions might have improved ICCs, but our exploratory analysis using the neuromorphometrics template assessed amygdala and cingulate regions and showed poor reliability, comparable to previous observations (e.g., Nord et al., 2017).

| CONCLUSION
In this study, the reliability of the BOLD signal in regions subserving an emotional conflict task was poor to moderate, despite behavioral and activation measures suggesting that the task performed as expected at all three time points. These results are consistent with other reports. Clinically relevant prognostic markers based on task-based fMRI would require high predictive accuracy at an individual level, and for this to be achieved, BOLD responses need to be highly reliable. Typical analyses of task-based fMRI of cognitive-emotional processes therefore appear to lack the reliability required to uncover biomarkers of treatment response in longitudinal clinical studies.
However, it should also be considered that low reliability of task-based fMRI markers does not necessarily mean their potential in biomarker discovery is lost, but rather that the reasons for this lack of consistency need to be evaluated appropriately (Hedge et al., 2018).
Novel analytic methods may be required to determine whether these tasks have utility as predictive tasks in clinical trials.

DATA AVAILABILITY STATEMENT
Data subject to third-party restrictions. The data that support the findings of this study are available from the Ontario Brain Institute (OBI). Restrictions apply to the availability of these data, which were used under license for this study. Data will be made available with the permission of OBI at a later date.