Reliability of energy landscape analysis of resting-state functional MRI data

Energy landscape analysis is a data-driven method to analyze multidimensional time series, including functional magnetic resonance imaging (fMRI) data. It has been shown to be a useful characterization of fMRI data in health and disease. It fits an Ising model to the data and captures the dynamics of the data as movement of a noisy ball constrained on the energy landscape derived from the estimated Ising model. In the present study, we examine test-retest reliability of the energy landscape analysis. To this end, we construct a permutation test that assesses whether or not indices characterizing the energy landscape are more consistent across different sets of scanning sessions from the same participant (i.e., within-participant reliability) than across different sets of sessions from different participants (i.e., between-participant reliability). We show that the energy landscape analysis has significantly higher within-participant than between-participant test-retest reliability with respect to four commonly used indices. We also show that a variational Bayesian method, which enables us to estimate energy landscapes tailored to each participant, displays comparable test-retest reliability to that using the conventional likelihood maximization method. The proposed methodology paves the way to perform individual-level energy landscape analysis for given data sets with a statistically controlled reliability.

Population-level inferences are a common practice for analyzing brain activity in empirical data. However, both the structure and dynamics of the brain vary even among healthy individuals, let alone among individuals belonging to a disease group due to the heterogeneity of the disease. Therefore, although population-level inferences increase the data size and often help us to reach statistically significant observations, they may yield inaccurate results and loss of information when the observed data are individual-specific. To avoid population-level inferences, it is necessary to establish the reliability of individual-level inferences of collective brain dynamics.
Finn et al. examined the role of individual variability in functional networks measured by functional magnetic resonance imaging (fMRI) and its ability to act as a fingerprint to identify individuals [18] (see [19,20] for earlier studies). In order for individual fingerprinting to be successful, the test-retest reliability of the functional network must be higher across sessions obtained from the same individual (i.e., within-participant reliability) than across sessions obtained from different individuals (i.e., between-participant reliability). Indeed, it was found that within-participant reliability was robust and that both resting-state and task fMRI from different sessions of the same individual could be used to perform fingerprinting [21]. Other studies also confirmed the ability of functional networks from fMRI data to act as fingerprints of individuals, including the development of different methods to quantify and improve fingerprinting [22-26]. The ability of functional connectivity to act as an individual fingerprint has also been confirmed with electroencephalogram (EEG) [27] and magnetoencephalogram (MEG) data [28,29].
Functional networks or their dynamic variants are not the only tools for analyzing brain dynamics or fingerprinting individuals. One way to analyze fMRI or other multidimensional time series data from the brain is to infer dynamics of discrete states. Each state may correspond to a particular functional network [13,15,30-33] or a spatial activation pattern [17,34,35], and the transition from one state to another may correspond to a regime shift in the brain. Energy landscape analysis is a method to characterize brain dynamics as the movement of a stochastic ball constrained on an energy landscape inferred from the data [36-38]. Quantifications of the estimated energy landscapes, such as the height of the barrier between two local minima of the energy, allow intuitive interpretations; a local minimum of the energy is a particular spatial activity pattern and defines a discrete brain state. A high barrier between two local minima implies that it is difficult for the brain dynamics to transit between the two local minima. Indices from energy landscape analysis have been shown to be associated with the behavior of healthy individuals in a bistable visual perception task [37,39], as well as with executive function [40], fluid intelligence [41], healthy aging [42], autism [43], Alzheimer's disease [44], schizophrenia [45,46], attention deficit hyperactivity disorder [47], and epilepsy [48].
These successful applications of energy landscape analysis likely owe to advantages of the method over related methods such as functional network analysis and hidden Markov models. For example, with energy landscape analysis, one can borrow concepts and computational tools from the statistical physics of spin systems to quantify the ease of state transitions by the energy barrier [38] and the complexity of the dynamics by different phases (e.g., spin-glass phase) and susceptibility indices [41]. In addition, each network state is by definition a binary activity pattern among a pre-specified set of regions of interest (ROIs) and is therefore relatively easy to interpret. Despite its expanding applications, the validity of the energy landscape analysis has not been extensively studied, except that one can measure the accuracy of fit of the model to the given data [38,49-52]. A high accuracy of fit does not imply that the estimated energy landscape is a reliable fingerprint for individuals. In fact, if fMRI data are nonstationary, the energy landscapes estimated for the same individual in two time windows may be substantially different from each other, whereas the accuracy of fit may be high in both time windows. Furthermore, the original energy landscape analysis method requires pooling of fMRI data from different individuals unless the number of ROIs to be used is relatively small (e.g., 7) or the scanning session is extremely long. This is because the method is relatively data hungry [38]. The concept of individual fingerprinting is unclear when pooling of data is necessary.
The present study is a precursor to being able to assess individual differences. We assess the potential utility of energy landscape analysis in individual fingerprinting by investigating its test-retest reliability. Specifically, we ask to what extent features of the estimated energy landscapes are reproducible across different sessions from the same individual as opposed to across sessions belonging to different sets of individuals. We hypothesize that test-retest reliability is higher between sessions for the same individual than between sessions for different individuals. Code for computing energy landscapes with the conventional and Bayesian methods is available on GitHub [53].

Midnight Scan Club data
We primarily use the fMRI data in the Midnight Scan Club (MSC) data set [22]. The MSC data set contains five hours of resting-state fMRI data in total, recorded from each of 10 healthy human adults across 10 consecutive nights. Each resting-state fMRI scanning session lasted 30 minutes and yielded 818 volumes. Imaging was performed with a Siemens TRIO 3T MRI scanner using an echo-planar imaging (EPI) sequence (TR = 2.2 s, TE = 27 ms, flip angle = 90°, voxel size = 4 mm × 4 mm × 4 mm, 36 slices).
The original paper reported that the eighth participant (i.e., MSC08) fell asleep, showed frequent and prolonged eye closures, and had systematically large head motion, resulting in much less reliable data than those obtained from the other participants [22]. We also noticed that the accuracy of fitting the energy landscape, which we will explain in section 2.4, fluctuated considerably across the different sessions for the tenth participant (i.e., MSC10), suggesting unstable quality of MSC10's data across sessions. Therefore, we excluded MSC08 and MSC10 from the analysis.
We used SPM12 (http://www.fil.ion.ucl.ac.uk/spm) to preprocess the resting-state fMRI data as follows: we first conducted realignment, unwarping, slice-timing correction, and normalization to a standard template (ICBM 152); then, we performed regression analyses to remove the effects of head motion, white matter signals, and cerebrospinal fluid signals; finally, we conducted band-pass temporal filtering (0.01-0.1 Hz).
We determined the ROIs of the whole-brain network using an atlas with 264 spherical ROIs whose coordinates were set in a previous study [54]. We then removed 50 ROIs labeled 'uncertain' or 'subcortical', which left us with 214 ROIs. Each of the 214 ROIs was assigned to one of nine functionally different brain networks, i.e., the auditory network, dorsal attention network (DAN), ventral attention network (VAN), cingulo-opercular network (CON), default mode network (DMN), fronto-parietal network (FPN), salience network (SAN), somatosensory and motor network (SMN), and visual network. We merged the DAN, VAN, and CON into an attention network (ATN) to reduce the number of observables from nine to seven, as we did in our previous study [43]. We did so because of the relatively short length of the data; energy landscape analysis requires sufficiently long data sets when one works with nine observables. In fact, the DAN, VAN, and CON are considered to be responsible for similar attention-related cognitive activity [54], justifying the merger of the three systems into the ATN. We call the obtained N = 7 dimensional time series of the fMRI signal the whole-brain network. We calculated the fMRI signal for each of the seven networks (i.e., ATN, auditory network, DMN, FPN, SAN, SMN, and visual network) by averaging the fMRI signal over the volumes in the sphere of radius 4 mm around each ROI and over all ROIs belonging to the network.
In addition to the whole-brain network, we used a separate 30-ROI coordinate system [55] to determine the multi-ROI DMN and CON. We used a different parcellation system for the DMN and CON than the 264-ROI system used for the whole-brain network because the former (i.e., 30-ROI) coordinate system provides much fewer ROIs for the DMN and CON than the 264-ROI system does, which is convenient for energy landscape analysis. The original study identified 12 and 7 ROIs for the DMN and CON, respectively [55]. To reduce the dimension of the DMN, we averaged each pair of symmetrically located right- and left-hemisphere ROIs in the DMN into one observable. The symmetrized DMN, which we simply call the DMN, has eight ROIs because four ROIs (i.e., amPFC, vmPFC, pCC, and retro splen) in the original DMN are almost on the midline and therefore did not undergo the averaging between the right- and left-hemisphere ROIs [42]. For the CON, we used the original seven ROIs as the observables. Note that the whole-brain network contains the DMN and CON as single observables, whereas the DMN and CON we are introducing here are themselves systems containing N = 8 and N = 7 observables, respectively.
We denote the fMRI signal for the $i$th ROI at time $t$ by $x_i^t$ ($i = 1, \ldots, N$; $t = 1, \ldots, t_{\max}$), where $N$ is the number of ROIs and $t_{\max}$ is the number of time points. We then removed the global signal and transformed the signals into their z-values using $z_i^t = (x_i^t - m_t)/s_t$, where $m_t$ and $s_t$ represent the mean and standard deviation, respectively, of $x_i^t$ over the $N$ ROIs at time $t$; $m_t$ is the global signal [56]. The global signal in resting-state fMRI data is considered to be dominated by physiological noise mainly originating from respiratory, scanner-related, and motion-related artifacts. Global signal removal improves various quality-control metrics, enhances the anatomical specificity of functional-connectivity patterns, and can increase the behavioral variance [57,58]. The same or similar global signal removal was carried out in previous energy landscape studies [41,42].
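The per-volume z-scoring above is straightforward to implement. The following is a minimal sketch, assuming the session is stored as a t_max × N NumPy array; the function name is ours, not from the authors' code:

```python
import numpy as np

def remove_global_signal(x):
    """Z-score each volume across ROIs, removing the global signal.

    x : array of shape (t_max, N) holding the fMRI signal x_i^t.
    Returns z with z[t, i] = (x[t, i] - m_t) / s_t, where m_t (the global
    signal) and s_t are the mean and standard deviation across ROIs at time t.
    """
    m = x.mean(axis=1, keepdims=True)  # global signal m_t
    s = x.std(axis=1, keepdims=True)   # spatial standard deviation s_t
    return (x - m) / s

rng = np.random.default_rng(0)
x = rng.normal(size=(818, 7))          # e.g., one 818-volume session, 7 networks
z = remove_global_signal(x)
print(np.allclose(z.mean(axis=1), 0))  # True: every volume now has zero mean across ROIs
```

Note that, after this step, each volume has zero mean and unit standard deviation across ROIs, so all subsequent analysis operates on spatial patterns rather than on the overall signal level.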

Human Connectome Project data
For validation, we also analyzed fMRI data that were recorded from healthy human participants and shared as the S1200 data in the Human Connectome Project (HCP) [59]. In the data set, 1200 adults between 22 and 35 years old underwent four sessions of a 15-min EPI sequence with a 3T Siemens Connectome-Skyra (TR = 0.72 s, TE = 33.1 ms, 72 slices, 2.0 mm isotropic, field of view (FOV) = 208 × 180 mm) and a T1-weighted sequence (TR = 2.4 s, TE = 2.14 ms, 0.7 mm isotropic, FOV = 224 × 224 mm). Here, we limited our analysis to the 100 unrelated participant subset released by the HCP. We confirmed that all these 100 participants were among the subset of participants who completed diffusion-weighted MRI as well as two resting-state fMRI scans.
The resting-state fMRI data of each participant are composed of two sessions, and each session is broken down into a Left-Right (LR) and a Right-Left (RL) phase. We used data from participants with at least 1150 volumes in each of the four sessions after removal of volumes with motion artifacts, which left us with 87 participants. For these 87 participants, we first removed the volumes with motion artifacts. Then, we used the last 1150 volumes in each session to remove possible transient effects.
We used independent component analysis (ICA) to remove nuisance and motion signals [60]. Furthermore, any volumes with frame displacement greater than 0.2 mm [61] were excised [62] because the ICA-FIX pipeline has been found not to fully remove motion-related artifacts [63,64]. We standardized each voxel by subtracting the temporal mean. Lastly, global signal regression of the same form as that for the MSC data (see section 2.1) was used to remove the remaining noise.
In each volume, we averaged the fMRI signal over all the voxels within each ROI of the AAL atlas [65]. Note that this atlas is composed of 116 ROIs. Then, we mapped each cortical ROI to one of the systems of the Schaefer-100 atlas [66]. System assignment was based on minimizing the Euclidean distance from the centroid of an ROI in the AAL atlas to the centroid of the corresponding ROI in the Schaefer atlas. We removed 42 ROIs labeled 'subcortical' or 'cerebellar', which left us with 74 ROIs. These 74 ROIs were each assigned to one of the N = 7 functionally different brain networks: control network, DMN, DAN, limbic network, salience/ventral attention network, somatomotor network, and visual network, altogether defining a whole-brain network.

Fitting of the pairwise maximum entropy model
To carry out energy landscape analysis, we fit the pairwise maximum entropy model (MEM), also known as the Ising model, to the preprocessed fMRI data in essentially the same manner as in previous studies [38,67].
For each session, we first binarized $z_i^t$ for each $i$th ROI (with $i \in \{1, \ldots, N\}$) and time $t$ (with $t \in \{1, \ldots, t_{\max}\}$) using a threshold that we set to the time average of $z_i^t$. A computational study showed that binarization did not discard important information contained in originally continuous brain signals [68]. We denote the binarized signal at the $i$th ROI and time $t$ by $\sigma_i^t$, which is either $+1$ or $-1$ depending on whether $z_i^t$ is larger or smaller than the threshold, respectively. The activity pattern of the entire network at time $t$ is described by the $N$-dimensional vector $V_t = [\sigma_1^t, \ldots, \sigma_N^t]$. It should be noted that there are $2^N$ activity patterns in total, enumerated as $V_1, \ldots, V_{2^N}$. The empirical mean activity at the $i$th ROI is denoted by $\langle \sigma_i \rangle = (1/t_{\max}) \sum_{t=1}^{t_{\max}} \sigma_i^t$. The empirical mean pairwise joint activation for the $i$th and $j$th ROIs is defined by $\langle \sigma_i \sigma_j \rangle = (1/t_{\max}) \sum_{t=1}^{t_{\max}} \sigma_i^t \sigma_j^t$. The pairwise MEM maximizes the entropy of the distribution of activity patterns under the condition that $\langle \sigma_i \rangle$ and $\langle \sigma_i \sigma_j \rangle$ (with $1 \le i \le j \le N$) are the same between the estimated model and the empirical data. The resulting probability distribution of activity pattern $V = [\sigma_1, \ldots, \sigma_N]$, denoted by $P(V)$, obeys the Boltzmann distribution [69] given by

$P(V) = \frac{e^{-E(V)}}{\sum_{k=1}^{2^N} e^{-E(V_k)}}$, (4)

where $E(V)$ represents the energy of activity pattern $V$ given by

$E(V) = -\sum_{i=1}^{N} h_i \sigma_i - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} J_{ij} \sigma_i \sigma_j$. (5)

In Eq. (5), the fitting parameter $h_i$ represents the tendency for the $i$th ROI to be active (i.e., $\sigma_i = +1$), and $J_{ij}$ quantifies the pairwise interaction between the $i$th and $j$th ROIs. We denote the mean activity and mean pairwise joint activation under the estimated model by $\langle \sigma_i \rangle_{\rm m}$ and $\langle \sigma_i \sigma_j \rangle_{\rm m}$, respectively. By definition, we obtain $\langle \sigma_i \rangle_{\rm m} = \sum_{k=1}^{2^N} \sigma_i(V_k) P(V_k)$ and $\langle \sigma_i \sigma_j \rangle_{\rm m} = \sum_{k=1}^{2^N} \sigma_i(V_k) \sigma_j(V_k) P(V_k)$, where $\sigma_i(V_k)$ denotes the $i$th entry of $V_k$. We calculated $h_i$ and $J_{ij}$ by iteratively adjusting $\langle \sigma_i \rangle_{\rm m}$ and $\langle \sigma_i \sigma_j \rangle_{\rm m}$ towards the empirical values, i.e., $\langle \sigma_i \rangle$ and $\langle \sigma_i \sigma_j \rangle$, respectively, using a gradient ascent algorithm. The iteration scheme is given by

$h_i^{\rm new} = h_i^{\rm old} + \epsilon \left( \langle \sigma_i \rangle - \langle \sigma_i \rangle_{\rm m} \right)$

and

$J_{ij}^{\rm new} = J_{ij}^{\rm old} + \epsilon \left( \langle \sigma_i \sigma_j \rangle - \langle \sigma_i \sigma_j \rangle_{\rm m} \right)$,

where superscripts "new" and "old" represent the values after and before a single updating step, respectively, and $\epsilon$ is the learning rate. We set $\epsilon = 0.2$.
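The binarization and the gradient-ascent fit can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it enumerates all 2^N states explicitly, which is feasible only for small N (such as the N = 7 used here), and all names are ours.

```python
import itertools
import numpy as np

def binarize(z):
    """Binarize z_i^t with the time average of each ROI as the threshold."""
    return np.where(z > z.mean(axis=0), 1, -1)

def fit_pairwise_mem(sigma, epsilon=0.2, n_iter=5000):
    """Fit h_i and J_ij of the pairwise MEM by gradient ascent on the likelihood.

    sigma : (t_max, N) array with entries in {-1, +1}.
    """
    t_max, N = sigma.shape
    states = np.array(list(itertools.product([-1, 1], repeat=N)))  # all 2^N patterns
    emp_mean = sigma.mean(axis=0)          # <sigma_i>
    emp_corr = sigma.T @ sigma / t_max     # <sigma_i sigma_j>
    h, J = np.zeros(N), np.zeros((N, N))
    for _ in range(n_iter):
        # E(V) = -sum_i h_i sigma_i - sum_{i<j} J_ij sigma_i sigma_j; with J
        # symmetric and zero-diagonal, the pairwise term equals sigma^T J sigma / 2.
        E = -states @ h - 0.5 * np.einsum('ki,ij,kj->k', states, J, states)
        P = np.exp(-E)
        P /= P.sum()                       # Boltzmann distribution
        mod_mean = P @ states              # <sigma_i>_m
        mod_corr = states.T @ (P[:, None] * states)  # <sigma_i sigma_j>_m
        h += epsilon * (emp_mean - mod_mean)
        dJ = epsilon * (emp_corr - mod_corr)
        np.fill_diagonal(dJ, 0.0)          # sigma_i^2 = 1, so no diagonal update
        J += dJ
    return h, J
```

Because the log-likelihood of this exponential-family model is concave in (h, J), the iteration converges to the unique parameters that reproduce the empirical first and second moments.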

Accuracy of fit
We evaluated the accuracy of fit of the pairwise MEM to the given fMRI data [38,42,50]. The accuracy index is given by

$r_D = \frac{D_1 - D_2}{D_1}$,

where

$D_\ell = \sum_{k=1}^{2^N} P_N(V_k) \log \frac{P_N(V_k)}{P_\ell(V_k)}$

is the Kullback-Leibler divergence between the probability distribution of the activity pattern in the $\ell$th-order ($\ell = 1, 2$) MEM, $P_\ell(V)$, and the empirical probability distribution of the activity pattern, denoted by $P_N(V)$. Note that $P_2(V)$ is equivalent to $P(V)$ given by Eqs. (4) and (5). The first-order, or independent, MEM (i.e., $\ell = 1$) is Eq. (4) estimated without interaction terms, that is, $J_{ij} = 0$ $\forall i, j$ in Eq. (5). We obtain $r_D = 1$ when the pairwise MEM perfectly fits the empirical distribution of the activity pattern, and $r_D = 0$ when the pairwise MEM does not fit the data any better than the independent MEM. To assess the dependency of $r_D$ on the number of sessions to be concatenated for the estimation of the pairwise MEM, $m$, the network (i.e., whole-brain, DMN, or CON), and the type of concatenation (i.e., within-participant or between-participant), we examined the multivariate linear regression model given by

$r_D = \beta_0 + \beta_1 m + \beta_2 I_{\rm whole} + \beta_3 I_{\rm CON} + \beta_4 I_{\rm within} + \varepsilon$. (12)

In Eq. (12), $\beta_0$ is the intercept, dummy variable $I_{\rm whole}$ is equal to 1 for the whole-brain network and 0 for the other two networks, $I_{\rm CON}$ is equal to 1 for the CON and 0 for the other two networks, and $I_{\rm within}$ is equal to 1 for the within-participant comparison and 0 for the between-participant comparison.
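Under the assumption that $D_\ell$ is the Kullback-Leibler divergence of the empirical distribution $P_N$ from the $\ell$th-order model $P_\ell$, and using the closed form of the independent MEM, the accuracy index can be sketched as follows (function and variable names are ours):

```python
import itertools
import numpy as np

def accuracy_of_fit(sigma, P2):
    """Accuracy index r_D = (D_1 - D_2) / D_1 of the pairwise MEM.

    sigma : (t_max, N) binarized data in {-1, +1}.
    P2    : probability of each of the 2^N states under the fitted pairwise
            MEM, ordered as itertools.product([-1, 1], repeat=N).
    """
    t_max, N = sigma.shape
    states = np.array(list(itertools.product([-1, 1], repeat=N)))
    # Empirical distribution P_N: map each volume to its index in `states`
    # (ROI 0 is the most significant bit; -1 -> 0 and +1 -> 1).
    idx = ((sigma + 1) // 2) @ (2 ** np.arange(N - 1, -1, -1))
    PN = np.bincount(idx, minlength=2 ** N) / t_max
    # Independent (first-order) MEM: P_1(V) = prod_i (1 + sigma_i <sigma_i>) / 2
    m = sigma.mean(axis=0)
    P1 = np.prod((1 + states * m) / 2, axis=1)
    support = PN > 0                       # KL terms with P_N(V) = 0 vanish
    D1 = np.sum(PN[support] * np.log(PN[support] / P1[support]))
    D2 = np.sum(PN[support] * np.log(PN[support] / P2[support]))
    return (D1 - D2) / D1
```

As a sanity check, passing the empirical distribution itself as P2 gives r_D = 1, and passing the independent model gives r_D = 0.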

Bayesian approximation method
The pairwise MEM and the subsequent energy landscape analysis have mostly been restricted to analysis of group-level data. This is because the method in its original form is data-hungry, requiring concatenation of fMRI signals from different individuals. If there are sufficiently many or sufficiently long sessions of fMRI data from a single participant, as in the present study, one can concatenate data from the same participant only and thus avoid group-level energy landscape analysis. However, fMRI data from a single participant are more often than not too short to allow individual-level energy landscape analysis. In addition, the length of fMRI data, $t_{\max}$, that is necessary for reliably estimating the pairwise MEM with $N$ nodes is roughly proportional to the number of states, $2^N$ [38]. To overcome this problem and obtain the energy landscape for each individual, we employed a recently developed variational Bayes approximation method for estimating the pairwise MEM [40,70], which runs as follows.
We denote by $S_n$ the $N$-dimensional time series obtained from the $n$th session of fMRI. Different fMRI sessions typically originate from different participants in the same group (e.g., control group). We denote the number of sessions available by $D$. Let $S$ be the concatenated data, i.e.,

$S \equiv \bigcup_{n=1}^{D} S_n$.
The variational Bayes approximation method estimates a pairwise MEM for each $S_n$ (with $n \in \{1, \ldots, D\}$). The method introduces a prior distribution for the set of session-specific model parameters, $\theta_n = (h_1, h_2, \ldots, h_N, J_{12}, J_{13}, \ldots, J_{N-1,N}) \in \mathbb{R}^M$, where $n \in \{1, \ldots, D\}$ and $M = N(N+1)/2$. We give the prior distribution for $\Theta \equiv \{\theta_1, \ldots, \theta_D\}$ by

$p(\Theta \mid \eta, \alpha) = \prod_{n=1}^{D} \prod_{M'=1}^{M} p\left(\theta_{n,M'} \mid \mathcal{N}(\eta_{M'}, \alpha_{M'}^{-1})\right)$,

where $p(x \mid \mathcal{N}(\mu, \sigma^2))$ represents the probability density of $x$ obeying the one-dimensional normal distribution with mean and variance equal to $\mu$ and $\sigma^2$, respectively. Here, $\eta = (\eta_1, \ldots, \eta_M)^\top \in \mathbb{R}^M$ is the prior mean vector, $\alpha = (\alpha_1, \ldots, \alpha_M)^\top \in \mathbb{R}_+^M$ is the prior precision vector, and $\top$ represents the transposition. We have assumed that the signals from all the $D$ sessions are mutually independent. Now, we derive the posterior distribution of $\Theta$. It is intractable to derive the posterior exactly because the normal distribution is not the conjugate prior for the Boltzmann distribution. Therefore, we use a variational approximation to the posterior [71] using the normal distribution:

$q(\Theta) = \prod_{n=1}^{D} \prod_{M'=1}^{M} p\left(\theta_{n,M'} \mid \mathcal{N}(\lambda_{n,M'}, \beta_{n,M'}^{-1})\right)$,

where $\lambda_n = (\lambda_{n,1}, \ldots, \lambda_{n,M})^\top \in \mathbb{R}^M$ and $\beta_n = (\beta_{n,1}, \ldots, \beta_{n,M})^\top \in \mathbb{R}_+^M$ are the posterior mean vector and the posterior precision vector for session $n \in \{1, \ldots, D\}$, respectively. One obtains the variational approximate solution for distribution $q$ by optimizing the evidence lower bound (ELBO), also called the free energy [40,70]. Maximizing the free energy with respect to $q$ expresses the posterior mean and precision vectors in terms of the prior mean and precision vectors, the vector $\tilde{\sigma}_n$ composed of the empirical mean activity and empirical pairwise joint activation in the $n$th session, the model mean $\langle \sigma \rangle_\eta$ of $\tilde{\sigma}_n$ under parameters $\eta$, and the covariance matrix $C_\eta$ of $\tilde{\sigma}_n$ under parameters $\eta$; $\mathrm{diag}(\cdot)$ represents the diagonal matrix whose entries are given by the arguments, and $c_\eta$ denotes the vector composed of the diagonal elements of $C_\eta$, i.e., the $i$th element of $c_\eta$ is the variance of the $i$th element of $\tilde{\sigma}_n$ under parameters $\eta$.
Now, we fix $q$ and maximize the free energy with respect to $\eta$ and $\alpha$, which yields the updating equations

$\eta_{M'} = \frac{1}{D} \sum_{n=1}^{D} \lambda_{n,M'}$

and

$\alpha_{M'}^{-1} = \frac{1}{D} \sum_{n=1}^{D} \left[ \beta_{n,M'}^{-1} + \left( \lambda_{n,M'} - \eta_{M'} \right)^2 \right]$,

where $M' \in \{1, \ldots, M\}$.
Thus, we update the posterior distribution $q$ given the current prior parameters and then update the prior parameters using the new posterior distribution. We summarize the steps of the variational Bayes approximation method as follows: 1. Initialize the hyperparameters by independently drawing each $\eta_{M'}$ (with $M' \in \{1, \ldots, M\}$) from the normal distribution with mean 0 and standard deviation 0.1. We also set the first $N$ entries of the prior precision vector, $\alpha_{M'}$, corresponding to $h_i$, $i \in \{1, \ldots, N\}$, to 6, and set the remaining

Energy landscape and disconnectivity graph
Once we had estimated the pairwise MEM, we calculated the energy landscape [36-38,42]. The energy landscape is defined as a network with $2^N$ nodes in which each node is an activity pattern. We first constructed a dendrogram called the disconnectivity graph. We show a hypothetical disconnectivity graph in Fig. 1. See Appendix A for visualization of a real disconnectivity graph and its randomized counterpart. In the disconnectivity graph, a leaf corresponds to an activity pattern $V_k$ that is a local minimum of the energy. There are four local minima in the disconnectivity graph shown in Fig. 1. The vertical position of a leaf represents the energy value of the local minimum. A low energy value corresponds to a high frequency of appearance through Eq. (5). For example, in Fig. 1, activity pattern $\gamma_1$ is the one that appears with the highest frequency among all the $2^N$ activity patterns.
By definition, activity pattern $V_k$ is a local minimum of the energy if and only if $V_k$ appears more frequently (and thus has a lower energy) than any activity pattern adjacent to $V_k$. Two activity patterns are defined to be adjacent in the network of activity patterns if and only if they have the opposite activity $\sigma_i \in \{-1, 1\}$ for just one $i$. Note that the network of activity patterns is the hypercube composed of $2^N$ nodes in which each node, representing an activity pattern, is adjacent to $N$ other nodes. To obtain the disconnectivity graph, we first enumerate the local minima. Then, for each pair of local minima $\gamma$ and $\gamma'$, we determine the smallest energy value $E_{\rm th}$ that a path connecting $\gamma$ and $\gamma'$ needs to go through, as follows. There may be various paths connecting $\gamma$ and $\gamma'$. We sequentially remove nodes in descending order of the energy until there is no path connecting $\gamma$ and $\gamma'$. The energy of the node that we removed last is the $E_{\rm th}$ value for $\gamma$ and $\gamma'$. The horizontal dashed line in Fig. 1 indicates the $E_{\rm th}$ value ($= -0.69$) for the pair of local minima $\gamma_1$ and $\gamma_2$. The difference between $E_{\rm th}$ and the energy at a local minimum represents the energy barrier that the dynamics of the brain have to overcome to move from that local minimum to the other. In Fig. 1, the energy barrier between $\gamma_1$ and $\gamma_2$ from the viewpoint of $\gamma_1$ is 0.64, which is indicated by the double-headed arrow. The disconnectivity graph shows the $E_{\rm th}$ and energy barrier values for all pairs of the local minima.
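The enumeration of local minima and the computation of E_th can be sketched as follows. Instead of removing nodes in descending order of energy, the sketch adds nodes in ascending order with a union-find structure, which yields the same E_th; indexing each pattern by the integer whose bits encode the ROI activities makes neighbors differ in exactly one bit. All names are ours:

```python
import numpy as np

def local_minima(E, N):
    """Indices of activity patterns whose energy is lower than all N neighbors.

    E : array of length 2^N with the energy of each pattern; pattern k's
        neighbors are obtained by flipping one bit of k (one ROI's activity).
    """
    minima = []
    for k in range(2 ** N):
        neighbors = [k ^ (1 << b) for b in range(N)]
        if all(E[k] < E[n] for n in neighbors):
            minima.append(k)
    return minima

def energy_threshold(E, N, g1, g2):
    """Smallest peak energy E_th over all paths between local minima g1 and g2.

    Nodes are added in ascending order of energy; E_th is the energy at which
    g1 and g2 first become connected, which equals the result of removing
    nodes in descending order until the pair disconnects.
    """
    parent = list(range(2 ** N))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    active = set()
    for k in np.argsort(E):
        active.add(k)
        for b in range(N):
            n = k ^ (1 << b)
            if n in active:
                parent[find(n)] = find(k)
        if find(g1) == find(g2):
            return E[k]
    return None
```

For example, with N = 2 and energies E = [0.0, 2.0, 3.0, 1.0], patterns 0 and 3 are local minima, and the lowest path between them passes through pattern 1, so E_th = 2.0.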

Measures of discrepancy
To assess within-participant test-retest reliability of energy landscape analysis, we compared two energy landscapes that we separately estimated for two sets of fMRI data, which were either different sessions from the same participant or sessions obtained from different participants. We made within-participant versus between-participant comparisons because successful individual fingerprinting requires sufficiently high within-participant test-retest reliability, and assessing this requires a baseline. Higher within-participant than between-participant test-retest reliability implies that the energy landscape analysis provides reliable fingerprints of individuals. To analyze test-retest reliability, we measured the following four indices of discrepancy between two energy landscapes.

Discrepancy in terms of the interaction strength
The energy landscape is primarily a function of $\{J_{ij}\}_{i,j \in \{1, \ldots, N\}}$ because $\{h_1, \ldots, h_N\}$ tend to take values close to 0 if we set the threshold for binarizing $z_i^t$ such that the fractions of $\sigma_i = -1$ and $\sigma_i = 1$ are not heavily imbalanced [41]. Therefore, we measured the discrepancy between two energy landscapes in terms of the estimated $\{J_{ij}\}$. We define the discrepancy using the Frobenius distance as follows:

$d_J = \left\| J^{(1)} - J^{(2)} \right\|_F = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( J^{(1)}_{ij} - J^{(2)}_{ij} \right)^2}$,

where $J^{(1)} = (J^{(1)}_{ij})$ and $J^{(2)} = (J^{(2)}_{ij})$ denote the pairwise interaction matrices of the pairwise MEM estimated for the first and second data sets, respectively.
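As a minimal sketch, the Frobenius distance between two estimated interaction matrices is a one-liner (the function name is ours):

```python
import numpy as np

def d_J(J1, J2):
    """Frobenius distance between two pairwise interaction matrices."""
    return np.linalg.norm(J1 - J2, ord='fro')

print(d_J(np.eye(2), np.zeros((2, 2))))  # sqrt(2) ~ 1.4142
```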

Discrepancy in terms of the activity patterns at the local minima of the energy
A local minimum of the energy landscape is locally the most frequent activity pattern.We compared the location of the local minima in the two energy landscapes by calculating the Hamming distance between the activity patterns at the local minima from the first energy landscape and those from the second energy landscape as follows.
First, we assumed that minor local minima characterized by low energy barriers with other local minima did not play important roles because the brain state would stay near such shallow local minima only briefly. Therefore, we started by removing minor local minima of the energy as follows. We generated $N$ random binary time series of length $4 t_{\max}$ by independently drawing the $N \times 4 t_{\max}$ binary numbers, i.e., $-1$ or $+1$, with the same probability (i.e., 0.5). The multiplication factor was set at 4 because we mainly analyzed energy landscapes of the empirical fMRI data with $t_{\max}$ volumes that were concatenated over four sessions. Then, we inferred the pairwise MEM for the generated random binary time series and calculated the maximum length of the branch in the disconnectivity graph. A branch corresponds to a local minimum of the energy landscape. We define the branch length for local minimum $\gamma$ by the smallest value of the energy barrier between $\gamma$ and another local minimum $\gamma'$ among all local minima $\gamma'$ ($\neq \gamma$). In the disconnectivity graph shown in Fig. 1, the branch length for $\gamma_1$ is the length of the arrow. We claim that the energy landscape estimated for the random binary time series, including the number and depth of its local minima, does not have functional meanings. Therefore, in an energy landscape estimated for the empirical data, the local minima whose branch length is comparable with the maximum branch length for the random binary time series are not important.
To implement this idea, we generated a random binary time series, inferred the energy landscape, computed its maximum branch length, and repeated all these steps 100 times. We denote the average and standard deviation of the maximum branch length over the 100 random binary time series by $\mu'$ and $\sigma'$, respectively. We then identified the local minimum with the shortest branch length in the original disconnectivity graph. We removed that local minimum as being insignificant if its branch was shorter than $\mu' + 2\sigma'$. If we removed this local minimum, we recomputed the branch length of each local minimum whose branch had merged with the removed branch. Then, if the shortest branch was shorter than $\mu' + 2\sigma'$, we removed the corresponding local minimum, and we repeated these steps until all the local minima had branches whose length was at least $\mu' + 2\sigma'$. We refer to the local minima that survive this test as major local minima.
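The iterative pruning can be sketched as follows, given the energies of the local minima and their pairwise E_th values (which do not change when a local minimum is dropped, since E_th depends on the full landscape). For simplicity, the sketch recomputes all branch lengths after each removal, which gives the same result as recomputing only the merged branches. Names are ours:

```python
import numpy as np

def prune_minor_minima(E_min, E_th, threshold):
    """Iteratively remove local minima with branch length below `threshold`.

    E_min : (m,) energies of the local minima.
    E_th  : (m, m) matrix of pairwise threshold energies; E_th[a, b] is the
            lowest peak energy a path from minimum a to minimum b must cross.
    threshold : the null-model value mu' + 2 sigma'.
    Returns the indices of the surviving (major) local minima.
    """
    keep = list(range(len(E_min)))
    while len(keep) > 1:
        # Branch length of each remaining minimum: smallest barrier to any other
        lengths = [min(E_th[a, b] for b in keep if b != a) - E_min[a] for a in keep]
        shortest = int(np.argmin(lengths))
        if lengths[shortest] >= threshold:
            break
        del keep[shortest]
    return keep
```

For example, with minima energies [0.0, 0.9, 0.1] and E_th values of 1.0 between the first two pairs and 1.2 between the first and third, the shallow middle minimum (branch length 0.1) is pruned at threshold 0.5, while the other two survive.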
We denote the activity patterns at the major local minima of the first energy landscape by $\tilde{V}^{(1)}_1, \ldots, \tilde{V}^{(1)}_{m_1}$, where $m_1$ is the number of the major local minima in the first energy landscape. Similarly, we denote the activity patterns at the major local minima of the second energy landscape by $\tilde{V}^{(2)}_1, \ldots, \tilde{V}^{(2)}_{m_2}$. To compare the two energy landscapes, we need to match the major local minima between them. To this end, we assume without loss of generality that $m_1 \le m_2$ and pair each $\tilde{V}^{(1)}_\ell$ with a distinct $\tilde{V}^{(2)}_{\rho(\ell)}$, where $\rho$ is an injective map from $\{1, \ldots, m_1\}$ to $\{1, \ldots, m_2\}$. Note that $m_2 - m_1$ major local minima in the second energy landscape are not matched to any major local minimum in the first energy landscape. We quantify the quality of a matching $\rho$ by

$d_H = \frac{1}{m_1} \sum_{\ell=1}^{m_1} d'_H \left( \tilde{V}^{(1)}_\ell, \tilde{V}^{(2)}_{\rho(\ell)} \right)$,

where $\tilde{V}^{(2)}_{\rho(\ell)}$ is the activity pattern at the major local minimum paired with $\tilde{V}^{(1)}_\ell$ in the considered matching, and $d'_H$ is the Hamming distance between the $N$-dimensional binary vectors $\tilde{V}^{(1)}_\ell$ and $\tilde{V}^{(2)}_{\rho(\ell)}$, i.e., the number of ROIs whose binary activity (i.e., $-1$ or $+1$) differs between $\tilde{V}^{(1)}_\ell$ and $\tilde{V}^{(2)}_{\rho(\ell)}$. We calculate $d_H$ for all the possible matchings (for example, $m_1 = 3$ and $m_2 = 4$ yield 24 possible matchings) and select the one that minimizes $d_H$, which we simply refer to as $d_H$ hereafter. A small $d_H$ value implies that the two energy landscapes are similar in terms of the activity patterns at the local minima of the energy.
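Minimizing d_H over all injective matchings can be sketched with brute-force enumeration, which is feasible because the number of major local minima is small (e.g., m_1 = 3 and m_2 = 4 give 24 matchings). Names are ours:

```python
import itertools
import numpy as np

def match_minima_hamming(V1, V2):
    """Minimum-average-Hamming matching between two sets of local minima.

    V1 : (m1, N) activity patterns (entries in {-1, +1}) of the first landscape.
    V2 : (m2, N) patterns of the second landscape, with m1 <= m2.
    Tries every injective matching of the m1 patterns into the m2 patterns and
    returns the smallest average Hamming distance d_H and the best matching.
    """
    m1, m2 = len(V1), len(V2)
    best_d, best_rho = np.inf, None
    for rho in itertools.permutations(range(m2), m1):
        d = np.mean([np.sum(V1[l] != V2[rho[l]]) for l in range(m1)])
        if d < best_d:
            best_d, best_rho = d, rho
    return best_d, best_rho
```

Brute force is adequate here; for larger numbers of minima, the same minimization could be solved as a linear assignment problem.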

Discrepancy in terms of the activity patterns averaged over the attractive basin
Brain dynamics tend to visit local minima of the energy landscape but also fluctuate around them. Therefore, we additionally measured a distance between the two energy landscapes in terms of the activity patterns averaged over the attractive basins of the local minima, as follows.
Consider a major local minimum of the first energy landscape, $\tilde{V}^{(1)}_\ell$. The attractive basin of $\tilde{V}^{(1)}_\ell$ is a set of activity patterns. By definition, $V$ is in the attractive basin of $\tilde{V}^{(1)}_\ell$ if and only if the gradient-descent walk starting from $V$ eventually reaches $\tilde{V}^{(1)}_\ell$. The gradient-descent walk on the set of activity patterns is defined by a series of moves from one activity pattern to another such that the move from $V$ is allowed only to the neighbor of $V$ that attains the smallest energy (i.e., the largest probability of appearance) among all the neighbors of $V$. Intuitively, if we release a ball at $V$, the ball following the gradient moves on the energy landscape until it reaches $\tilde{V}^{(1)}_\ell$ and stops there if there is no dynamical noise. We calculate the average of the activity patterns within the attractive basin of $\tilde{V}^{(1)}_\ell$, which we denote by $u^{(1)}_\ell$. Note that $u^{(1)}_\ell$ is an $N$-dimensional vector, which we assume to be a column vector, whose $i$th entry is the average of $\sigma_i \in \{-1, 1\}$ over all the activity patterns in the attractive basin of $\tilde{V}^{(1)}_\ell$. Similarly, we denote by $u^{(2)}_{\ell'}$ the average of the activity patterns within the attractive basin of $\tilde{V}^{(2)}_{\ell'}$ in the second energy landscape. We define the discrepancy between a pair of basins by

$d'_{\rm basin}\left(u^{(1)}_\ell, u^{(2)}_{\ell'}\right) = 1 - \frac{u^{(1)}_\ell \cdot u^{(2)}_{\ell'}}{\left\| u^{(1)}_\ell \right\| \left\| u^{(2)}_{\ell'} \right\|}$,

where $\| \cdot \|$ denotes the Euclidean norm of the vector. The $d'_{\rm basin}$ value ranges between 0 and 2. A small value of $d'_{\rm basin}$ indicates a stronger alignment between $u^{(1)}_\ell$ and $u^{(2)}_{\ell'}$. For a given matching $\rho$, we then define

$d_{\rm basin} = \frac{1}{m_1} \sum_{\ell=1}^{m_1} d'_{\rm basin}\left(u^{(1)}_\ell, u^{(2)}_{\rho(\ell)}\right)$,

which quantifies the overall discrepancy between the two energy landscapes in terms of the average activity pattern in the attractive basin of each local minimum. We calculate $d_{\rm basin}$ for all the possible matchings between $\{\tilde{V}^{(1)}_1, \ldots, \tilde{V}^{(1)}_{m_1}\}$ and $\{\tilde{V}^{(2)}_1, \ldots, \tilde{V}^{(2)}_{m_2}\}$ and adopt the smallest value, which we also refer to as $d_{\rm basin}$ for simplicity. In a majority of cases, the best matching determined by the minimization of $d_H$ and that determined by the minimization of $d_{\rm basin}$ are the same. However, they are sometimes different from each other.
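The basin computation can be sketched as follows, with d'_basin taken to be one minus the cosine similarity, a form consistent with the stated range of 0 to 2 but an assumption on our part. Names are ours:

```python
import numpy as np

def basin_of(E, N, k):
    """Follow the gradient-descent walk from pattern k to its local minimum."""
    while True:
        neighbors = [k ^ (1 << b) for b in range(N)]
        best = min(neighbors, key=lambda n: E[n])
        if E[best] >= E[k]:
            return k          # k is itself a local minimum
        k = best

def basin_average(E, N, minimum):
    """Average activity pattern over the attractive basin of `minimum`.

    Returns the N-vector u whose i-th entry is the mean of sigma_i over all
    patterns whose gradient-descent walk ends at `minimum`.
    """
    states = np.array([[1 if (k >> (N - 1 - i)) & 1 else -1 for i in range(N)]
                       for k in range(2 ** N)])
    members = [k for k in range(2 ** N) if basin_of(E, N, k) == minimum]
    return states[members].mean(axis=0)

def d_basin_pair(u1, u2):
    """Cosine-type discrepancy in [0, 2] between two basin-average patterns."""
    return 1 - (u1 @ u2) / (np.linalg.norm(u1) * np.linalg.norm(u2))
```

In the N = 2 toy landscape E = [0.0, 2.0, 3.0, 1.0], three of the four patterns flow into the minimum with energy 0.0, and the two basin averages point in opposite directions, so their discrepancy attains the maximum value 2.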

Discrepancy in terms of the branch length
As a fourth measure to characterize energy landscapes, we quantified the ease with which the activity pattern switches from one major local minimum to another. We call it the normalized branch length. Then, we compared the normalized branch length between two energy landscapes.
We compute the normalized branch length as follows. We first calculate the length of the branch corresponding to each major local minimum γ as the difference between the energy value of γ and the smallest energy value at which γ joins the branch of another major local minimum on the disconnectivity graph. The calculated branch length quantifies the difficulty of transitioning from γ to another local minimum.
We assume that there are m_1 and m_2 major local minima from the first and second energy landscapes, respectively. We denote by L^(1) and L^(2) the average of the branch length over the m_1 corresponding branches in the first energy landscape and over the m_2 branches in the second energy landscape, respectively. Then, we define the normalized branch length difference between the two energy landscapes by

d_L = |L^(1) − L^(2)| / max(L^(1), L^(2)).
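Under this definition, the normalized branch length difference takes only a few lines to compute (a sketch, assuming the branch lengths have already been read off the disconnectivity graph; the absolute value makes d_L a symmetric, nonnegative discrepancy):

```python
import numpy as np

def d_L(branch_lengths_1, branch_lengths_2):
    # Each argument lists the branch lengths of the major local minima of
    # one energy landscape; d_L compares the two averages, normalized by
    # the larger of the two so that d_L lies in [0, 1].
    L1 = np.mean(branch_lengths_1)
    L2 = np.mean(branch_lengths_2)
    return abs(L1 - L2) / max(L1, L2)
```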

Nonparametric statistical analysis
We examine whether the energy landscapes estimated from different fMRI data from the same participants are more similar to each other than the energy landscapes estimated for two different groups of participants. We argue that, if the energy landscape analysis is useful, the energy landscapes estimated from the same participants should be closer to each other than the energy landscapes estimated from different participants.
First, we consider the MSC data with the conventional likelihood maximization method, for which we need to concatenate fMRI data over sessions to estimate one energy landscape with a reasonably high accuracy. For expository purposes, we consider one of the four discrepancy measures, say, d_J. We also focus on the pth participant. We first calculate d_J between J^(1) and J^(2), where we estimate J^(1) from the fMRI data concatenated over four sessions that are uniformly randomly selected out of the ten sessions, s ∈ {1, . . ., 10}, and J^(2) from the fMRI data concatenated over another uniformly randomly selected four sessions. We impose that the second set of four sessions does not overlap with the first set. Note that we use eight out of the ten sessions, selected at random, to calculate one d_J value. We repeat this procedure ten times to obtain ten values of d_J for the pth participant. By calculating ten values of d_J for each of the eight participants, i.e., p ∈ {1, . . ., 7, 9}, we obtain 8 × 10 = 80 values of d_J. We denote the average of the 80 values of d_J by d_1 (see Fig. 3(a)).
Next, we calculate d_J, with J^(1) being estimated from the fMRI data concatenated over the sth sessions of four participants that are uniformly randomly selected out of the eight participants, and J^(2) from the fMRI data concatenated over the sth sessions of the other four participants. We repeat this procedure ten times to obtain ten values of d_J for the sth session. By calculating the ten values of d_J for each of the ten sessions, i.e., s ∈ {1, . . ., 10}, we obtain 10 × 10 = 100 values of d_J. We denote the average of the 100 values of d_J by d_2 (see Fig. 3(a)).
Second, we consider the case in which we do not need to concatenate fMRI data before estimating an energy landscape. We again consider d_J as an example. We first calculate d_J between J^(1) and J^(2), where J^(1) and J^(2) are estimated from sessions s and s′ (≠ s), respectively, of a participant p. It should be noted that s, s′ ∈ {1, . . ., 10} and p ∈ {1, . . ., 7, 9} for the MSC data, and s, s′ ∈ {1, . . ., 4} and p ∈ {1, . . ., 87} for the HCP data. By exhausting all pairs (s, s′), we compute n_1 values of d_J, where n_1 = 10 × 9/2 = 45 and n_1 = 4 × 3/2 = 6 for the MSC and HCP data, respectively, for each participant. We denote by d_1 the average of the 8n_1 or 87n_1 values of d_J for the MSC or HCP data, respectively, which originate from all pairs (s, s′) and all participants.
Next, we calculate d_J, with J^(1) and J^(2) being estimated from the sth session for two different participants p and p′ (≠ p), where p, p′ ∈ {1, . . ., 7, 9} for the MSC data and p, p′ ∈ {1, . . ., 87} for the HCP data. By exhausting all pairs of participants (p, p′), we compute n_2 values of d_J, where n_2 = 8 × 7/2 = 28 and n_2 = 87 × 86/2 = 3741 for the MSC and HCP data, respectively, for each session. We denote by d_2 the average of the 10n_2 or 4n_2 values of d_J for the MSC or HCP data, respectively.
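For this non-concatenated case, the exhaustive within-participant and between-participant averaging can be sketched as follows (a generic sketch; `dist` stands for any of the four discrepancy measures, such as d_J, and `J_est` is assumed to map each (participant, session) pair to its estimated quantity):

```python
import itertools
import numpy as np

def within_between(J_est, participants, sessions, dist):
    # d1: average discrepancy over all session pairs (s, s') within each
    # participant (within-participant comparison).
    d1 = np.mean([dist(J_est[p, s], J_est[p, t])
                  for p in participants
                  for s, t in itertools.combinations(sessions, 2)])
    # d2: average discrepancy over all participant pairs (p, p') within
    # each session (between-participant comparison).
    d2 = np.mean([dist(J_est[p, s], J_est[q, s])
                  for s in sessions
                  for p, q in itertools.combinations(participants, 2)])
    return d1, d2
```

With perfectly reproducible within-participant estimates, d_1 is 0 while d_2 reflects the participant-to-participant differences.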
We define ND ≡ d_2/d_1; ND is named after normalized distance [72,73]. If energy landscapes are more similar between different sets of sessions from the same participant (i.e., within-participant comparison) than between those from different participants (i.e., between-participant comparison), the ND value will be larger than 1. In this case, we regard the energy landscape analysis as bearing high within-participant test-retest reliability. In contrast, if the energy landscape from the same participant is not particularly reliable across sessions, the ND will be close to 1.
To statistically examine whether ND is sufficiently larger than 1, we run a nonparametric permutation test, which is an adaptation of the test used in previous studies [72,73]. The steps of the permutation test based on the ND are as follows. Here we use d_J to explain the steps. See Fig. 3(b) for a schematic of the randomized MSC data.
1. Consider the binarized N-dimensional fMRI time series data for each of the eight participants and each of the ten sessions.

2. Uniformly randomly permute the 80 participant-session pairs in the case of the MSC data or the 348 (= 87 × 4) participant-session pairs in the case of the HCP data. After the randomization, the fMRI data for the sth session from the pth participant are the fMRI data for a uniformly randomly selected session from a uniformly randomly selected participant, drawn without replacement.

3. Calculate ND for the randomized data. For the combination of the MSC data and the conventional likelihood maximization method, this step entails concatenating the fMRI data over four random sessions from the same participant p or over the sth sessions from four random participants, estimating the energy landscapes for the concatenated data, comparing two energy landscapes to calculate d_J, repeating this 80 times to obtain d_1 and 100 times to obtain d_2, and computing ND = d_2/d_1. When the data are not concatenated (i.e., the MSC data with the variational Bayes method or the HCP data with the conventional method), we calculate d_J for each pair of sessions from the same participant p and take the average to obtain d_1. We also calculate d_J for each pair of participants for each session s and take the average to obtain d_2. Then, we set ND = d_2/d_1.

4. Repeat steps 2 and 3 over c random permutations, where c is a large number. We set c = 10³.
5. Calculate the p value, which is the fraction of the random permutations that yield an ND value larger than that for the original data.

6. If the p value is sufficiently small, then we reject the null hypothesis that d_1 = d_2 in the original data. In this case, we conclude the significant presence of within-participant test-retest reliability in the energy landscape analysis.
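Steps 1-6 can be sketched as follows (a schematic implementation; `compute_nd` is a placeholder for the data-set-specific procedure of step 3, which estimates the energy landscapes and returns ND for a given assignment of time series to participant-session pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(data, participants, sessions, compute_nd, c=1000):
    # data maps (participant, session) to a binarized fMRI time series.
    keys = [(p, s) for p in participants for s in sessions]
    nd_obs = compute_nd({k: data[k] for k in keys})
    count = 0
    for _ in range(c):
        # Step 2: shuffle the participant-session pairs without replacement.
        shuffled = rng.permutation(len(keys))
        permuted = {keys[i]: data[keys[j]] for i, j in enumerate(shuffled)}
        # Steps 3-5: count permutations whose ND exceeds the observed ND.
        if compute_nd(permuted) > nd_obs:
            count += 1
    p_value = count / c
    return nd_obs, p_value
```

As a toy usage example, a hypothetical `compute_nd` that measures within- and between-participant absolute differences of scalar "data" yields a large observed ND and a small p value when the data are strongly participant-specific:

```python
def compute_nd(d):
    d1 = np.mean([abs(d[p, 0] - d[p, 1]) for p in [0, 1]])
    d2 = np.mean([abs(d[0, s] - d[1, s]) for s in [0, 1]])
    return d2 / d1

data = {(p, s): 100 * p + s for p in [0, 1] for s in [0, 1]}
nd_obs, p_value = permutation_test(data, [0, 1], [0, 1], compute_nd, c=200)
```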
In step 3, we calculate d_1 and d_2 as the within-participant and between-participant averages, respectively. However, for the randomized data, the two quantities are statistically the same except that d_1 and d_2 are averages of 80 and 100 values of d_J, respectively (for the case of the MSC data combined with the conventional likelihood maximization method). This is because d_J, whether calculated for d_1 or d_2, originates from the comparison of the energy landscape estimated from four sessions uniformly randomly selected out of the 80 sessions and that estimated from another uniformly randomly selected four sessions without overlap. Therefore, d_1 and d_2 have the same mean, and ND is expected to peak at approximately 1. The present permutation test thus evaluates whether the reliability of the energy landscape analysis across sessions for the same participant is higher than that across sessions for different participants.

Accuracy of fit of the pairwise MEM
We extracted N ROIs for each of three brain networks, i.e., the whole-brain network (N = 7), the DMN (N = 8), and the CON (N = 7). For each of them, we estimated the pairwise MEM for the resting-state fMRI signals obtained from healthy adults in the MSC data set.
We calculated r_D, the accuracy of fit of the pairwise MEM, for each pair of participant and session. We obtained r_D = 69.12 ± 6.41% (average ± standard deviation) for the whole-brain network, r_D = 57.97 ± 8.94% for the DMN, and r_D = 77.65 ± 5.41% for the CON (also see Table 1).
Because the accuracy of fit is not high enough, we concatenated the data across participants or across sessions, estimated the pairwise MEM, and calculated r_D, as is customarily done [37,38]. Specifically, we concatenated the fMRI data across m sessions, where m ∈ {2, 3, 4, 5}. The m sessions are either from the same participant but from m different sessions, or have the same session ID (i.e., s) but are from m different participants. We show in Table 1 the average and standard deviation of r_D for the three networks when we concatenated m ∈ {2, 3, 4, 5} sessions from the same participant. Table 2 shows the r_D values when we concatenated m sessions from different participants. In both Tables 1 and 2, as expected, r_D increases as m increases (β_1 = 3.09 in Eq. (12); p = 4.60 × 10⁻⁹). Furthermore, r_D is larger with the within-participant than the between-participant concatenation (β_4 = 3.51 in Eq. (12); p = 6.04 × 10⁻⁵). The latter result indicates that the energy landscape estimated through the within-participant concatenation of the fMRI data is more accurate than that estimated through the between-participant concatenation in terms of the accuracy of fit of the pairwise MEM. In both Tables 1 and 2, the accuracy for the DMN is substantially lower than that for the whole-brain network (β_2 = 10.34 in Eq. (12); p = 1.63 × 10⁻¹⁰) and the CON (β_3 = 14.31 in Eq. (12); p = 5.54 × 10⁻¹³). This is presumably because the DMN has one more ROI than the whole-brain network and the CON; the accuracy generally decreases as the number of ROIs increases [38].
In the following analyses, we use concatenation over m = 4 sessions and examine the test-retest reliability of the energy landscape analysis. Figure 3(a) schematically explains the concatenation within each participant and that across participants. With m = 4, the accuracy of fit is more than 85% except for the DMN. In general, we are also interested in the test-retest reliability of fMRI data in the case of a relatively low accuracy of fit, which we test with the DMN. A concatenation over more sessions, such as with m = 5, would further increase the accuracy of fit (see Tables 1 and 2). However, examining test-retest reliability would then be more difficult because one needs to create two energy landscapes, preferably from non-overlapping data, and systematically compare them. In the present study, we use data obtained from eight participants. Therefore, with m = 5, one cannot avoid overlap of the participants when creating two groups of participants for concatenating the fMRI data. Our choice of m = 4 balances the accuracy of fit and the tractability of the test-retest reliability analysis.
Figure 4: Reliability of the interaction strength between two ROIs, J_ij, for the whole-brain network. (a) Within-participant comparison. We concatenated the data for the first participant over four sessions. The horizontal and vertical axes correspond to the concatenation of sessions 1 to 4 and sessions 5 to 8, respectively. Each circle represents a pair of i and j. (b) Between-participant comparison. We concatenated the data from the first session over four participants. The horizontal and vertical axes correspond to the concatenation of the first and last four participants, respectively. In both (a) and (b), if all the circles lay on the diagonal, which we show by the solid lines, then the discrepancy, d_J, would be equal to 0. The d_J value is large if the circles tend to be far from the diagonal.

Reliability in terms of the interaction strength
We first examined the test-retest reliability of the energy landscape analysis in terms of the interaction strength parameters {J_ij}. We concatenated the fMRI data over the first four sessions from the pth participant and estimated {J_ij} for each p ∈ {1, 2, 3, 4, 5, 6, 7, 9}. Similarly, for each participant p, we concatenated the data over the next four sessions (i.e., sessions 5 to 8) and estimated {J_ij}. We show the relationships between J_ij estimated for the first four sessions against that estimated for the next four sessions for the first participant in Fig. 4(a). Each circle represents J_ij for a pair of i and j. The values of {J_ij} are reasonably consistent between the first four sessions and the next four sessions (Pearson correlation coefficient = 0.850; discrepancy d_J = 0.0428).

Table 1: Accuracy of fit of the pairwise MEM when we concatenate fMRI data within the same participant. Each cell represents the accuracy of fit in percent when we concatenate the fMRI data across sessions from the same participant. We concatenated data from a given participant over m sessions and then fitted the pairwise MEM to the concatenated data. With m = 2, we partitioned the 10 sessions into 5 groups as (1,2), (3,4), (5,6), (7,8), and (9,10), concatenated the fMRI data within each group and within each participant, estimated the energy landscape, and computed the accuracy of fit, r_D. For example, we concatenated the data from the first two scanning sessions from participant 1, estimated the energy landscape, and computed r_D. We did the same for data from the third and fourth sessions from participant 1, the first and second sessions from participant 2, and so on.
Next, we concatenated the data for a single session over the first four participants (i.e., p = 1, 2, 3, and 4) and estimated {J_ij}, did the same for the last four participants (i.e., p = 5, 6, 7, and 9), and compared the two obtained sets of {J_ij}. In this manner, we investigated the consistency of the energy landscape between participants. For the whole-brain network, we show the relationships between {J_ij} for the two sets of participants in the first session in Fig. 4(b). Similar to the case of Fig. 4(a), the estimated {J_ij} were reasonably consistent between the two concatenations, in agreement with previous results with other data [42,67,74]. However, the degree of consistency was lower for the present between-participant comparison (Pearson correlation coefficient = 0.773; discrepancy d_J = 0.0493) than for the within-participant comparison. In this particular example, the estimation of {J_ij} was more consistent between pairs of sessions from the same participant than between those from different participants.
To examine the generality of this result, we calculated d_J between the concatenation across sessions 1 to 4 and that across sessions 5 to 8 from the same participant (i.e., within-participant comparison). The mean and standard deviation of d_J over the eight participants were equal to d_J = 0.0464 ± 0.0082 for the whole-brain network. We also calculated d_J between the concatenation of the sth session over the first four participants and that over the last four participants (i.e., between-participant comparison). The mean and standard deviation of d_J for the between-participant comparison over the ten sessions were equal to d_J = 0.0527 ± 0.0098. We show these d_J values and those for the DMN and CON in Table 3. The table suggests that the energy landscape is somewhat more similar between different fMRI sessions obtained from the same participant than between different participants.

Table 2: Accuracy of fit of the pairwise MEM when we concatenate fMRI data across participants. We concatenated data from a given session over m participants and then fitted the pairwise MEM to the concatenated data. With m = 2, we concatenated the data for the sth session from participants p = 1 and 2 into one time series, those from participants p = 3 and 4 into another series, those from participants p = 5 and 6 into another series, and those from participants p = 7 and 9 into another series. We did this for each s. With m = 3, we concatenated the data from participants p = 1, 2, and 3 into one series and those from participants p = 4, 5, and 6 into another series. With m = 4, we concatenated the data from participants p = 1, 2, 3, and 4 into one series and those from participants p = 5, 6, 7, and 9 into another series. With m = 5, we concatenated the data from participants p = 1, 2, 3, 4, and 5 into one series. Note that the results with m = 1 shown in this table are identical with those in Table 1. The average and standard deviation were computed across the sessions and the different manners of concatenating m participants given the session.
To statistically investigate potential differences between the within-participant and between-participant comparisons, we carried out the permutation test on d_J. The ND values for the whole-brain network, DMN, and CON were at least 1.3 (see Table 4). After a random permutation of the participants and sessions, the ND value was centered around 1 by definition. We show the distribution of the ND values obtained from c = 10³ random permutations in Fig. 5(a), (b), and (c) for the whole-brain network, DMN, and CON, respectively. We calculated the p value for the empirical data by contrasting it with the distribution of ND for the randomized data. We obtained p < 10⁻³ for all three networks, implying that no random permutation yielded an ND value larger than that for the empirical data before the random permutation. These results remained significant after correction for the multiple comparisons present in Table 4 (p < 1.2 × 10⁻², Bonferroni corrected). Therefore, we conclude that the estimated parameter values, {J_ij}, are significantly more reliable in the within-participant than between-participant comparison for the three networks.

Table 4: ND values and the permutation test results for the four discrepancy measures, calculated with the conventional likelihood maximization applied to the MSC data. The p values are the uncorrected values.

Reliability in terms of the activity patterns at the local minima
As a second index of the consistency between different energy landscapes, we compared the activity patterns at the local minima of the energy landscape between energy landscape pairs in terms of the Hamming distance, d_H. Table 3 indicates that the average d_H is at least 1.6 times larger for the between-participant than the within-participant comparison for the whole-brain network, DMN, and CON.
The ND value was at least 1.73 for the three networks (see Table 4). The permutation test yielded p < 10⁻³ for all three networks; see Fig. 5(d)-(f) for the distribution of the ND values for the random permutations. These results altogether support that the reliability of the energy landscape analysis in terms of d_H is higher within the same participant than between different participants.

Reliability in terms of the activity patterns averaged over the attractive basin
As a third index to characterize the consistency between energy landscapes, we measured the distance between the average activity patterns belonging to the attractive basin of a local minimum in one energy landscape and that in another energy landscape, i.e., d_basin. Similarly to the case of d_J and d_H, we found that d_basin is substantially smaller for the within-participant than the between-participant comparison for the three networks, although the standard deviation is not small (see Table 3). It should be noted that the observed d_basin values are close to 0 for both the within-participant and between-participant comparisons. This result implies an almost full agreement between a pair of energy landscapes in terms of the averaged activity pattern in the attractive basin, even for the between-participant comparison.
We show in Fig. 5(g)-(i) the distribution of the ND values for the random permutations as well as the ND values for the original energy landscapes. The permutation test yielded p < 10⁻³ for the whole-brain network and the DMN and p = 0.003 for the CON (Table 4). These results support a significantly high test-retest reliability of the energy landscape analysis in terms of d_basin, including the case of the CON, after correction for multiple comparisons across the networks and indices (p = 0.036, Bonferroni corrected).

Reliability in terms of the branch length
As a last index of consistency of the energy landscape, we measured the normalized difference in the average branch length in the disconnectivity graph, d_L, between two energy landscapes. We found that the average of d_L was smaller for the within-participant than the between-participant comparison for the three networks (see Table 3). The permutation test yielded p = 0.014 for the whole-brain network, and p < 10⁻³ for the DMN and CON; see Table 4 and Fig. 5(j)-(l). These results support a significantly high test-retest reliability of the energy landscape analysis in terms of d_L for the DMN and CON, although the result for the whole-brain network did not survive correction for multiple comparisons (p = 0.17, Bonferroni corrected).

Accuracy and reliability of the variational Bayes approximation method
The Bayesian estimation potentially allows us to reliably estimate an energy landscape without concatenating fMRI data across sessions or participants even if a single session is not long. Therefore, we repeated the same test-retest reliability analysis on the MSC data with the Bayesian estimation and without any concatenation.
After running the variational Bayes approximation method to compute the hyperparameters, we calculated the accuracy of fit, r_D, of the pairwise MEM. We obtained r_D = 86.02 ± 2.79%, 91.50 ± 3.21%, and 93.51 ± 1.48% for the whole-brain network, DMN, and CON, respectively. These high accuracy values support the effectiveness of the method.
We show the mean and standard deviation of the four discrepancy indices for the within-participant and between-participant comparisons in Table 5. For some combinations of the session, participant, and network, the Bayesian method yielded an energy landscape with just one local (and hence global) minimum of the energy. In this case, we set the branch length to be 0. Table 5 suggests that the within-participant consistency of the energy landscape analysis is notably higher than the between-participant consistency in terms of the four discrepancy measures, although the standard deviation is large. These results are qualitatively the same as those obtained with the conventional likelihood maximization method described in sections 3.2-3.5. However, the discrepancy values are substantially larger with the Bayesian method (see Table 5) than with the likelihood maximization method (see Table 3) for both within-participant and between-participant comparisons, with few exceptions.
Table 6 shows the results of the permutation test for the three networks and four discrepancy measures. We find significantly higher reliability within the same participant than between different participants in terms of d_J, d_H, and d_basin. In terms of d_L, the uncorrected p values were smaller than 0.05 but did not survive the Bonferroni correction for the whole-brain network and DMN. These results were similar to those for the likelihood maximization method. However, comparison of Tables 4 and 6 reveals that the ND value with the Bayesian method was smaller than that with the likelihood maximization method for all four discrepancy measures and all three networks. Therefore, we conclude that the Bayesian method yields significantly higher reliability within the same participant than between different participants in most cases, whereas the reliability is somewhat weaker than in the case of the conventional likelihood maximization method.

Validation with the Human Connectome Project data
As a different type of validation, we ran the test-retest reliability analysis for another fMRI data set, the HCP data. We used a whole-brain network with N = 7 ROIs. We calculated the accuracy of fit, r_D, of the pairwise MEM estimated with the likelihood maximization method applied to single-session data. We obtained r_D = 92.49 ± 1.99%, where we calculated the average and standard deviation on the basis of the four sessions per participant and all the participants. Table 7 shows the mean and standard deviation of the four discrepancy indices for the within-participant and between-participant comparisons. The results are similar to those for the MSC data. The ND values for d_J, d_H, d_basin, and d_L are 1.310, 1.152, 1.249, and 1.152, respectively. The permutation test yielded p < 10⁻³ for all four discrepancy indices. These results confirm significantly higher within-participant than between-participant test-retest reliability of the energy landscape analysis with a different data set.

Permutation test by shuffling the participants within each session
As another validation, we carried out a different variant of the permutation test in which we shuffled the participants within each session; the same shuffling was employed in previous studies [75,76]. Most combinations of the discrepancy measure and the network showed significant p values after the Bonferroni correction for the different data sets and energy landscape inference methods (see Appendix B for the detailed methods and results). This result is similar to that obtained with our original shuffling method, shown schematically in Fig. 3.

Discussion
We examined the test-retest reliability of the energy landscape analysis in terms of four indices. For each index, we calculated a discrepancy in the index value between two estimated energy landscapes. We then constructed and ran a permutation test on the calculated discrepancy value to statistically assess whether the within-participant comparison of two energy landscapes yielded a smaller discrepancy value than the between-participant comparison. For the two data sets, we found significant within-participant test-retest reliability (i.e., within-participant discrepancy being significantly smaller than between-participant discrepancy) in most cases. Furthermore, we found qualitatively the same results for a Bayesian variant of the energy landscape estimation method, which enables us to estimate an energy landscape for each scanning session, mitigating the data-hungry nature of the original estimation method.
The accuracy of fit measured by r_D was large for the variational Bayes approximation method (i.e., 86.02, 91.50, and 93.51% on average for the whole-brain network, DMN, and CON, respectively) although we did not concatenate the fMRI data across different sessions. These r_D values are close to those obtained with the conventional likelihood maximization method with concatenation of four or five sessions (see Tables 1 and 2). The high accuracy of the variational Bayes method is presumably due to the fact that the target empirical distribution of activity patterns, i.e., P_N(V_i) in Eq. (11), is necessarily different between the two estimation methods. Specifically, P_N(V_i) is the empirical distribution over all sessions for non-Bayesian estimation methods, whereas it is the empirical distribution for one session for Bayesian methods. We do not ascribe the higher accuracy of fit of the variational Bayes method to overfitting. The variational Bayes method yields a Boltzmann distribution for each session. Therefore, it uses M × D parameters, where we remind that M = N(N + 1)/2 is the number of parameters of the Boltzmann distribution and D = 80 is the number of sessions. It thus uses D times more parameters than the conventional likelihood maximization method, which uses M parameters to estimate one Boltzmann distribution. However, the variational Bayes method needs to produce an accurate Boltzmann distribution tailored to a single session to attain a high r_D value, which is not the case for the conventional likelihood maximization method. In general, the accuracy of the pairwise MEM simply degrades if the data are shorter (see Tables 1 and 2; also see [38] for a systematic analysis of the effect of the data length on the accuracy). Our results that the variational Bayes method yields a higher accuracy of fit and higher consistency in the within-participant than between-participant comparison both support that individual-to-individual differences are not negligible when carrying out energy landscape analysis. While such individual differences were a motivation behind the original proposals of the Bayesian methods [40,70], further comparisons of Bayesian and non-Bayesian estimation methods as well as pursuit of biological and medical relevance of energy landscapes estimated with the Bayesian methods remain future work.

Table 7: Discrepancy between two energy landscapes estimated by the conventional likelihood maximization method applied to the HCP data. "Within" and "Between" in the table stand for within-participant and between-participant, respectively. We computed the average and standard deviation of d_1 and d_2 across the participants and across the sessions, respectively.
The significance of the test-retest reliability results obtained with the permutation test was similar between the likelihood maximization and variational Bayes methods. However, the ND values were larger for the likelihood maximization method than for the variational Bayes method. As a separate result, the discrepancy indices were overall smaller for the likelihood maximization method than for the Bayesian method. The latter two results favor the likelihood maximization method over the Bayesian method for realizing high test-retest reliability. However, we point out that the estimation of an energy landscape by likelihood maximization requires concatenation of four sessions, whereas the Bayesian method avoids concatenation. Assessment of test-retest reliability for different Bayesian approximation methods [70] and other approximate methods such as the pseudo-likelihood maximization [38,41], including systematic analysis of the dependence of the results on the data length, is left as future work.
The intraclass correlation coefficient (ICC) has been widely used for investigating test-retest reliability in functional connectivity data [21]. We did not use the ICC because our quantification of the estimated energy landscape was mostly multidimensional and difficult to fit into an ANOVA or similar framework, on which most ICC measures are based. Specifically, {J_ij}, based on which we calculated d_J, is an N(N − 1)/2-dimensional vector. In addition, we calculated d_H and d_basin by examining the activity patterns at local minima and their average over the attractive basin, respectively, in a situation where the number of local minima varies from one energy landscape to another. Therefore, we decided to calculate a discrepancy measure for each of the four indices between two energy landscapes and constructed a permutation test to examine test-retest reliability. We point out that the average branch length is a scalar characterization of an energy landscape, and therefore it is straightforward to use conventional ICC measures if we discard the normalization factor in Eq. (26). See below for a preliminary analysis of ICCs.
Quantities d_1 and d_2 used for defining ND (= d_2/d_1) are averages over 80 and 100 samples, respectively, of a discrepancy measure, such as d_J. For the randomized data produced in the permutation test, averaging over 80 or 100 samples suppresses fluctuations in individual samples. Therefore, the standard deviations of d_1 and d_2 are small compared with those that would be obtained as averages over fewer samples of randomized data. Then, the statistical fluctuation of ND is proportionately small, such that ND for the randomized data is centered around 1 with a small standard deviation, which tends to make the ND for the original data significantly different from 1. The number of samples for calculating d_1 or d_2, such as 80 or 100, is our arbitrary choice, and the statistical significance of the permutation test depends on the choice of these numbers. This is an important limitation of our permutation test. However, for the present fMRI data, we still obtain qualitatively similar, albeit statistically weaker, results even without carrying out any averaging. We show in Appendix C the ND values for the original data and the p values from the permutation test when d_1 and d_2 are a single sample of a discrepancy measure (e.g., d_J). Tables 11 and 12 indicate that three out of the twelve combinations of the network and the discrepancy measure yield significant p values (i.e., less than 0.05/12 = 0.00417 uncorrected, considering the Bonferroni correction) for the MSC data when we estimate the energy landscape by the conventional method and the variational Bayes method, respectively. Table 13 also suggests that, out of the four discrepancy measures calculated for the HCP data with the conventional method, one measure yields a significant p value (i.e., less than 0.05/4 = 0.0125 uncorrected).
As another stress test, we briefly analyzed three ICC measures, although we have already stated why we did not use them in our main analysis. As mentioned earlier, the average branch length, used for defining d_L, is the only scalar characterization of an energy landscape employed in the present study. Therefore, we computed the average branch length for the whole-brain network obtained from each session of the HCP data and then the first two ICC measures, which are conventional and take a scalar value for each session as input. The third ICC measure assumes vector input for each session, so we used the vectorization of matrix J = (J_ij). We show the definitions of the three ICC measures in Appendix D. The ICC value calculated by a standard method [77] is −0.0115. In general, negative values of the ICC are interpreted as zero reliability [78,79]. The ICC value calculated for the average branch length by a second method [80] is 0.4542. This value is reasonably large [80]. We also calculated I_diff [23,25] as an ICC measure. We obtained I_diff = 10.13% when the data for each participant-session pair is {J_ij; 1 ≤ i < j ≤ N}. This value is roughly similar to the I_diff values for functional networks obtained from the HCP data in a previous study [23]. The results for the last two ICC measures indicate a moderate reliability within a single participant relative to across different participants. In contrast, the first definition of the ICC does not support this result. Although the reasons for the discrepancy are unclear, we believe that these preliminary results are a step toward in-depth individual-level fingerprinting analyses in the future.
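I_diff is commonly defined as the differential identifiability, (I_self − I_others) × 100, where I_self and I_others are the average correlations between test and retest feature vectors of the same participant and of different participants, respectively. A minimal sketch with synthetic data, assuming this common definition (the function name is ours):

```python
import numpy as np

def i_diff(test_features, retest_features):
    """Differential identifiability I_diff = (I_self - I_others) * 100.
    The identifiability matrix A holds Pearson correlations between the
    test vector of one participant (rows) and the retest vector of
    another (columns); the diagonal pairs the same participant.
    Both inputs: (n_participants, n_features) arrays."""
    n = test_features.shape[0]
    # np.corrcoef stacks the rows of both inputs; the [:n, n:] block is
    # the test-vs-retest cross-correlation (identifiability) matrix.
    A = np.corrcoef(test_features, retest_features)[:n, n:]
    i_self = np.mean(np.diag(A))
    i_others = np.mean(A[~np.eye(n, dtype=bool)])
    return (i_self - i_others) * 100.0

# Synthetic data: 5 participants with distinct 50-dimensional "fingerprints"
# plus small session noise, so participants are highly identifiable.
rng = np.random.default_rng(1)
base = rng.normal(size=(5, 50))
test = base + 0.1 * rng.normal(size=(5, 50))
retest = base + 0.1 * rng.normal(size=(5, 50))
print(i_diff(test, retest))  # large positive value
```

In this toy example I_diff is close to 100% because the fingerprints dominate the noise; the empirical value of 10.13% for {J_ij} reflects much weaker, but still present, participant specificity.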
The estimated J matrix can be regarded as a functional connectivity matrix and can be better at estimating the structural connectivity than other conventional definitions of functional connectivity [67]. The estimated h_1, …, h_N are close to zero because our standard choice of the threshold (i.e., the time average of the signal for each ROI i) makes each σ_i take −1 and +1 with probability approximately 1/2 each. Then, an energy landscape is almost completely determined by J. In this sense, the reliability of the energy landscapes in terms of the indices we have investigated derives from the reliability of functional connectivity. However, our results are not direct consequences of known results on the high reliability of functional connectivity in fMRI data [18,20,25] because those studies estimated functional connectivity using other conventional methods, such as the Pearson correlation coefficient and its variants. Furthermore, which properties of energy landscapes, including J, bear higher reliability is not a trivial question. For example, Table 11 shows that the activity patterns at the local minima of the energy, measured by d_H, and the activity patterns averaged over the attractive basins, measured by d_basin, are more reliable than J for the DMN.
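That the landscape is almost determined by J when h_1, …, h_N ≈ 0 can be seen from the standard pairwise maximum entropy (Ising) energy, E(σ) = −Σ_i h_i σ_i − Σ_{i<j} J_ij σ_i σ_j. A minimal sketch of this energy function (the standard form; variable names are ours):

```python
import numpy as np

def energy(sigma, h, J):
    """Energy of the pairwise maximum entropy (Ising) model:
    E(sigma) = -sum_i h_i sigma_i - sum_{i<j} J_ij sigma_i sigma_j,
    with sigma_i in {-1, +1} and J symmetric with zero diagonal.
    The factor 1/2 compensates for double-counting i < j pairs."""
    return -h @ sigma - 0.5 * sigma @ J @ sigma

# With h ~ 0, the landscape is determined by J alone:
J = np.array([[0.0, 1.0],
              [1.0, 0.0]])
h = np.zeros(2)
print(energy(np.array([1, 1]), h, J))   # -1.0: aligned pattern, low energy
print(energy(np.array([1, -1]), h, J))  # +1.0: misaligned pattern, high energy
```

With h = 0, flipping all signs of σ leaves the energy unchanged, so the landscape's structure (its local minima and barriers) is fixed entirely by the couplings J.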
Related to this issue, although we proposed four discrepancy indices for pairs of energy landscapes, they are our arbitrary choices. One can apply the analysis pipeline proposed in the present study to assess test-retest reliability for other discrepancy indices. Other potential discrepancy indices are the frequency of transitions from one particular local minimum to another and features of the transition probability matrix among the activity patterns or among the local minima. Furthermore, our framework of the permutation test on the ND value is not limited to energy landscape analysis (e.g., it may be applied to "microstate dynamics" in fMRI data [81]).
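For instance, the transition probability matrix among local minima mentioned above could be estimated from the sequence of basins visited over time. A hypothetical sketch (`transition_matrix` is our illustrative helper, not part of the paper's pipeline):

```python
import numpy as np

def transition_matrix(minimum_sequence, n_minima):
    """Empirical transition probability matrix among local minima,
    estimated from the time-ordered sequence of basin labels.
    Row k gives the probabilities of moving from minimum k to each
    minimum at the next time step."""
    T = np.zeros((n_minima, n_minima))
    for a, b in zip(minimum_sequence[:-1], minimum_sequence[1:]):
        T[a, b] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    # Normalize each visited row to probabilities; unvisited rows stay zero.
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)

# Toy sequence of visited basins over 8 time steps, 3 local minima
seq = [0, 0, 1, 1, 0, 2, 2, 1]
T = transition_matrix(seq, 3)
print(T[0])  # transition probabilities out of local minimum 0
```

A discrepancy index between two landscapes could then compare such matrices (e.g., entrywise), after matching the local minima of the two landscapes.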
Individual variability of fMRI data has most frequently been investigated in terms of functional connectivity [21]. In contrast, we have shown evidence that energy landscape analysis of fMRI data is more reproducible from session to session within a participant than between different participants. The present results encourage further work toward applying energy landscape analysis to the identification of individuals in different cognitive, behavioral, and clinical conditions.

Ethical approval
All the data are publicly available and were collected with the approval of the ethics committees or institutional review boards of the recording institutes [22,59]. The corresponding author's institutional review board generally assigns a determination of Not Human Research to such cases.

Authors' contributions
TW and NM conceptualized the project. JN, SM, and TW collected the data. PK, JN, SM, and TW curated the data. PK, TW, and NM developed the methodology. PK performed formal analysis and investigation. PK and NM validated the results. PK and NM visualized the results. TW and NM acquired the funding. NM administered and supervised the project. PK and NM prepared the first draft of the manuscript, and the other authors contributed to the manuscript revision. All authors read and approved the final version of the manuscript.

Appendix A: Sample disconnectivity graphs
We show in Fig. 6 a sample disconnectivity graph for the MSC data based on the concatenation of four sessions of a participant, together with one based on the concatenation of four randomized sessions.

In Appendix B, we run the permutation test using a different randomization procedure [75,76]. We first calculate d_1 and d_2 for the original data in the same way as described in Fig. 3(a) and then the ND value. However, when randomizing the data for the permutation test, we uniformly randomly permute the participants within each session.

Figure 1: Schematic of a disconnectivity graph showing the relationships between the activity patterns that are local minima of the energy. The arrow indicates the height of the energy barrier between local minima γ_1 and γ_2 from the viewpoint of γ_1.
Figure 2 describes how to match the local minima of two energy landscapes.

We denote by u^(2)_{ℓ′} the average of the activity patterns in the attractive basin of Ṽ^(2)_{ℓ′} in the second energy landscape. Then, we calculate the cosine distance between u^(1)_ℓ and u^(2)_{ℓ′}.
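The cosine-distance matching of local minima can be sketched as follows. This is a greedy nearest-neighbor sketch with hypothetical helper names; the paper's exact matching procedure follows Fig. 2 and may differ.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two basin-averaged
    activity patterns."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def match_minima(U1, U2):
    """Match each local minimum of the first landscape (rows of U1,
    basin-averaged patterns) to its nearest counterpart in the second
    landscape (rows of U2) by cosine distance.
    Returns the index in U2 matched to each row of U1."""
    return [int(np.argmin([cosine_distance(u, v) for v in U2]))
            for u in U1]

# Toy example: two landscapes with two local minima each, listed in
# opposite orders, so the matching swaps the indices.
U1 = np.array([[1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
U2 = np.array([[-1.0, -1.0, 1.0], [1.0, 1.0, -1.0]])
print(match_minima(U1, U2))  # [1, 0]
```

A greedy nearest-neighbor match can assign two minima of the first landscape to the same minimum of the second; a one-to-one assignment would require an additional constraint, which is one reason the matching step is nontrivial when the numbers of local minima differ.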

Figure 3: Schematic diagram describing the concatenation of the fMRI data across different sessions and the calculation of d_1 and d_2. (a) For the original data. (b) For the randomized data. The inference of the energy landscape is based on the data concatenated across four sessions. The same four cells in the table are used for the concatenation in (a) and (b). However, because of the random permutation, any concatenation in (b) is over four sessions that are selected uniformly at random from the original data. Therefore, in (b), d′_1 and d′′_1, for example, are statistically the same, and the expectation of d_1 and that of d_2 are the same.
2, for example. With m = 3, we concatenated sessions s = 1, 2, and 3 from the same participant into one time series, sessions s = 4, 5, and 6 into another, and sessions s = 7, 8, and 9 into a third. With m = 4, we concatenated sessions s = 1, 2, 3, and 4 into one series and sessions s = 5, 6, 7, and 8 into another. With m = 5, we concatenated sessions s = 1, 2, 3, 4, and 5 into one series and sessions s = 6, 7, 8, 9, and 10 into another. The average and standard deviation were computed across the participants and the different ways of concatenating m sessions per participant.

Figure 5: Histogram of ND for the randomized data and the empirical ND value. The first, second, and third columns of the figure show the distributions for the whole-brain network, the DMN, and the CON, respectively. The four rows of the figure show the distributions for d_J (in (a), (b), and (c)), d_H (in (d), (e), and (f)), d_basin (in (g), (h), and (i)), and d_L (in (j), (k), and (l)), from top to bottom. In each panel, the vertical line indicates the empirical ND value.
T.W. acknowledges support from the Japan Society for the Promotion of Science (TW, 19H03535, 21H05679, 23H04217). N.M. acknowledges support from the Japan Science and Technology Agency (JST) Moonshot R&D (under grant no. JPMJMS2021), the National Science Foundation (under grant no. 2204936), and JSPS KAKENHI (under grant nos. JP21H04595 and 23H03414).

Data availability statement
The two data sets used in this work are publicly available. The first data set was provided by the Midnight Scan Club (MSC) project, funded by NIH grants NS088590 and TR000448 (NUFD), MH104592 (DJG), and HD087011 (to the Intellectual and Developmental Disabilities Research Center at Washington University); the Jacobs Foundation (NUFD); the Child Neurology Foundation (NUFD); the McDonnell Center for Systems Neuroscience (NUFD, BLS); the Mallinckrodt Institute of Radiology (NUFD); the Hope Center for Neurological Disorders (NUFD, BLS, SEP); and Dart Neuroscience LLC. This data set was obtained from the OpenfMRI database; its accession number is ds000224. The second data set was provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657), funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University.

Figure 6: Examples of disconnectivity graphs. (a) Concatenation of the first to the fourth sessions from participant MSC01. (b) Concatenation of four sessions from the same participant after we randomly shuffled the sessions of the MSC data using the method depicted in Fig. 3. We used the conventional likelihood maximization method.

Appendix B: Permutation test by shuffling the participants within each session

Table 1: Mean and standard deviation of d_1 and d_2 for the whole-brain network.

Table 2: Accuracy of fit of the pairwise MEM when we concatenate fMRI data across different participants. Each cell represents the accuracy of fit, in percent, when we concatenate the fMRI data across sessions from different participants.

Table 5: Discrepancy between two energy landscapes estimated by the variational Bayes method applied to the MSC data. "Within" and "Between" in the table stand for within-participant and between-participant, respectively. We computed the average and standard deviation of d_1 and d_2 across the participants and across the sessions, respectively.

Table 6: ND values and permutation test results for the four discrepancy measures, calculated with the variational Bayes method applied to the MSC data. The p values are uncorrected.