Reproducibility of graph measures at the subject level using resting‐state fMRI

Abstract Introduction Graph metrics have been proposed as potential biomarkers for diagnosis in clinical work. However, before it can be applied in a clinical setting, their reproducibility should be evaluated. Methods This study systematically investigated the effect of two denoising pipelines and different whole‐brain network constructions on reproducibility of subject‐specific graph measures. We used the multi‐session fMRI dataset from the Brain Genomics Superstruct Project consisting of 69 healthy young adults. Results In binary networks, the test–retest variability for global measures was large at low density irrespective of the denoising strategy or the type of correlation. Weighted networks showed very low test–retest values (and thus a good reproducibility) for global graph measures irrespective of the strategy used. Comparing the test–retest values for different strategies, there were significant main effects of the type of correlation (Pearson correlation vs. partial correlation), the (partial) correlation value (absolute vs. positive vs. negative), and weight calculation (based on the raw (partial) correlation values vs. based on transformed Z‐values). There was also a significant interaction effect between type of correlation and weight calculation. Similarly as for the binary networks, there was no main effect of the denoising pipeline. Conclusion Our results demonstrated that normalized global graph measures based on a weighted network using the absolute (partial) correlation as weight were reproducible. The denoising pipeline and the granularity of the whole‐brain parcellation used to define the nodes were not critical for the reproducibility of normalized graph measures.


| INTRODUC TI ON
Resting-state fMRI (rs-fMRI) is a task-free and an easy-to-use tool for neuroscientific data acquisition. It is used to detect spontaneous low frequency (<0.1 Hz) fluctuations of the brain by blood-oxygen-level-dependent (BOLD) signals during a state of rest (Biswal, Yetkin, Haughton, & Hyde, 1995;Smith et al., 2013). Those fluctuations are highly organized across discrete brain regions (Greicius, Krasnow, Reiss, & Menon, 2003). Functional connectivity analysis is a way to analyze how distant brain regions interact and graph theory can be used to quantify performance of the entire network (Bullmore & Sporns, 2009;Park & Friston, 2013;Sporns, 2011).
Many studies have reported that alterations of interactions among distant brain regions and dysfunction of networks are closely related to brain diseases, such as epilepsy (Burianová et al., 2017), Alzheimer's disease (Greicius, Srivastava, Reiss, & Menon, 2004;Hallquist & Hillary, 2018;Johnson, Sperling, & Sepulcre, 2013) and among others. Petrella (2011) proposed the use of graph metrics as potential biomarkers and diagnostic tools in clinical work. However, before graph measures can be widely applied as a biomarker in a clinical setting, it is critical that a number of properties of graph measures should be evaluated rigorously. These properties include simplicity, robustness, and testretest variability.
The reproducibility of rs-fMRI based graph measures faces a big challenge, namely how to effectively handle BOLD signals contaminated by noise (Bianciardi et al., 2009) and how to reduce the influence of noise on the reproducibility of graph measures. The main causes of this contamination are head motion and non-neuronal physiological fluctuations. Head motion, even very subtle movement, has been demonstrated to have a negative impact on BOLD signals (Parkes, Fulcher, Yücel, & Fornito, 2018;Satterthwaite et al., 2012Satterthwaite et al., , 2019. Therefore, the influence caused by head motion and non-neuronal physiological fluctuations should be removed as much as possible to reduce the impact on functional connectivity. Research in denoising techniques reducing the influence of noise in BOLD signals attracted quite some attention (Satterthwaite et al., 2019). Denoising techniques can have an impact on the reproducibility of graph measures. The most common strategies for denoising typically correct the signal using three types of information: (a) head movement parameters (Friston, Williams, Howard, Frackowiak, & Turner, 1996;Satterthwaite et al., 2013); (b) physiological signals derived from white matter and cerebrospinal fluid (Fox, Snyder, Vincent, & Raichle, 2007); and (c) a global signal regressor (GSR) (Fox, Zhang, Snyder, & Raichle, 2009;Murphy, Birn, Handwerker, Jones, & Bandettini, 2009). These strategies are often combined with temporal censoring of bad volumes that contain too much noise (Power, Barnes, Snyder, Schlaggar, & Petersen, 2012;Power et al., 2014;Satterthwaite et al., 2013).
Unfortunately, there has been no agreement on which strategy is superior to reduce the noise present in rs-fMRI images (Caballero-Gaudes & Reynolds, 2017). Therefore, we investigated to which degree reproducibility of graph measures depends on the denoising techniques.
Besides the challenge of reducing noise before network construction, network construction itself can influence the reproducibility of graph measures (Dimitriadis, Drakesmith, et al., 2017;Wang et al., 2011Wang et al., , 2014. However, this has not been systematically explored. Network construction consists of defining the nodes of the network and defining the connectivity measure and how to calculate the weights of the network from this measure (Rubinov & Sporns, 2010). Nodes are the main elements of a network. They can be defined using an atlas or parcellation (Desikan et al., 2006;Shen, Tokoglu, Papademetris, & Constable, 2013;Tzourio-Mazoyer et al., 2002). They can also be defined based on a set of a priori regions (Fritz et al., 2019) or based on a data-driven method (de Vos et al., 2018) or even be defined as the voxels themselves (Horn et al., 2019). Different definitions of nodes have different advantages and limitations (Arslan et al., 2018). Node definition may affect reproducibility but also the way functional connectivity between nodes is calculated (Liang et al., 2012). The latter is typically based on a Pearson correlation but also partial correlations can be used since it can remove the influences of other nodes (Marrelec et al., 2006;Smith et al., 2011). Both types of correlation can be used to derive graphs. However, a choice needs to be made how to treat negative (partial) correlations. One option is to use the absolute value representing the information shared between two nodes. Graphs based on the absolute value are graphs in which all functional information among nodes is presented. An alternative is to study graphs based on only the positive or negative (partial) correlations separately setting the other values to zero. However, the proportion of positive and negative connections in a real network may vary, which may induce some bias if we would take only either positive correlations or negative correlations into account. The effect of these choices on the reproducibility of graph measures has not yet been systematically investigated. The last element of network construction is the choice between a weighted and a binary network. In weighted networks, a selection of connections can be made, for example, using a soft-thresholding removing small weights and keeping the weights of the remaining connections or they can be used as fully weighted graphs in which no a priori selection is made (Li, Xue, Ellmore, Frye, & Wong, 2014;van den Heuvel, Mandl, Stam, Kahn, & Hulshoff Pol, 2010;Wang, Ghumare, Vandenberghe, & Dupont, 2017). For weighted networks, the weight can be obtained either from the raw (partial) correlations itself or from a transformed Z-value after a Fisher r-to-Z transform (Wang et al., 2017). In binary networks, selection of connections is based on significance or amplitude. Selected connections will get a weight of 1 while the weight of the others is set to 0 (Hosseini et al., 2013;Wang et al., 2014).
Although it can provide interesting information about brain functioning in normal or pathological conditions, it is not sufficient if we want to introduce these techniques in a clinical setting in which subject-specific graphs have to be constructed and quantified (Gordon et al., 2017;Poldrack et al., 2015). Therefore, we focused on subject-specific graph measures in this paper.
In this study, we comprehensively investigated how these factors mentioned above affect the reproducibility of subject-specific graph measures and investigated which combination gave reasonable results. We investigate the use of two denoising pipelines: a simple versus a more complex denoising pipeline (Ciric et al., 2017;Parkes et al., 2018). Furthermore, we investigated the use of the two functional connectivity measures (correlations vs. partial correlations), three different approaches to handle negative correlations (taking the absolute value, taking only positive values, or taking only the negative values), two types of network (binary or weighted), and two types of calculations of the weights of the network (based on the (partial) correlation values or based on a transformation of Z-values obtained after a Fisher r-to-Z transformation). Finally, we also considered two levels of granularity for the whole-brain parcellation (50 vs. 100 parcels per hemisphere). The overall aim was to answer the following questions: (a) are subject-specific graph measures reproducible and (b) what is the optimal pipeline for rs-fMRI based subject-specific graph measures?

| Dataset and ethical statement
We used multi-session fMRI datasets from the Brain Genomics Superstruct Project (GSP) (Holmes et al., 2015). This dataset
Before data preprocessing, we set the origin for all structural and functional MRI images close to the anterior commissure and if needed, adapted the orientation to make it more similar to the template in MNI space. This step only affected the transformation matrix and not the data itself since we did not reslice the images.
The standard preprocessing steps included: (1) The first four dummy scans of each run were removed; (2) all resting-state functional images were realigned to correct for head movement; (3) slice timing; (4) coregistration of the structural and mean functional image; (5) segmentation of the structural image which provides deformation fields to warp the data to MNI space; and (6) warping of the functional images to MNI space using the deformation field of each subject obtained in the previous step.

| Image quality control
Besides visually checking image quality by a neuroradiologist (QR) to make sure there were no apparent structural abnormalities or artifacts present, image quality of the rs-fMRI data was further assessed using the head movement parameters obtained during the realignment step (overall translation or rotation along any direction and framewise displacement) and the change in brain intensities between two consecutive volumes (Power et al., 2012(Power et al., , 2014.
Framewise displacement (FD) was calculated as (Power et al., 2012(Power et al., , 2014. in which Δd ix = d (i−1)x − d ix , and similarly for the other movement pa- Note that the rotation parameters should be expressed in radians and translation parameters in mm.
The volume by volume intensity changes across whole-brain voxels are calculated as (Power et al., 2012;Smyser et al., 2010).
x is the image intensity at locus ⃗ x in volume i, <.> denotes the spatial average over the whole-brain voxels (Power et al., 2012;Smyser et al., 2010).
In this work, we used a cutoff value for FD of <0.5 mm consistent with values reported by Waheed et al. (2016). The DVARS threshold was set (per run) at the mean of the DVARS values over time plus three standard deviations. The overall translation (across the run) was required to be <1.5 mm and the overall rotation <1.5 degrees (in all directions). When no 5 min interval was present with these characteristics, we applied scrubbing on those volumes which were outside these predefined limits. We only did this for at most 20 "bad" volumes. Scrubbing seems compelling to reduce head movement effects but at the cost of loss of temporal degrees of freedom (Ciric et al., 2017;Parkes et al., 2018). Furthermore, if too many volumes required scrubbing, it is not clear if the scrubbed data still represent the underlying connectivity. When none of the above was satisfied, we considered the data as of too low quality and that run was taken off-study.
To investigate the reproducibility of graph measures among sessions, we selected the first run of each session except when the first run was taken off-study in which case the second run was used.

| Pipelines for denoising
We selected two denoising pipelines: a simple approach on the one The head motion parameters were derived from the realignment step in the preprocessing. WM and CSF regressors were extracted in a WM and a CSF mask, respectively, which were calculated as the intersection between the subject-specific WM and CSF segmentations (thresholded at 0.9) and a priori masks of WM and CSF. The a priori masks were generated from the a priori masks from SPM (in case of CSF, we limited it to the lateral ventricles) by thresholding them at 0.5 and by applying an erosion to avoid being too close to GM.
The global signal regressor was calculated as the average BOLD time series across all voxels within a global brain mask defined as voxels in which the sum of the GM, WM, and CSF segmentation maps is more than 0.9 (Fox et al., 2007;Shirer, Jiang, Price, Ng, & Greicius, 2015). We included GSR in both simple and complex denoising pipelines.
In case scrubbing was required (i.e., when censored volumes were identified), the censored volumes were not taken into account in any of the preprocessing steps except for the band-pass filtering in which case they were marked, and their values were replaced by linear interpolation before the filtering. During the calculation of the functional connectivity, the censored volumes were not used. It is worth mentioning that we applied scrubbing of bad volumes in both denoising pipelines (Power et al., 2014) if required.
After defining all regressors, we applied a principal component analysis (PCA) to avoid multi-collinearity of regressors (Jackson, 1991).
After removal of the nuisance signals, band-pass filtering was applied (0.009-0.1 Hz).

| Network construction
A network consists of nodes and connections. The nodes in this study were derived from a whole-brain parcellation using different levels of granularity (Shen et al., 2013). We selected 50 parcels (Shen50) and 100 parcels (Shen100) per hemisphere for this work (https://www.nitrc.org/frs/?group_id=51). Each parcel was taken as a node of the network. and could be considered as a more specific measure for functional connectivity (Marrelec et al., 2006;Smith et al., 2011).
We studied two types of networks: a binary network and a weighted undirected network without self-connections. Binary networks were studied at different densities. The density should neither be too sparse nor too dense (Hosseini et al., 2013;Kaiser & Hilgetag, 2006). Therefore, we used densities ranging from 5% to 40% (Wang et al., 2014). Another way is to transform (partial) correlations to Z-scores using a Fisher r-to-Z transform (Finn, 1974): in which r indicates the (partial) correlation, n is the number of volumes, and p is the number of regions (which is 2 in case of correlations) (Finn, 1974;Wang et al., 2014). Weights are then calculated from the Z-scores as (Wang et al., 2017).
in which Φ is the cumulative distribution function of the standard normal distribution.
Similar as before, one can take all the Z-values (abs), set the negative Z-values to zero (pos), or set the positive Z-values to zero (neg) before calculating the weights w. Although the approach in which we calculate weights based on Z-scores is theoretically better, it suffers from two problems: The number of volumes should be greater than the number of parcels plus one in order to obtain real values when calculating partial correlations directly without any regularization approach. In the current dataset, the maximally available number of volumes was 120 for a whole run. Therefore, partial correlations were only obtained for the first level of granularity (50 parcels per hemisphere). The other problem is that the values of the Z-scores depend on the number of volumes used, and this has an impact on the graph measures themselves. This can be a problem when combining data from subjects with a different number of volumes.

| Graph analysis
Graph theoretical analysis was conducted using the brain connectiv- showing the highest betweenness centrality. If the hub score was at least 2, the node was considered a hub. The modularity structure was determined using Newman's algorithm using 100 realizations (Vandenberghe et al., 2013;Wang et al., 2014). This defines the probabilistic co-assignment matrix in which the probability is given that two nodes belong to the same module when applying Newman's algorithm. We also calculated normalized graph measures. A normalized graph measure is the ratio between the graph measure and the average graph measure of 30 equivalent random networks. Equivalent random networks have the same number of nodes and the same density (in case of binary networks) or the same distribution of connectivity values (for a weighted network) but connections are randomly assigned to each pair of nodes.

| Measures of reproducibility for graph measures
The reproducibility of graph measures (Wang et al., 2014)  The consistency of hubs is determined by measuring the co-occurrence (Hc) of hubs (Wang et al., 2014) which is based on the Dice coefficient.
where H 1 and H 2 are the list of hubs in the network of the first, respectively, and second measurements.
A value of 1 corresponds to a perfect agreement of hubs while 0 reflects no agreement at all.
To assess the consistency of the modularity structure, we used probabilistic scaled inclusivity (pSI) (Wang et al., 2014): Here, i and j indicate two nodes; P 1 (i,j) and P 2 (i,j) are the probabilistic co-assignment between nodes i and j in the network derived from the first and second measurement, respectively. A value of 1 corresponds to a perfect agreement of modularity while 0 reflects no agreement at all. Here, we calculated pSI for each node.

| Statistics
In this study, we investigate the effect of different factors on the The analyses were performed in three steps.
First, we used a 2 × 2 factorial repeated measures nonparametric analysis of variance using ART to assess TRT values of graph measures, in which the first factor was the denoising pipeline (simple vs. complex denoising) and the second factor was the type of correlation (Pearson correlation vs. partial correlation). For this analysis, we used the Shen50 parcellation and the absolute value of the (partial) correlations. We performed the analysis separately for binary networks at different densities and for weighted networks. In the latter case, we introduced a third factor based on how the weights were calculated (using the (partial) correlation values or using weights based on the Fisher r-to-Z transformation), and thus, we analyzed a 2 × 2 × 2 factorial repeated measures nonparametric analysis of variance.
Second, we performed a 3 × 2 factorial repeated measures nonparametric analysis of variance using ART (ARTool in R) in which the first factor was the handling of the negative correlations (absolute value, only positive values, and only negative values) and the second factor was the type of correlation (Pearson correlation vs. partial correlation). For this analysis, we used-based on the results of the first analyses (see below)-the Shen50 parcellation, the simple denoising pipeline, and weighted graphs in which weights were calculated based on the value of the (partial) correlations.
Third, we used ART to compare the effect of the level of parcellation (50 vs. 100 parcels per hemisphere). In this analysis, we used again the simple denoising pipeline and weighted graphs in which weights were calculated based on the value of the correlations. We also had to limit this analysis to Pearson correlations since partial correlations cannot be reliably calculated given the number of data points in the time series compared to the number of nodes.
An overview of all analyses is given in Table 1 and Figure 1.
The p-value was set at a Bonferroni corrected p < .05 to take into account multiple comparisons (the number of graph measures we have evaluated although graph measures are not completely independent). We did not correct for the number of analyses we performed, to prevent too much reduction of power. All statistical analyses were performed in Rstudio (version 3.6.0).

| Binary networks
In binary networks, the TRT variability for global measures was large at low density ( Figure 2, Table S1) irrespective of the denoising strategy or the type of correlation. Comparing the different strategies (analysis 1a, Table 1; Figure 1a), the type of correlation (Pearson correlation vs. partial correlation) has a significant effect on TRT (Table 2): Pearson correlations showed lower TRT values (Table S1).
The denoising pipelines (simple vs. complex denoising) did not have a significant effect, and there was no interaction effect between the denoising pipelines and the type of correlation (Table 2). Since TRT values were large at lower densities even for global graph measures (Table S1), we did not further investigated binary networks.

| Weighted networks
Weighted networks showed very low TRT values for global graph measures ( Figure 3, Table 3A, Table S2) (Table 3B). There was also a significant interaction effect between type of correlation and weight calculation (Table 3B). Similarly as for the binary networks, there was no main effect of the denoising pipeline.
Local clustering coefficient and nodal average path length showed relatively low TRT values but local betweenness centrality showed high TRT values (Figure 4, Table S3). Nodal efficiency also showed high TRT values in all cases except when weights were calculated based on transformed Z-values of the correlations.
Hub consistency was moderate with average values (for the different strategies) between 0.23 and 0.29.
The average consistency of the modularity structure was between 0.18 and 0.24.

| Binary and weighted networks constructed using orthogonal minimal spanning trees
We have also applied a data-driven topological filtering approach based on orthogonal minimal spanning trees (OMST) to construct a binary and a weighted network. This method was recently proposed

TA B L E 1 Statistical analyses
Analyses 1: Factorial repeated measures nonparametric analysis of variance using the Shen50 whole-brain parcellation and using absolute values of the (partial) correlations 1a binary networks: 2 × 2 design with factor 1: denoising pipeline (simple vs. complex) and factor 2: type of correlation (Pearson correlation vs. partial correlations) for different densities 1b weighted networks: 2 × 2 × 2 design with factor 1: denoising pipeline (simple vs. complex); factor 2: type of correlation (Pearson correlation vs. partial correlations); and factor 3: weight calculation method (based on the raw values of the (partial) correlations vs. based on transformed Z-values obtained after a Fisher r-to-Z transform) Analysis 2: 3 × 2 factorial repeated measures nonparametric analysis of variance using the Shen50 whole-brain parcellation using the simple denoising pipeline and weighted graphs in which the weights are based on the values of the (partial) correlations.

| Binary and weighted networks constructed using the Oxford-Harvard atlas
To ensure that the results were not critical depending on the Shen50 atlas, we also performed an additional analysis using the Oxford-Harvard atlas (Desikan et al., 2006;Frazier et al., 2005;Goldstein et al., 2007;Makris et al., 2006). The results (shown in Tables S7-S10) were in the same line as for the Shen50 atlas.

| The effect of type of correlation and the handling of negative values
In a second analysis (analysis 2, Table 1 (Table S4). There was also a significant interaction effect (Table   S4).
Hub consistency ranged from 0.22 to 0.29, and the average consistency of the modularity structure was between 0.17 and 0.26.

| The effect of parcellation granularity
In the third analysis (analysis 3, Table 1, Figure 1c), we used the simple denoising pipeline and weighted graphs in which weights were calculated based on the absolute value of the (partial) correlations. We only varied the number of parcels per hemisphere: 50 versus 100 by comparing the results for the Shen50 versus Shen100 parcellation.
Mean TRT values of all global graph measures obtained using the Shen100 parcellation decreased compared to the values obtained using the Shen50 parcellation (Table 4A). However, the difference was not significant (all p-value > .05) ( Table 4B). The distribution of TRT for all global graph measures for the Shen50 and Shen100 parcellation can be found in Figure 6.

| D ISCUSS I ON
In this study, we investigated the effects of denoising pipelines and network constructions on reproducibility of graph measures. We found that the choice of denoising pipeline did

| Denoising: simple denoising versus complex denoising
We used a simple and a complex denoising strategy by means of a model with a low or a high number of parameters. The parameters of the simple model included the six basic head movements and regressors for WM, CSF, and a GSR. This pipeline is widely applied for preprocessing of functional connectivity studies (Fox et al., 2009) (Parkes et al., 2018) which is very similar. In our study, we also found that the reproducibility of the graph measures of networks preprocessed using either the simple or complex denoising strategy was very similar and statistically not different.
Including the GSR, dominated by non-neural signals such as motion-related and respiratory noise rather than signals from regions of gray matter, has been demonstrated to noticeably decrease the impact caused by head movement-related artifacts and other physiological confounds (Power, Plitt, Laumann, & Martin, 2017). It had also been reported that including the GSR could introduce negative correlations (Fox et al., 2009;Murphy et al., 2009) and may have an impact on graph measures (Chen et al., 2018). In this work, we included the GSR based on the work of Parkes and Ciric (Ciric et al., 2017;Parkes et al., 2018). The combination of head movement parameters and physiological non-neural signal regressors and GSR is a relatively effective way to decrease the influence by noise. It improved the reproducibility of the functional connectivity among regions (Parkes et al., 2018). Additionally, we investigated differences in reproducibility between a simple and a complex denoising models at the subject level and in both denoising models, a global signal regressor was included.

F I G U R E 2 TRT (%) values of global graph measures in binary networks
. The x-axis indicates density (in percent) of the network ranging from 5% to 40%. S, simple denoising; C, complex denoising; Cor, Pearson correlation; Parco, partial correlation; λ n , normalized characteristic path length; C n , normalized clustering coefficient; E n , normalized efficiency; and BC n , normalized betweenness centrality

| Definition of correlation: correlation versus partial correlation
In this study, we found that graph measures based on Pearson correlations showed lower TRT values versus those based on partial correlations. Once nodes have been defined, another important question to consider is how to quantify the interaction among these spatially distinct regions through neurobiologically interpretable quantities. The most commonly used functional interactions are based on correlations (Kirino et al., 2019) or partial correlations  between the time course of two nodes. Functional connections based on correlations between two spatially distinct regions could be driven by other regions (Wang et al., 2014). Functional connections based on partial correlations avoid this caveat, and they are more related to effective connectivity (Marrelec et al., 2006;Smith et al., 2011). Therefore, they are a more appropriate method (Dawson et al., 2016;Wang, Kang, Kemmer, & Guo, 2016) to detect biologically interpretable alterations of graph measures under different disease conditions.
This might also explain the higher TRT values of graph measures based on partial correlations: They could provide more biologically meaningful information and therefore are more dependent on the brain state.

| Weight selection
We demonstrated that the weight selection affected the TRT values of graph measures. How to generate weights when applying a weighted network analysis has not been systematically investigated until now. We explored its effect on TRT of graph measures through combining it with other network construct factors. The weights were derived either using raw (partial) correlations or using Z-values obtained after a Fisher r-to-Z transform. The weight distribution depends on this choice.
When looking at differences at the functional connectivity level, it is mandatory to use a Fisher r-to-Z transform in order to perform a correct statistical analysis. However, this transformation depends on the number of data points that is used. In this study, the calculation of graph measures was based on the functional connectivity. As a result, the weights when calculated using these Z-values also depend on the number of data points. As long as all data have the same number of data points, this is not a fundamental problem; but when

| Binary versus weighted networks
Binary networks can be constructed independently of the method to define weights since they are typically studied at different densities and the selection of the highest connections based on raw (partial) correlations, transformed Z-values, or weights is the same since these methods preserve the order of highest connections. However, we have shown that the reproducibility of binary networks at the subject level is low especially at lower densities such as 5%-10%. This is not surprising since at lower densities, each change in connection has a more dramatic impact on the graph measures. Group based binary networks are more reproducible (Wang et al., 2014) but in the current study, we focused on the use of graph measures derived from single subjects since this is the preferred method to use in a clinical setting (Xiang et al., 2019).

| (Partial) correlation values: absolute, positive, and negative
When using (partial) correlations as measure of functional connectivity between nodes, one needs to decide how to handle negative values. In most studies, either the absolute value is used or the connectivity is limited to only those connections with a positive value while neglecting the negative (partial) correlations (Kazeminejad & Sotero, 2019). However, negative correlations may contain important biological information, which cannot be neglected (Kazeminejad & Sotero, 2019). We found that the reproducibility of graph measures for networks based on only positive or only negative values were worse compared to taking the absolute value. Furthermore, the proportion of positive and negative (partial) correlations in a given network may vary, which may induce bias if we take only either positive or negative correlations into account. Therefore, we recommend using the absolute value, because connections with negative correlations are taken into account and the absolute value can be interpreted as the amount of information shared between two regions. The only drawback is that we cannot distinguish between positive or negative correlations with the same amplitude.  et al., 2018). In our study, we have used a brain parcellation obtained from connectivity-driven analyses using spectral graph theory (Shen et al., 2013). These parcellations were available for different levels of granularity, which made it easier to study the effect of granularity.

| Granularity of parcels
Using parcellation schemes which were based on different criteria would be an extra confounding factor. Furthermore, there is currently no golden standard for which parcellations to use.

| Hubs and modularity
The hubs and the modularity structure were only partially consistent at the subject level. This may be partially explained because of the low reproducibility of the local graph measures. Another possibility is that the identification of hubs and modules is affected by different brain states present during the resting-state experiment (Kabbara et al., 2019). As a result, the hubs and modules may be different during different brain states which may explain the limited reproducibility. This hypothesis has to be tested by further investigations.

| Limitations
There are a number of limitations in this study. First, we did not investigate all possible denoising pipelines, because the effect of denoising pipelines on functional connectivity has been studied previously (Ciric et al., 2017;Parkes et al., 2018

| CON CLUS IONS
We systematically investigated the reproducibility of graph measures of brain networks defined at the subject level as potential biomarkers in clinical work. The denoising pipeline did not affect the reproducibility. The reproducibility of graph measures of individual binary networks was insufficient especially when the density of the network was low. This was also the case for the reproducibility of nodal graph measures based on weighted or binary networks.
For weighted networks, using the absolute value of the (partial) correlation as weight, was the method of choice and the reproducibility does not critically depend on the level of granularity.

CO N FLI C T S O F I NTE R E S T
The authors declare no competing financial interests.

AUTH O R CO NTR I B UTI O N S
QR, RV, and PD designed the study; QR and PD performed the research; PD made the scripts used in this study; QR, TJ, JS, KM, RV, and PD wrote the paper and critically checked the content.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are available from http://neuro infor matics.harva rd.edu/gsp/.  F I G U R E 6 TRT (%) values of global graph measures based on different granularity. The x-axis indicates the level of granularity. 50 and 100 relate to the 50, respectively, and 100 parcels per hemisphere used in the brain parcellation. BC n : normalized betweenness centrality; C n : normalized clustering coefficient; E n : normalized efficiency; λ n : normalized characteristic path length