Computational Pipeline for Analysis of Biomedical Networks with BioNAR

In a living cell, proteins interact to assemble both transient and constant molecular complexes, which transfer signals/information around internal pathways. Modern proteomic techniques can identify the constituent components of these complexes, but more detailed analysis demands a network approach linking the molecules together and analyzing the emergent architectural properties. The Bioconductor package BioNAR combines a selection of existing R protocols for network analysis with newly designed original methodological features to support step‐by‐step analysis of biological/biomedical . Critically, BioNAR supports a pipeline approach whereby many networks and iterative analyses can be performed. Here we present a network analysis pipeline that starts from initiating a network model from a list of components/proteins and their interactions through to identifying its functional components based solely on network topology. We demonstrate that BioNAR can help users achieve a number of network analysis goals that are difficult to achieve anywhere else. This includes how users can choose the optimal clustering algorithm from a range of options based on independent annotation enrichment, and predict a protein's influence within and across multiple subcomplexes in the network and estimate the co‐occurrence or linkage between metadata at the network level (e.g., diseases and functions across the network, identifying the clusters whose components are likely to share common function and mechanisms). The package is freely available in Bioconductor release 3.17: https://bioconductor.org/packages/3.17/bioc/html/BioNAR.html. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
The technology available to support biological experiments has made rapid advances in recent years, with massive steps forward in both the sensitivity and throughput of methods to analyze biological samples across multiple levels.The most widely known advances are those in the high-throughput sequencing of DNA and RNA, but there have been step changes in a wide range of areas, including proteomics, metabolomics, and connectomics.Integrating these datasets into network models allows us to elucidate key functional interactions, pathways, and complexes that are required for healthy function and that, when perturbed, often lead to disease.
Proteomic data are typically represented via static undirected protein-protein interaction (PPI) networks, in which vertices represent the proteins and edges represent the molecular interactions.From a PPI network, one can derive many statistical measures of topology and fundamental properties and use these to gain insight and make predictions about the underlying data (Bánky et al., 2013;Vidal et al., 2011).For example, "scale-free" (Barabasi & Albert, 1999) properties and small-world paths, found in many biological networks, in combination with various centrality measures, such as degree and betweenness, have been widely used to identify "hub" molecules, which often encode diseaserelated proteins (Vidal et al., 2011).
Protein networks obtained from modern proteomic studies are usually large (thousands of proteins) and contain the components of multiple signaling pathways, linked together by other proteins whose roles in the network are less well understood.A common approach is to divide PPI networks into communities (or clusters) based on their connecting architecture, which can be achieved with multiple clustering algorithms.
Proteins in a network model are usually annotated with functional or/and disease information.Given the topological structure from clustering, it is often useful to test how these annotations are distributed throughout the network: Do they tend to concentrate in specific network communities?Do some annotations share common subnetwork patterns?To study this, disease and/or functional associations can be mapped over PPI vertices.Functional and disease enrichment of clusters can then be calculated using a suitable hypothesis driven test, i.e., the hypergeometric test, and tested against a permutation study ( McLean et al., 2016) to reveal the clusters within the network that are significantly enriched for specific annotations.This kind of approach has been extensively applied in our set of case studies and was used to link together neuronal diseases (typically highly polygenic) onto the molecular pathways and clusters in the synapse proteome (Pocklington et al., 2006).In summary, a common and important task in biological network analysis is in testing how well the network topology (interconnectedness) correlates with distribution of specific functional (or dysfunctional/disease) annotations (Fernandez et al., 2009;Klemmer et al., 2009;Zhu et al., 2007).
We designed BioNAR to provide a universal topologically based network analysis pipeline deployable on high-performance computing environments that enables users to load networks generated and/or annotated using their lab's own metadata and perform

of 31
Current Protocols Figure 1 The pipeline for a typical network analysis study is shown, along with dependencies between the respective protocols (described in detail in the main text).SP, Support Protocol.
a step-by-step network analysis with respect to their specific scientific questions, while also making the tool as widely applicable and flexible as possible.
Six main protocols presented here form a procedural guide for network analysis using BioNAR (Fig. 1).Basic Protocol 1 describes how to initiate a network model, import relevant datasets, and annotate the network with various metadata.This results in a base network model, on which all the other analyses in the pipeline are then based.Support Protocol 1 explains how to install the BioNAR package using R integrated development environment (IDE) RStudio.Basic knowledge of R language is assumed.
In Basic Protocol 2, the raw network model is analyzed as a whole and its general properties, e.g., "scale-freeness" and major network centralities, are calculated and compared against those of randomly generated network models of the same size.Basic Protocol 3 describes how to subdivide the network into communities using a selection of nine available algorithms that have been used in a range of published studies.Basic Protocol 4 further compares the results of different clustering algorithms and estimates their effectiveness based on a general assumption that biologically plausible clusters are more likely to show community enrichment with relevant functional annotation terms.Basic Protocol 5 then shows how to estimate the vertex property bridgeness based on the consensus clustering matrix, and how to rank the vertices based on their topological importance within the network.This can be used to identify candidate linkages between functional compartments in the model.Finally, Basic Protocol 6 demonstrates how to test the topological overlap of different annotations, which may indicate shared mechanisms that are only visible using the network context.
Given our previous and ongoing work is related to synaptic proteome, we illustrate the package's functionality with a postsynaptic network model, derived from the publicly available synaptome.dbpackage (version 0.99.12), which is described in Support Protocol 2.

of 31
Current Protocols

NETWORK INITIATION AND ANNOTATION
This protocol describes how a network is initiated in the BioNAR package and how annotations are added for further analysis.BioNAR implements networks as R data frames, in which each row corresponds to a vertex interactor pair and each vertex has a unique identifier (vertex_ID).Therefore, a network can be directly uploaded as a data frame in this format.Alternatively, a network can be imported from most standard graph file formats, including .gml and other formats supported by the igraph R package.BioNAR also allows network import for synaptic protein sets/synaptic compartments/brain regions directly from the synaptome.dbpackage (Sorokina et al., 2022); this is described in Support Protocol 1.For constructing the protein interaction networks described here, we selected the NCBI Gene Entrez ID to use as a unique "vertex_ID" for each node/protein.
Many graph-related algorithms behave poorly when applied to graphs containing disconnected components.Consequently, most biological networks-including PPI networksare constructed to retain only the network's largest connected component of vertices.We follow this practice for downstream analysis as well, but BioNAR also stores the disconnected components, as they may be of importance to the user at a later stage.
Vertices are typically annotated with categorical or continuous metadata.Annotations are usually handled in a three-column data frame format, where the first column contains the annotation term ID, the second column the annotation term name, and the third column the associated vertex_ID.All annotation terms for the same vertex_ID are collected and converted into semicolon-separated lists to store all annotations of the vertex held as a string in the vertex annotation.For example, if a protein is annotated with two different molecular functions A and B, in vertex annotation it will be stored as "A; B." In proteomic networks, annotations typically include Gene Ontology (GO) annotations and gene-disease association values (GDAs), and the direct import of both of these types of annotations from public databases is implemented in BioNAR.However, the package supports any custom annotation, including gene-expression values, pathway membership data, and so on.As an example of adding custom data, this protocol describes the addition of new binary annotation for the SynGo ontology (Koopmans et al., 2019).BioNAR is also designed to allow the user to assign the results of any vertex calculation as a new vertex attribute.The ability to retain such calculated supports reproducibility, as many algorithms used in network analysis have a stochastic component, so that multiple invocations of the same analysis can create a distribution of results, which can now all be stored for later analysis.

of 31
Current Protocols 1. Install RStudio and BioNAR package as described in Support Protocol 1.
Open RStudio and launch the library BioNAR, as well as a few more frequently used libraries, assuming they have not been installed by default.Navigate (using the command setwd) to the directory in which you have stored the network and annotation files (in our case, it is called "Sample network").The commands summary(gg) and summary(g) will show the same graph composition from the text import (step 2a) or .gmlimport (step 2b), respectively: 2297 vertices and 9406 edges, with the Human Entrez ID used as the node name for all three graphs.
3. Annotate the network with GeneNames.In this step, the raw graph from step 2 are annotated by importing annotation data from the human genome-wide gene annotation database org.Hs.eg.db.For each node in the graph, the Entrez Gene Name is assigned using the annotateGeneNames function. g<-annotateGeneNames(g) Sometimes gene IDs become obsolete and gene names are not assigned.We can check for this: If the result is TRUE, we can find such genes and assign its GeneName manually.Some functions further down require unique GeneNames to function correctly.
4. Annotate the network with metadata.In this step, the raw graph from step 2 is annotated by importing annotation data from relevant external files and/or databases.For each node in the graph, the following annotation is added: associated human diseases are extracted directly from the Human Disease Ontology (HDO) using the anno-tateTopOntoOVG function and Gene Ontology terms (BP, MF, and CC) are extracted using the annotateGOont function.

of 31
Current Protocols g <-annotateTopOntoOVG(g, dis) g <-annotateGOont(g) summary(g) After calling summary(g), you will see that nodes now have attributes: ), and GO_CC (v/c), where id corresponds to internal graph ID, name -Human Entrez ID, TopOnto_OVG is the disease name according to the Human Disease Ontology (HDO), TopOnto_OVG_HDO_ID is the disease ID according to the HDO, GO_MF_ID is the Gene Ontology Molecular Function ID, GO_MF is the Gene Ontology Molecular Function name, GO_BP_ID is the Gene Ontology Biological Process ID, GO_BP is the Gene Ontology Biological Process name, GO_CC_ID is the Gene Ontology Cell Compartment ID, and GO_CC is the Gene Ontology Cell Compartment name.(v/c) and (v/n) indicate that the corresponding attributes are character and numerical values associated with vertices.
To annotate a graph with your own metadata, you need to provide it in the form of a data frame (see example "SynGO.txt"),where the first column should contain vertex IDs matching name IDs in the network (Human Entrez ID) and the second column the respective attribute values.When using the command annotateVertex(g), you will need to provide the name of your new attribute as the parameter name ("syngo" in our case) and the name of your annotation table as the parameter values ("sg").

INSTALLING BIONAR FROM RSTUDIO
Support Protocol 1 describes how install BioNAR from Bioconductor.3. Open RStudio and install the Bioconductor package BioNAR using the following commands.

of 31
Current Protocols BiocManager::install("BioNAR") library(BioNAR) This will check for and then install the Bioconductor package manager "BiocManager" if it is not already installed.It will then install BioNAR from Bioconductor.
During the installation process, BiocManager may determine that some packages are obsolete; in this case, it will ask: Old packages: 'x.y.z', … Update all/some/none? [a/s/n]: If this is the first package you are installing, we suggest typing "n"; if you've already have a working R environment, it is wise to type "a".

BUILDING THE SAMPLE NETWORK FROM SYNAPTOME.DB
BioNAR includes a direct plug-in to the Synaptic Proteome database, which contains 64 independent proteomic studies of mammalian synapses, recently curated into a single dataset describing a landscape of ∼8000 proteins (Sorokina et al., 2021).BioNAR does this through the Bioconductor package synaptome.db,which allows users to obtain the corresponding gene information, such as subcellular localization, molecular interactions, brain region, gene ontology, and disease association, and to construct customized proteinprotein interaction network models for gene sets and entire subcellular compartments (Sorokina et al., 2022).
Support Protocol 2 describes how to extract a custom PPI network for postsynaptic compartment based on a gene set comprising proteins that were found in three or more independent postsynaptic studies.

of 31
Current Protocols This will check for and then install the Bioconductor package manager "BiocManager" if it is not already installed.It will then install synaptome.dbfrom Bioconductor.
During the installation process, BiocManager may determine that some packages are not the most recent version; in that case, it will ask: Old packages: 'x.y.z', … Update all/some/none? [a/s/n]: If it is the first package you are installing, we suggest typing "n"; if you already have a working R environment, it seems wise to type "a".
2. Extract the compartment ID for postsynaptic compartment.To do this, navigate using setwd to the directory where you will store all the results; in our case this is "Sample network".The command match will return the database ID for required compartment, which is "1" in our case (for the postsynaptic compartment).
In our example, the gene list for the postsynaptic compartment is large (5568 genes), and we want to reduce it to the highest-confidence set of genes, found in at least three (arbitrary decision) different studies, which results in 2643 genes.To get this, we specify "3" for the findGeneByCompartmentPaperCnt command. 5. Build the PPI for the selected set of genes.
Alternatively, it is possible to create a network of "induced" type, in which case an expanded network including all possible interactors of the listed genes will be obtained from the database.This can be created with the synaptome.dbcommand getIGraph-FromPPI; this functionality is outside the scope of this paper (Sorokina et al., 2022).The resulting graph contains the one connected component consisting of 2297 vertices and 9406 edges, which we saved in a gml format to use for further analysis.

NETWORK PROPERTIES AND CENTRALITY
To assess any network model for biological plausibility, we often want to test a network's degree distribution for evidence of scale-free structure and compare it against an equivalent randomized network model.The assumption here is that a network based on noisy data will have more random architecture, whereas biological networks typically tend towards a more scale-free structure.For this we use the R PoweRlaw package (version 0.50.0;Gillespie, 2015), which deploys a goodness-of-fit approach to estimate the lower bound and the scaling parameter of the discrete power-law distribution for the optimal description of the graph degree distribution.
Although degree is the most commonly used vertex metric, there are other centrality metrics that can be estimated from the graph structure.BioNAR directly supports the calculation of the following network vertex centrality measures.Some are implemented via igraph (Gabor & Tamas, 2006): degree (DEG), betweenness (BET), clustering coefficient (CC), and page rank (PR; see igraph manual for details).Other measures appear only in BioNAR: semilocal centrality (SL), mean shortest path (mnSP), and standard deviation of the shortest path (sdSP).
Calculated vertex centrality values can be added as vertex attributes (calcCentrality) or returned as an R matrix (getCentralityMatrix), depending on user preference.Any other numerical characteristics, calculated for vertices and represented in a matrix form, can also be stored as a vertex attribute (applyMatrixToGraph).
To enable comparison of an observed network's vertex centrality values to those of an equivalently sized randomized graph, we enabled three randomization models: the G(n,p) Erdös-Rényi model (Erdös & Rényi, 1959; "gnp," illustrated in the example below), the Barabasi-Albert preferential attachment model (Barabási & Albert, 1999; "pa"), and the derivation of a new randomized graph from a given graph by iteratively and randomly adding/removing ("cgnp") or rewiring edges ("rw").
For proteomic networks with matching multi-condition gene-expression data, scale-free structure has also been demonstrated by using the expression data to perform a perturbation analysis on the network to measure network entropy (Teschendorff et al., 2015).Originally, this kind of analysis was designed for comparing a control with a perturbed network (e.g., wild-type versus mutant, untreated versus treated), where vertices with low entropy rate appear to be the most important players in disease propagation.
However, the assessment of scale-free structure does not actually require gene-expression data, as it based solely on the network topology.In BioNAR, following the procedure described by Teschendorff et al. (2015), all vertexes are artificially assigned a uniform weight and then sequentially perturbed, with the global entropy rate (SR) after each protein's perturbation being calculated and plotted against the log of the protein's degree (see Theoretical Appendix for more detail).In case of scale-free or approximate scalefree topologies, we see a clear bimodal response between over-weighted vertices and their degree and an opposing biphasic response in under-weighted vertices and their degrees.
Basic Protocol 2 starts with estimating and fitting the observed graph degree distribution to a power law and comparing this against a random network.At the next step, we estimate the graph entropy, again comparing it to the random graph.Finally, we estimate the range of centrality metrics for our graph and a randomly generated graph; in our example, we use the Erdös-Rényi model.

Check the network degree distribution.
pFit <-fitDegree(as.vector(igraph::degree(graph=g)),Nsim=1000, plot=TRUE,WIDTH=2480, HEIGHT=2480) pwr <slot(pFit,'alpha') This command will take the graph created in Basic Protocol 1, fit it to a power-law distribution, and return a figure consisting of a log-log plot of the cumulative distribution function (CDF) of postsynaptic PPI network degree distribution (Fig. 2A).During fitting, code performs a bootstrapping hypothesis test to determine whether a

Current Protocols
Figure 2 (A) The results of a fit of network degree distribution (P(k)) versus its degree (k) for the postsynaptic network.The red line shows the best-fit power-law distribution with estimates of the scaling parameter α (2.63) and k min (13), with a goodness of fit of 0.32 after 1000 iterations of the bootstrapping.(B) Distribution of alpha values for 5000 random G(n,p) Erdös-Rényi graphs, with actual α value (2.63) shown as a vertical line near the left.
power-law distribution is plausible.The number of bootstrapping rounds is defined by the parameter Nsim, which is set to 100 by default.Setting the Nsim parameter to large values makes calculation slower: for example, with Nsim = 1000 it takes couple of minutes to get a fit.By using a slot function, we extract scaling coefficient α to compare with the random graph results.
2. Compare network degree distribution with a set of random graphs.
The following commands will generate a sample of 5000 random G(n,p) Erdös-Rényi graphs and estimate α coefficients of their best fits to the power-law distribution.To save compute time, we reduce the number of bootstrapping tests to 10 in this example; however, it still takes about hour and a half to complete the task, using 500 samples instead of 5000 to save time.The histogram of obtained values is plotted in Figure 2B, where the vertical line on the left indicates the value (2.63) for the actual graph.

Calculate graph entropy.
As we do not have actual expression data for our graph, we can use simulated data (see the Theoretical Appendix for more detail).The command getEntropyRate estimates the global entropy rate for the whole network, whereas getEntropy estimates entropy value for each network vertex.The command plotEntropy visualizes the results as a plot (Fig. 3A).We can see that entropy for both up-and down-perturbed lowdegree vertices stays close to that value, whereas the hubs (high-degree vertices) deviate from it significantly, which makes for a bimodal response between gene overexpression and degree and opposing biphasic response relative to over-/underexpression between global entropy rate and degree, observed only in networks with scale-free or approximate scale-free topology.(B) SR values for a randomized G(n,p) Erdős-Rényi network, perturbed in a similar way.
g <-calcCentrality(g) The command calcCentrality assigns centrality measures the following node attributes: DEG, degree; BET, betweenness; CC, clustering coefficient; SL, semilocal centrality; mnSP, mean shortest path; PR, page rank; and sdSP, standard deviation of the shortest path.Rather than saving these centrality values on the graph, e.g., to provide different names for the vertex centrality attributes, they are obtained in matrix form: mc <-getCentralityMatrix(g) The matrix will contain eight columns for each of the measures above, with the rows corresponding to each node's database ID (2297 nodes in total, synaptome.dbversion 0.99.12).

Calculate the graph centrality measures for a randomized graph.
ggrm <-getRandomGraphCentrality(g, type = c("cgnp")) We need to select the type of randomization, "cgnp" in our case, that corresponds to a sampling algorithm, which will randomly perturb the graph adjacency matrix and shuffle its vertices in such a way that the correlation between new and old adjacency matrix is 75%.

of 31
Current Protocols
In this protocol, first, community structures are obtained for the network used in previous protocol.The user can select some or all of the available clustering algorithms to run simultaneously.This community structure(s) can then be visualized.The process can also be iterated by re-clustering the largest communities after each cycle (the threshold for communities to be re-clustered can be defined).Finally, we extract a summary table that compares across all the clustering algorithms used in the analysis.
1. Use the network g obtained with previous protocols.

Cluster the network.
We provide two functions, calcClustering and calcAllClustering, that use calcMembership to calculate community memberships and store them within the graph vertices attributes named after the algorithm.The difference between them is that calcAllClustering allows the user to calculate memberships for all clustering algorithms simultaneously (this can be slow, especially on larger graphs, e.g., up to 30 min for our full example network), and store them as graph vertices attributes, whereas calcClustering allows users to select a specific algorithm.
The user also has the option to select one or more specific clustering algorithm to run over their network, because running all clustering algorithms over the large network might be time consuming.For instance, to cluster the network g using only the Louvain algorithm, the command would look like: Note that your own summary may look slightly different, as some of the algorithms use randomization, so each time they will give slightly (but not significantly) different results.
For comparing output across different clustering algorithms on a network, a summary matrix is created, consisting of: the maximum modularity obtained (mod), the number of detected communities (C), the number of singlet communities (Cn1), the number of communities with size ≥100 (Cn100), the fraction of edges lying between communities (mu), the sizes of the smallest community (Min.C) and the largest community (Max.C), and the average (Mean C), median (Median C), first quartile (1st Qu.C), and third quartile (3rd Qu.C) of community size (Table 1).
4. Visualize the community structure results.
The BioNAR package provides functionality to visualize a network's community structure with an in-built cluster-driven layout algorithm, which is suitable for networks up to tens of thousands of vertices and millions of edges.This layout splits the network into clusters, lays out each cluster individually, and then combines individual layouts with the igraph function merge_coords, so that each distinct community is shown independently and painted in a unique color.In our example case, we show the community structure for the results obtained by the Louvain clustering algorithm, which resulted in 15 communities (see Table 1).For visualization purposes, we need to extract the membership in a form of a data frame.A palette is provided using dis-tinctColorPalette command from the package randomcoloR, which defines an individual color for each community; the layoutByCluster command calculates the layout.Additionally, you can define edge color, size of the node, and position of the legend (Fig. 4A).McLean et al.

of 31
Current Protocols 5. Re-cluster the obtained community structure (optional).
A common phenomenon when applying modularity-based clustering algorithms over networks of a large size is to end up with a small number of larger, or "super-," communities, masking any substructures within these super-communities.In this situation, we provide the user with the facility to re-cluster these large/super-communities in a hierarchical manner, applying the same, or potentially a different, clustering algorithm at each iteration (re-cluster).We also need to specify the threshold for cluster size, which will be retained during re-clustering procedure-10 in our case-and then we use the same algorithm again.
remem<-calcReclusterMatrix(g,mem.df,alg = "louvain",10) head(remem) This will return a table where each vertex name (1st column) will be assigned a cluster membership in the original graph (2nd column) and another cluster membership in the re-clustered graph (3rd column).In the example case of Louvain clustering results, the original clustered graph had 15 communities, and the re-clustered graph had 127.
6. Visualize the re-clustered community structure.The re-clustered graph is shown at Figure 4B.

IDENTIFYING THE OPTIMAL CLUSTERING ALGORITHM
The algorithms considered in Basic Protocol 3 are based on a range of different mathematical approaches and will give different results for the same network (as can be clearly seen in Table 1).However, for further analysis, presentation, or publication, it is usually necessary to select one clustered network model based on a single algorithm.In

of 31
Current Protocols most cases, there is no ground truth, and the choice of the "correct" or "best" clustering algorithm is subjective.While accepting that there is no "correct" or "best" clustering approach, however, we can at least assess which algorithms identify "useful" clusters.To do this, we rely on an assumption that proteins/molecules that are well known to cooperate in the same complex/molecular pathways should be more likely to be found together in the same cluster.Thus, we can use our functional annotations (such as GO annotations) to help identify clustering methods that fit that assumption.
To achieve this, we estimate the number of enriched communities for each algorithm and for each annotation term, considering the network's size, the size of each cluster, and the number of annotated genes in the network and in the clusters (see Theoretical Appendix for background details that underpin the method).
The protocol starts by estimating the overrepresentation of annotation terms in each community discovered by each clustering algorithm.Overrepresentation analysis (ORA) is a common approach to identify annotation terms that are significantly over-or under represented in a given set of vertices compared to a random distribution.In biological networks, GO terms and pathway names are amongst the most frequently used.ORA differs from Gene Set Enrichment Analysis (GSEA) in that the latter use numerical values associated with genes, such as expression values, whereas the former relies on presence/absence data utilizing null hypothesis tests, such as the hypergeometric test.To keep the package as general purpose as possible, we used the Bioconductor package fgsea to implement ORA functionality on top of arbitrary string vertex annotation and vertex grouping, obtained by clustering (but not limited by it).We represent the results of ORA as an R data frame, with rows representing the combination of annotation term and cluster and columns-the enrichment characteristics, including the size of the cluster (Cn), the number of annotated vertices in the graph (Fn), the number of annotated vertices in the cluster (Mu), the odds ratio (OR) and its 95% confidence interval (defined by CIl and CIu), and the fold enrichment (Fe and FC).We also provide p-value, adjusted p-value, size of overlap, and the list of vertices from the cluster that contribute to the annotation term.Using the odds ratio allows us to distinguish functionally enriched communities relative to functionally depleted communities.
If we were to rank algorithms simply according to the percentage of functionally enriched communities, it would not tell us anything about the size of community the enrichment originates from-that is, whether the enrichment occurs within extremely large or small communities.Therefore, we plot the fraction of functionally enriched communities greater or equal to the log of its Fe value, measured at 100 intervals taken from 0 to the maximum Fe value (the maximum found from all algorithms studied).Because Fe values take into consideration the size of communities, at this step we can exclude enrichment from communities at the extreme sizes.In general, functional enrichment of non-extreme communities is observed by those clustering algorithms following a sigmoid distribution (see Fig. 5A).Thus, we test how well each distribution fits to a generalized sigmoid function using the two-sample Kolmogorov-Smirnov (KS) test to the goodness of fit of each distribution to our set of five idealized sigmoid curves (see Theoretical Appendix for technical detail).
To reproduce clustering results analyzed in this step, we recommend using the example network, built and stored in the external file PSD_annot_cls.gml in BioNAR.The network in this file is the same as in previous protocols but contains the clustering results already assigned to each node.1. Load the network from BioNAR.file <system.file("extdata","PSD_annot_cls.gml",package = "BioNAR") gg <igraph::read.graph(file,format="gml")#graph from gml summary(gg) 2. Estimate enrichment for the one clustering algorithm and one annotation.
If we want to take a look at overrepresentation for a specific algorithm, we need to select the algorithm and then use the function clusterORA to perform overrepresentation analysis for the results obtained with specified algorithm.We also need to specify the annotation set that we will use for assessing the clustering results.Here, we use GO Molecular function annotation ("GOMFID"), but it could be Biological Process or any other annotation, stored as a node attribute in our graph.
To compare all the clustering results, we need to select all algorithms that we want to compare.Then, using the function clusterORA, we perform the overrepresentation analysis on the results obtained with the specified algorithms.As before, we specify the annotation: GOMFID.Finally, we need to calculate FeMax, which corresponds to the maximum Fe value, and FcMax, which corresponds to the maximum Fc value.These values will be needed for further analysis to specify the required range of Fe and Fc values.

Analyze unadjusted p-values for enrichment and print summary table (optional).
The command summaryStats will combine the results of enrichment obtained in the step 1, which will produce a list of tables: "SUM," "SUM2," "SUM3," "SUM4," and "CAN," where the "SUM" table contains main summary that can be used for detailed analysis of algorithm/term pairs (which is outside the scope of this protocol) and "SUM2," "SUM3," and "SUM4," containing the auxiliary values for further fitting to sigmoid curves.
The key table for the enrichment analysis is "CAN," which records the annotationterm-to-cluster-association data for each clustering algorithm.This shows the user which clusters an enriched annotated term is associated with.As illustrated by the example below, "CAN" consists of four columns, where "ALG" corresponds to the name of the algorithm, "Fn" to the respective GO term ID, "C" to the ID of the cluster, and "Mu" to the number of the genes in the cluster annotated with this term (the cardinality of the pair; see Table 2 for the first six rows of the "CAN" table).

names(statsR1)
View(head(statsR1$CAN)) 5. Plot the fraction of enriched communities and rank the algorithms.
The command plotRatio creates a rank table for the algorithms and four plots, p1-p4, all showing the distribution of the fraction of enriched communities against fold enrichment (Fe).p1 and p3 show the distribution plotted against log2(Fe), whereas p2 and p4 use log(log2(Fe)).p1 and p2 highlight the top three distributions, whereas p3 and p4 (shown in Fig. 5B) plot all algorithms each with a different color.Table 3 and Figure 5B show that the Louvain algorithm gives the most useful distribution (for the example given), followed by sgG2 and fc.
This command fits a sigmoid distribution (described in the Theoretical Appendix) to the fraction of enriched communities against Fe values for each clustering algorithm (Fig. 6).The function creates a list called fitres, which contains the fitting results.
To visualize the fitting results, you need to print gridplot, which is the part of fitres.

Current Protocols
Figure 7 Results of fitting to five generalized sigmoid functions with added noise.Each panel, (A)-(I), illustrates a different clustering method.The confidence interval is wider for wt (B) than in Figure 6, and is unbounded for infomap (D), sgG5 (H), and spectral (I), which indicates that fitting for results of these algorithms is unstable.See also legend to Figure 6.
To view the fitting results, you will need to select fitInfo.They will be presented in the form of a table with the following columns: "alg," corresponding to the algorithm's name; "isConv," indicating whether the fit has converged; "finTol," for the final fitting error; and "stopMessage," providing a message describing the reason to stop fitting.You can also assess the results of two-sample Kolmogorov-Smirnov (KS) test by selecting "ks" below.Here the columns will correspond to the p-values from the Kolmogorov-Smirnov test of correspondence between Fe distribution and sigmoid function rates.To test the robustness of the fit, we can add some noise to results obtained in previous step.We added noise to each data point by randomly sampling from a Gaussian distribution with mean 0 and a standard deviation of [0.01, 0.05, 0.1, 0.5].For this we must specify the level of noise (here, 0.05) while executing the commands in

of 31
Current Protocols step 4. The resulting fitting results with noise and values can be assessed in a similar way (Fig. 7).It can be seen that the fc, Louvain, and sgG1 algorithms give the more robust results.We should note here that users should be cautious not use the same datasets (directly or indirectly) for generating the clustered network and testing it.In our example here, we used GO terms, which are not linked to the methods or datasets used in either the clustering methods or the construction of the network architecture, including selection of nodes and edges.

IDENTIFYING THE INFLUENTIAL PROTEINS WITH BRIDGENESS
Not all proteins act similarly in propagating signals, or information, through a network.It is often assumed that proteins that interact with many partners have more significant impact on signal propagation or on disease mechanisms when perturbed.It is also generally found that the majority of proteins interact with just a few neighbors, and thus their contribution is generally predicted to be less impactful (Vidal et al., 2011).The importance of a protein in propagating information appears to be dependent on its nearest neighbors, as well as its ability to influence other communities relative to the one to which it most likely belongs (Nepusz et al., 2008(Nepusz et al., , 2012)).The bridgeness metric reflects the probability that a vertex could belong to more than one community at the same time, thus providing a useful measure for ranking the vertices based on their topological influence and formation of linkages between clusters in the network model (Nepusz et al., 2008(Nepusz et al., , 2012)).
The protocol starts from the testing of the robustness of communities obtained in the "most useful" algorithm, obtained in Basic Protocol 3.For this, the consensus matrix is produced by creating smaller network by randomly keeping a proportion (by default 80%) of the network edges (type = 1) or vertices (type = 2) and rerunning the clustering algorithm on largest connected component of that network.This is then repeated (by default, 500 times) to produce a distribution (matrix) of clustered networks.This new matrix is used to estimate the bridgeness, which takes values between 0, implying that a vertex clearly belongs to a single community, and 1, implying that a vertex forms a "global bridge" across every community in the network with the same probability (Nepusz et al., 2008(Nepusz et al., , 2012)).
Although useful itself as a measure of a "global" network importance, bridgeness becomes more informative when combined with other vertex centrality measures, such as semi-local centrality.Semi-local centrality considers the nearest and next to the nearest vertex neighbors, so reflects the "local" importance of the protein.It also lies between 0 and 1 and indicates whether the vertex is likely to have local influence.Plotting bridgeness against semi-local centrality allows us to categorize both the local and global influence of each vertex within a network given only the network structure.BioNAR also supports the comparison of bridgeness against any vertex centrality measure (or any normalized numeric vertex value) of the user's choice, e.g., against page rank.
The protocol for this starts with estimating the bridgeness from the consensus matrix obtained in Basic Protocol 4 and proceeds by plotting the bridgeness against the centrality measure of choice, which is implemented in two ways.In this example, we continue working with the network from Basic Protocol 4 and the algorithm that showed the best enrichment performance there, i.e., Louvain.

of 31
Current Protocols 1. First, calculate the consensus community structure: Here, "alg" selects the clustering algorithm to be used, "type" the sampling scheme (1 sample edges and 2 sample vertices) used, "mask" the percentage of edges or vertices to remove from the graph, "reclust" whether re-clustering should be performed on the community set found, "Cnmin" minimum cluster size, and "Cnmax" the maximum cluster size above which re-clustering will be performed (if reClust = TRUE).The resulting matrix conmat has dimensions of 2297 × 2297, where each matrix element is assigned the frequency with which a pair of nodes vertices is found in the same cluster.
2. Calculate the robustness of community structure.To do this, assign to each cluster a value in a range from 0, indicating no confidence in the community existing, to 1, indicating absolute confidence in the cluster existing, thus evaluating the "goodness" of a chosen clustering algorithm.

Calculate the bridgeness.
br<-getBridgeness(gg,alg = "louvain", conmat) head(br) The command getBridgeness takes the graph and consensus matrix (conmat) from step 1 as an input and returns the bridgeness results in a form of table with three columns, where the first column contains ID (Human Entrez ID), the second GENE.NAME (Human Gene Name), and the third BRIDGENESS.louvain(bridgeness values obtained with the Louvain algorithm).Next, assign bridgeness values as vertex attributes: g<-calcBridgeness(g,alg = alg, conmat) vertex_attr_names(g) For convenience, bridgeness values will also be stored as vertex attributes.

Plot bridgeness against the semilocal centrality (SL).
If we want to highlight the proteins of interest on the plot, we can specify their IDs as VIP vertices.In our case we will highlight the synaptic proteins with known function "Protein cluster," based on a classification, extracted from an unrelated published study (Lips et al., 2012), which is also stored in BioNAR as an external data file.

of 31
Current Protocols By plotting bridgeness against semi-local centrality (Fig. 8), we have categorized the influence each protein found in our network has on the overall network structure: • Region 1 (top left): Proteins with a "global" rather than "local" influence in the network (vertices in this region have also been referred to as bottleneck bridges, connectors, or kinless hubs; 0 < SL < 0.5; 0.5 < B < 1).
To plot the same figure in an interactive manner (to see the names for non-highlighted dots), the function ggplotly from the plotly library can be called as follows: library(plotly) gp<-plotBridgeness(g,alg = "louvain",

IDENTIFICATION OF OVERLAPPING ANNOTATIONS
This protocol describes how to identify which annotations overlap, and where within the network different annotations tend to overlap (or, conversely, are distinct and separate).
For example, within large-scale molecular networks, disease-associated genes are often found closely linked to one another (referred to as disease modules by Menche et al., 2015, and based on the shortest path in the network linking the disease terms), and the composition of these modules can be compared across different diseases.Disease annotations that tend to overlap in these modules also tend to show significant similarities at the level of gene coexpression patterns, clinical phenotype, and comorbidity.Conversely disease annotations residing in separated network neighborhoods appear to be more phenotypically distinct.
This phenomenon is not restricted to diseases and can be generalized, so that given two annotations distributed across a network, a common query would be to find the points of intersection where the two annotation sets overlap (or segregate).To support such queries, we implemented the algorithm from Menche et al. (2015), which tests whether the observed mean shortest paths between two distinct annotation sets, superimposed on a network, is significant compared to a randomly annotated network.

of 31
Current Protocols compares it with that of a randomly reannotated graph.This can be useful for generating a qualitative overview of the relationships between the annotations.
We do not need the cluster structure for this analysis, so the annotated network from Basic Protocol 2 (g) is sufficient to perform Basic Protocol 6.
1. Calculate the annotation pairs.
The BioNAR command calcDiseasePairs calculates the observed overlap between two annotation sets on a network, and compares this to a single instance of the network with annotations randomly permuted; this is useful for a qualitative estimate of how likely an overlap is simply a random occurrence.
To calculate annotation overlap, we need to specify the name of the annotations.
In our example, it is TopOntoOVGHDOID, which contains the HDO IDs for diseases.We will also need to specify the IDs for which the overlap will be calculated."r" corresponds to random permutations (other options are "none," in which case no permutation will be applied at all, and "binned," in which case the permutation process tries to take into account the node degree, which is slightly longer than random).The command pander will return a snapshot of an 8 × 8 table, where for each of disease pair there is a value that ranges from negative (indicating potential overlap) to positive (indicating separation).
2. Calculate the significance of the annotation overlap.
To calculate the significance of observed overlaps (or separations) of the observed annotation pairs in the network, the command runPermDisease is used.This compares the overlap against multiple permutations of the network (where the user can define the number of permutations).Executing this command may take considerable time, depending on the number of permutations chosen.It generates a results table containing the overlap of each annotation pair with p-value, p adjusted by Bonferroni test, and q-value.
In our simple example below, we selected 100 permutations, but for better significance, there should be 10,000 permutations (requiring ∼1 hr of computing time for this size of network and this set of annotations).The pander command will return a table containing a number of rows equivalent to the number of annotation pairs and the following columns: HDO.ID.A, the HDO ID for the first disease; N.A., the number of genes associated with this disease; HDO.ID.B, the HDO ID for the second disease; N.B., the number of genes associated with second disease; sAB, a value that characterizes the overlap (negative for overlap and positive for separation); Separated, indicating whether the pair is separated based on sAB; Overlap, indicating whether the pair overlaps based on sAB; zScore, the Z score for this pair of annotations (see the formula in Menche et al., 2015); p-value, the respective p-value for the overlap; Separation/Overlap.than.chance,indicating whether the ratio of the separation and overlap is greater than expected by chance; Bonferroni, the Bonferroni value for the overlap; p.adjusted, the p-value adjusted for multiple testing; and q-value, the q-value adjusted by the FDR.An example of the

COMMENTARY Background Information
Communication, or information transfer, between the components of a network is a common feature across many biological scales.Therefore, it is not entirely surprising that methods designed to analyze information flow in other types of networks have proven to be well suited to the analysis of biological networks.The analysis of social networks has proven to be a rich hunting ground for useful methods, from which many have translated well to the analysis of molecular interaction networks (Li et al., 2011;Ma & Zeng, 2003;Wunderlich & Mirny, 2006).
Many individual software tools have been developed to address the basic steps required for the type of analyses described above.For example, the igraph package in R supports building a network, estimating popular centrality measures, and several types of clustering (Gabor & Tamas, 2006).Cytoscape (Shannon et al., 2003), and its plug-ins support graphical representation and reconstruction of molecular networks and provide a number of approaches to extract interactions from a variety of sources-for example, the STRING interaction database (von Mering et al., 2003)and perform clustering and estimation of the main centrality measures.Various other tools exist for functional enrichment analysis based on the GO, KEGG (Kanehisa & Goto, 2000), and Reactome (Gillespie et al., 2022) ontologies, such as the widely used DAVID package (Sherman et al., 2022), GOrilla (Eden et al., 2009), BiNGO (Maere et al., 2005), and GSEA (Subramanian et al., 2005), to name a few.The Bioconductor (Huber et al., 2015) package ClusterProfiler allows estimation of the annotation overrepresentation for the clusters within a network (Wu et al., 2021), and the fgsea package provides fast enrichment analysis against an arbitrary set of annotations (Korotkevich et al., 2021).
BioNAR combines the features above and adds new ones, reflecting the research the authors and their collaborators have been involved with in recent years.The methods presented here will work with any other similar type of graphical data.The protocols are described here for undirected graphs, but some of the methods will work fine on directed graphs, and incorporation of the graph directionality into package code is ongoing effort.The vertices are unweighted, although some methods will work on weighted graphs; however, the current package has not been tested on them.Finally, BioNAR does not currently delve into probabilistic graphical models.All three of these are avenues for future development, and as BioNAR is entirely open source, there are no restrictions on end-user customization or extension.

Critical Parameters
Users of these protocols should ensure that they are using a recent version of R and BioNAR.At the time of this writing, R (4.2.2) and BioNAR (1.2.0) are used.

Troubleshooting
Some potential errors and their respective solutions are presented in

Understanding Results
Each of the protocols described here has been illustrated using a walk-through example from the mammalian synaptic proteome (see Fig. 1).In brief, we extracted a consensus list of proteins from a curated and published database of synaptic proteomics studies.We filtered the database to extract a list of proteins that were found in studies looking at the postsynaptic density in mouse, rat, or human and were reported by at least three different published studies.The same database also contains a curated list of direct protein-protein interactions between these proteins extracted from public databases.
Basic Protocol 1 walks the user through the initial generation of a network model containing vertices (proteins) and edges (proteinprotein interactions).For your own network, you will need both these elements to build the network.One word of caution is that it is important to consider how well defined the interactions are: Are they based on direct molecule to molecule data?Many interaction databases comprise a mix of direct and indirect interactions, where indirect interactions mean that there may be undefined intermediate part-ners involved in the interaction between two species.These are factors that need to be considered depending on the application area.Alternatively, a model can be assembled using any other network package and saved in .gmlformat for import into BioNAR.The advantage of using BioNAR in this step is that it can be scripted for multiple instances of networks from larger datasets.
Basic Protocol 2 then provides a comprehensive suite of methods to analyze the network structure.The first of three subprocedures calculates a set of commonly used core network parameters that are generally used to assess whether the network looks real or could be contaminated with random noise (biological network models do typically contain at least some random noise).BioNAR offers two methods to measure how scale-free a network is, operating under the assumption that real biological networks tend towards a scalefree rather than a random architecture.One is based on comparing the observed network against a distribution of generated random networks of the same size and the other is based on measuring entropy within the network.
Basic Protocol 3 then applies a set of frequently used clustering algorithms to the raw network generated in Basic Protocol 1. BioNAR enables the use of a hierarchical approach to the network in which large (defined Current Protocols by user) clusters can be treated as networks in their own right and clustered again to dissect substructures within.Finally, the clustered network can be visualized graphically within BioNAR (Fig. 4), or alternatively a clustered network can be exported to be visualized in a package of choice.
Basic Protocol 4 considers the tricky problem of which clustering algorithm is "best."There are few, if any, reliable ground truths, and clustering algorithms all behave differently as networks change in size and/or density of connections.Rather than call any algorithm "better," we prefer the term "useful."The basic concept here is that we usually do know something about the networks we are working on.In the case of protein interaction networks, there are well-known classic biochemical pathways, and the molecules within these pathways tend to share annotation terms in GO and generally directly interact with one another.Therefore, it is not unreasonable to expect these terms to be enriched within specific clusters, and indeed we do see this in clustered networks.BioNAR provides a suite of tools to compare across these networks and plot statistics that enable the user to assess how useful different clustering algorithms might be in terms of producing clusters that reproduce the small elements of prior knowledge we do have.Of course, any assessment based on optimizing enrichment will depend on the quality of the annotations used and how relevant to biology.In all cases, it is vital to report exactly a clustering algorithm was selected and share the network from Basic Protocol 1) to allow others to try alternative algorithms.
Most clustering methods are based on a fundamental assumption that a node exists in exactly one cluster within the network.However, molecules can move, and they exist in in multiple copies, which permits different instances to be involved in different places at the same time.Further, the methods used to identify molecules or their interactions often lack full molecular resolution and identify a parent/canonical molecule rather than differentiating between all the variants expressed in a cell or tissue sample.As a partial move towards a more probabilistic network model, we incorporated bridgeness into BioNAR in Basic Protocol 5.By sampling the network and re-clustering multiple times, we can estimate how likely each molecule is to belong to each specific cluster.Although many vertices can be assigned fairly easily and are robust to perturbation, we find others that are much more susceptible to perturbation and may reflect molecules that are pleiotropic in their influence and interact with multiple pathways or molecular complexes.When we combine this measure with a local influencing metric, we can start to tease out molecules that are core to specific clusters versus molecules that have broad influence over the entire network.
Finally, we included Basic Protocol 6 to allow the users to examine relationships between annotations in the content of the network they have.This does not use the clustering in Basic Protocols 3-5 but rather builds off the initial annotated raw network in Basic Protocol 1.The results from this approach can provide fundamental insights into why annotation terms (such as genes associated with disease) often correlate despite there being no obvious similarity at the sequence level.When we overlay this onto the molecular interaction network, we can reveal where in the network these interactions occur and identify other molecules that are intimately associated with the correlation that might underpin a shared mechanism of disease.Put more simply, the molecules with shared function might be different at the sequence level but they tend to be close within the network.Conceptually this is the reverse of Basic Protocol 4.

Theoretical Appendix
The following provides a more detailed discussion of the technique's theoretical foundations but can be ignored by most readers.Readers without a strong computational background should be able to perform and understand the protocol without relying on this appendix.

Estimating entropy effects in the network
We test for evidence of structure in each network by performing an entropy-based perturbation analysis in a synthetically weighted network.In this analysis, the global entropy rate (SR) of the network is measured after each gene is individually perturbed, by either overexpression (SR_UP) or underexpression (SR_DOWN), and is plotted against the degree of the perturbed gene.To assign expression values to each vertex in the perturbation analysis, we followed Teschendorff et al. (2015) (implemented in BioNAR as a default procedure).Vertex weights are set to initial values of 2 and then sequentially perturbed to values of 14 when modeling activity, and are set to initial values of 16 with perturbed values of -14 when modeling inactively.

of 31
Current Protocols

Comparing cluster enrichment to identify the optimal algorithm
The hypergeometric distribution was used to calculate the significance of enrichment of each cluster for each annotation (equation ( 1)): where N is the total number of genes in the network, Cn is the number of genes in the community, F is the total number of functional annotated genes in the network, and μ fc is the number of functional annotated genes per community.
Enrichment was calculated using equation (2), which gives us the one-sided exact Fisher test, and depletion was calculated using an alternative test given in equation (3).The calculated p-values were corrected using the Benjamini and Yekutieli (B-Y) procedure (Benjamini & Yekutieli, 2001) and tested against the more stringent Bonferroni correction at the 0.05 (*), 0.01 (**), and 0.001 (***) significance levels.
Given the enrichment results, we can calculate the log of the odds ratio (OR) as: where the terms are the same as those stated above.
The fold enrichment (Fe) value, measured at 100 intervals taken from 0 to the maximum Fe value (the maximum found from all algorithms studied), is estimated as follows: where N is the total number of genes in the network, Cn is the number of genes in the community, F is the total number of functional annotated genes in the network, and μ fc is the number of functional annotated genes per community.
For each algorithm, we measure the difference in fraction of enriched communities between log2(Fe) cut-off values of 0.5 (which translates to interval point 7 out of 100) and 4.8 (which is the interval point 54 out of 100).
Finally, we test how well each distribution is fitted to a generalized sigmoid function: where a gives the sigmoid's lower asymptote, b the sigmoid's maximum value, c the sigmoid's rate of change, and d the × value of the sigmoid's midpoint.An ideal sigmoid for our case would occur when a = 0, b = 1, c = -2, and d = 3.Of the four parameters, it is the rate of change of the sigmoid curve that is of interest to us, as lower and upper asymptote are defined by nature of the fraction value as 0 and 1.To test how well each distribution (in Fig. 7) fit to a set of five 'idealized' sigmoid functions, with the other parameters fixed and c set to [-10, -5, 2, -1, -0.5], we apply the twosample Kolmogorov-Smirnov (KS) test to the goodness of fit of each distribution to our set of five idealized sigmoid curves.

Figure 3
Figure3Global entropy (SR).(A) SR for the postsynaptic network under study plotted against the log of the protein's degree.The horizontal dashed line corresponds to the entropy rate of the unperturbed graph (SR 0 = 0.698).We can see that entropy for both up-and down-perturbed lowdegree vertices stays close to that value, whereas the hubs (high-degree vertices) deviate from it significantly, which makes for a bimodal response between gene overexpression and degree and opposing biphasic response relative to over-/underexpression between global entropy rate and degree, observed only in networks with scale-free or approximate scale-free topology.(B) SR values for a randomized G(n,p) Erdős-Rényi network, perturbed in a similar way.

Figure 4
Figure 4 Louvain clustering visualized with BioNAR.(A) Results of Louvain clustering.(B) Results of re-clustering for Louvain algorithms, in which several of the larger clusters from A are now split to smaller clusters.

Figure 5
Figure 5 Fold enrichment.(A) A schematic representation of possible scenarios of Fe distribution.The blue curve corresponds to the case in which most cluster-term pairs have small Fe values, either because large clusters cover large part of the network or because only a tiny fraction of the vertices are annotated.The red curve corresponds to the case in which the algorithm produces a lot of small clusters, so that the majority of cluster-term pairs have very large Fe values.To avoid both extreme scenarios, we try to find clustering algorithm with Fe distribution close to the black curve.(B) Plot (type p3) of the fraction of enriched communities against the Fe for each of the nine algorithm clustering results obtained for the postsynaptic network.Each color corresponds to the specific algorithm.

Figure 6
Figure 6 Results of fitting to five generalized sigmoid functions.Each panel, (A)-(I), illustrates a different clustering method.In each, four gray sigmoid curves correspond to four reference sigmoid curves with rate parameter c (see Technical Appendix) equal to [−10, −5, −1, −0.5], whereas the solid black sigmoid defines the ideal/desired behavior with rate equal to −2.The open circles and red line correspond to the actual fit, while the dashed blue line indicates the 95% confidence interval.Panels (C) and (E) show that the fc and Louvain algorithms give the closest fit to the ideal, with KS values of 1E-01 and 9e-02.McLean et al.

Figure 8
Figure 8 Plot of bridgeness (B) against the semilocal centrality (SL) for Louvain clustering results.Highlighted are the proteins with known function "protein cluster."Of the highlighted proteins, only a few belong to region 1 and have global impact on the network (PICK1, CASK, PASCIN1, MPP3).The rest belong to region 3 and largely influence their own communities.

Table 1
Summary of the Clustering Algorithms Used

Table 2
First Six Rows of 'CAN' Table

Table 3
Results of Ranking of the Algorithms Based on the Proportion of Enriched Communities The table is obtained from the View(plots$ranktable) command.

Table 4 .
All functions in BioNAR have built-in help, accessible by typing ?function_name.In

Table 4
Sources of and Solutions to Potential Errors addition, documentation exists at http:// www.qiime.org,and the help forum at http:// forum.qiime.org is typically quite active and useful.