Keywords:

  • biogeography;
  • biological communities;
  • graph theory;
  • microbial ecology;
  • network analysis;
  • population genetics

Abstract

The recent application of graph-based network theory to biogeography, community ecology and population genetics has created a need for user-friendly software that would allow wider access to, and adaptation of, these methods. EDENetworks aims to fill this void by providing an easy-to-use interface for the whole analysis pipeline of ecological and evolutionary networks, starting from matrices of species distributions, genotypes, bacterial OTUs or genetically characterized populations. The user can choose between several ecological distance metrics, such as Bray-Curtis or Sørensen distance, or population genetic metrics, such as FST or Goldstein distances, to turn the raw data into a distance/dissimilarity matrix. This matrix is then transformed into a network by manual or automatic thresholding based on percolation theory, or by building the minimum spanning tree. The networks can be visualized along with auxiliary data and analysed with various metrics such as degree, clustering coefficient, assortativity and betweenness centrality. The statistical significance of the results can be estimated either by resampling the original biological data or by null models based on permutations of the data.


Introduction

Network analysis based on graph theory has turned out to be an invaluable tool for exploring the structure of many complex systems in a diverse range of fields, from sociology to economics and from physics to cell biology (Alon 2003; Ueda et al. 2004; Newman 2010). In this approach, it is assumed that most of the complexity of the system can be captured by the topology of a network formed by a set of nodes (or agents) that are connected to each other by links (see Table 1 for a glossary of terms). In addition to domain-specific characteristics, most networks display universal features, such as clustered and modular structures (Lancichinetti et al. 2010), the small-world property (Watts & Strogatz 1998) and broad connectivity distributions (Barabási & Albert 1999). The topology of a network can be crucial for the function and robustness of the system (Albert et al. 2000) and for the dynamics taking place on top of it (Barrat et al. 2008). Network topology can also suggest possible mechanisms by which the network has evolved to its current state (Dorogovtsev & Mendes 2003).

Table 1. Glossary of terms used to describe network topology

Network: Set of nodes (or vertices) connected by links (or edges)
Weighted network: A network where a weight is associated with each link. The weights can represent genetic similarity
Neighbour of a node: A node connected to the focal node
Degree: The number of links connected to a node, that is, the number of neighbours
Path: A sequence of adjacent links
Shortest path: The path between two nodes that requires traversing the smallest number of links
Component: A set of nodes where paths exist between all nodes
Clustering coefficient: The ratio between existing and possible links between a node's neighbours, c_i = 2e_i/[k_i(k_i − 1)], where e_i = the number of links between neighbours of node i and k_i = degree of i
Assortativity: The tendency of high-degree nodes to connect to other high-degree nodes can be measured by calculating the Pearson correlation coefficient between degrees of connected nodes
Betweenness centrality: A measure of the importance of a node (or link) in connecting other nodes through shortest paths. Formally, the fraction of all shortest paths going through a node (or link)
Thresholding: Removing links with weights below a given threshold from a weighted network, so that only the most important links are retained
Percolation threshold: The critical fraction of links that needs to be removed to break the network into disconnected components. Often, the composition of these disconnected components is informative
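
As a minimal illustration of two of the glossary entries, the sketch below computes the degree and the clustering coefficient c_i = 2e_i/[k_i(k_i − 1)] on a toy network. The adjacency dictionary, the node names and the function names are hypothetical and are not part of EDENetworks; reporting 0 for nodes with fewer than two neighbours is a convention assumed here.

```python
from itertools import combinations


def degrees(adj):
    """Degree k_i of every node: the number of its neighbours."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def clustering(adj, node):
    """Local clustering coefficient c_i = 2 e_i / [k_i (k_i - 1)]."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: undefined for k < 2, reported as 0
    # e_i: number of links that exist among the neighbours of the node
    e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * e / (k * (k - 1))


# Toy undirected network (hypothetical): a triangle A-B-C plus node D linked to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
deg = degrees(adj)
cc = {node: clustering(adj, node) for node in adj}
```

In this toy network, nodes A and B have c = 1 because their two neighbours are linked to each other, whereas C has c = 1/3 (one link among its three neighbours).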

Graph-based network theory also provides a promising approach for studies in ecology and evolution (Bascompte et al. 2003; Proulx et al. 2005; Hernández-García et al. 2007). Population genetic data (multilocus genotypes) can be turned into networks among individuals (Hernandez-Garcia et al. 2006; Rozenfeld et al. 2007; Becheler et al. 2010) or among predetermined populations (Rozenfeld et al. 2008; Fortuna et al. 2009) by considering each pair of individuals or populations connected if they are genetically similar enough. At both these levels, network theory has proven to be a useful method to analyse population genetic data and to help unravel ecological and evolutionary processes acting at local and regional scales. It has recently been proposed that the structure of ecological networks may illustrate relationships between communities based on ecological dissimilarities of their taxonomic composition (species composition or presence/absence), and that the analysis of the topology of such networks may be used to define biogeographic provinces and to reconstruct their history of divergence (Dos Santos et al. 2008; Moalic et al. 2012). In addition, the network approach was recently tested successfully to illustrate and analyse the relatedness and clustering of both eukaryotic MOTUs (molecular operational taxonomic units) and microbial OTUs forming communities characterized by new-generation sequencing technologies (data reanalysed from Aires et al. 2013).

Network-based methods of data exploration are free of many of the 'a priori' assumptions, such as geographic clustering (genetic similarity of spatially close populations), that usually underlie the population genetic interpretation of molecular data, as well as some ecological data analyses. In addition, the tools and indices developed in the framework of network theory make it possible to unravel unique properties, such as the importance of each agent (individual, population or community) in population, metapopulation or biogeographic systems (Rozenfeld et al. 2008; Becheler et al. 2010; Moalic et al. 2012). Finally, networks offer a natural way to graphically present inherently multidimensional data such as genetic relationships. This is an advantage over the classical methods based on phylograms, or trees, in cases where some of their assumptions, such as binary branching or absence of loops, are known to be violated due to the reticulate nature of relationships. To this extent, the rationale behind this kind of network analysis converges with the objectives of sequence, or haplotype, networks (Posada & Crandall 2001; Huson & Bryant 2006), which were proposed to illustrate gene evolution while taking into account the uncertainties in mutational pathways or the possibility of sporadic reticulate events (such as recombination, lateral transfer or hybridization). In contrast, the methods proposed here are adapted from network analysis developed in the framework of graph theory and aim at unravelling the history and dynamics of naturally reticulated systems of interconnected communities, populations or individuals through the analysis of species distribution or population genetic data.

Until now, most network analysis for biogeography and population genetics has been performed using ad hoc scripts and separate network visualization tools such as Pajek (Batagelj & Mrvar 2002). Here, we introduce EDENetworks, a user-friendly software package that makes network analysis and visualization accessible to a wide range of researchers. EDENetworks has been developed for constructing and analysing ecological and evolutionary networks starting from genetic or ecological data. It implements a straightforward pipeline that standardizes network construction and analysis in this context (Fig. 1). This should facilitate a more widespread use of network methods in the community of ecologists and population geneticists, as well as provide tools for constructing ecological and evolutionary networks for network scientists. Further, for assessing the statistical significance of findings, EDENetworks provides a way for constructing randomized reference ensembles of networks that are based on biologically motivated null models. In the null models, randomization takes place at the level of source data, by randomly shuffling either alleles among individuals or samples among populations. This is in contrast to the purely structural null models commonly used by network scientists (e.g. the configuration model for randomly rewiring networks; Newman et al. 2001) that do not correspond to any biologically motivated null hypothesis, because the randomization takes place only after the source data have been processed into a network representation. Finally, although EDENetworks already includes a wide variety of analysis methods, the user can choose to export the networks and to analyse them with some general-purpose network analysis tools such as Gephi (Bastian et al. 2009), Cytoscape (Smoot et al. 2011) or igraph (Csardi & Nepusz 2006).
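
The sample-shuffling null model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EDENetworks implementation: the function name, the toy single-locus genotypes and the choice to keep population sizes fixed are all hypothetical.

```python
import random


def shuffle_samples(populations, rng):
    """Randomly reassign individuals to populations, keeping population sizes.

    `populations` maps a population name to a list of individuals
    (here, hypothetical single-locus genotypes given as allele pairs).
    """
    pooled = [ind for members in populations.values() for ind in members]
    rng.shuffle(pooled)
    shuffled, start = {}, 0
    for name, members in populations.items():
        shuffled[name] = pooled[start:start + len(members)]
        start += len(members)
    return shuffled


# Toy data (made up for illustration only).
populations = {
    "pop1": [(120, 124), (120, 120), (124, 128)],
    "pop2": [(130, 130), (128, 130)],
}
rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Each replicate would then be passed through the same distance and
# thresholding steps as the observed data to yield one reference network.
reference = [shuffle_samples(populations, rng) for _ in range(100)]
```

Because the shuffling happens on the source data, every reference network is built by exactly the same procedure as the observed one, which is the property that distinguishes this kind of null model from rewiring an already constructed network.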

Figure 1. Schematic overview of the workflow.

Data input and export formats

EDENetworks can handle a wide range of data types in simple delimited text file formats as described in the manual. Ecological distance networks between communities can be constructed from data matrices of the presence/absence or abundance of species, eukaryotic MOTUs or microbial OTUs in the characterized communities. Genetic distance networks between individuals and populations can be constructed from data matrices of genotypes of individuals. In addition, EDENetworks can read precomputed distance matrices (e.g. when the user wants to experiment with a distance metric that is not available in EDENetworks) or files that directly contain network structures (e.g. in Graph Markup Language). The user can also provide an input file containing auxiliary data for the nodes, which can contain, for example, individual or community labels, geographic locations or custom colour codes that can be used for network visualization.

All results of network analysis can be exported as image files or text files that can be read with any standard spreadsheet or text-processing software. The networks themselves can be saved in standard file formats or visualized and saved as images in vector or raster formats. For further visualization with external software packages such as Gephi, the layout coordinates used in network visualization can be saved in a text file.

Analysis

The analysis pipeline in EDENetworks is shown in Fig. 1 for various data types and is described in more detail below:

  1. Data input and distance matrix construction: The user provides an input file and chooses the type of data it contains. For some data types, the exact format of the data can be inferred automatically (e.g. whether the distance matrix is upper or lower triangular). The distance/dissimilarity metric is chosen from a list of distances appropriate for the input data. An auxiliary node data file can also be given if desired.
  2. Analyse distance data and derive networks: The distance matrix constructed from the input data can be thresholded manually or automatically at the identified percolation threshold to produce a network. Alternatively, the distance matrix can be used to construct a minimum spanning tree. The genetic data can be randomized by resampling or through sample/allele shuffling to produce any number of reference networks. This procedure allows testing for the significance of various network statistics.
  3. Network analysis: Some summary statistics of the network, such as the number of nodes, links and components, the average degree (the degree of a node is the number of connections it has) and the average clustering coefficient (Watts & Strogatz 1998), are produced automatically. A number of network and node properties can be extracted from the network, including the degree distribution, link weight/distance distribution, clustering coefficient as a function of degree and average neighbour degree as a function of degree. If the last function is increasing, nodes of high degree tend to connect to other nodes of high degree and the network is assortative (see, for example, Newman 2010 for an introduction to the basic topological properties of networks).
  4. Network visualization: The software automatically generates a visualization of the network, optimized for clarity. The resulting network visualization can be customized using an interactive user interface. Node properties such as betweenness centrality (Freeman 1977) or any user-supplied auxiliary attributes can be used to colour the nodes, to label them or to change their size. The network visualization can be saved as an image file (Fig. 2).
    Figure 2. Examples of genetic networks of seagrass (Posidonia oceanica). Network nodes represent populations as defined by sampling sites, and links represent genetic distances. The figure is produced by EDENetworks from a genotype matrix by applying an automatic thresholding algorithm, using data analysed by Rozenfeld et al. (2008). Node colours (yellow for Western, red for Central and blue for Eastern Mediterranean) and sizes represent geographical divisions and betweenness centrality values, respectively.
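
Steps 1 and 2 of the pipeline can be sketched in a few lines for abundance data. The sketch assumes Bray-Curtis dissimilarity and estimates the percolation threshold with one common finite-size heuristic: the distance threshold at which the mean size of the components other than the largest one peaks. Whether EDENetworks uses exactly this criterion is not stated here, so it should be read as an assumption; all function names and the toy site data are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(x + y for x, y in zip(a, b))
    if total == 0:
        return 0.0  # assumed convention for two empty samples
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(samples):
    """Pairwise distances as {(u, v): d}, with u before v in sorted order."""
    names = sorted(samples)
    return {(u, v): bray_curtis(samples[u], samples[v])
            for i, u in enumerate(names) for v in names[i + 1:]}


def threshold_network(dist, threshold):
    """Keep only the links whose distance does not exceed the threshold."""
    return [pair for pair, d in dist.items() if d <= threshold]


def component_sizes(nodes, links):
    """Sizes of the connected components, largest first (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in links:
        parent[find(u)] = find(v)
    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def percolation_threshold(nodes, dist):
    """Threshold at which the mean size of non-largest components peaks."""
    best_t, best_s = None, -1.0
    for t in sorted(set(dist.values())):
        rest = component_sizes(nodes, threshold_network(dist, t))[1:]
        s = sum(x * x for x in rest) / sum(rest) if rest else 0.0
        if s > best_s:
            best_t, best_s = t, s
    return best_t


# Toy abundance data (made up): four sites, four species.
samples = {
    "site1": [10, 8, 0, 0],
    "site2": [9, 9, 1, 0],
    "site3": [0, 1, 9, 10],
    "site4": [0, 0, 8, 12],
}
dist = distance_matrix(samples)
t = percolation_threshold(list(samples), dist)
links = threshold_network(dist, t)
```

For this toy data set, the scan selects the threshold 0.1, at which the network consists of two components of two sites each (site1 with site2, and site3 with site4); at the next larger distance, the components merge into one.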

Examples of the use of such methods include the definition of biogeographic provinces based on biodiversity inventories in hydrothermal vents (Moalic et al. 2012), the test of the hypothesis of ancestral polymorphism vs. present-day hybridization to explain shared genetic polymorphism between two closely related species (Moalic et al. 2011), and the analysis of the geographic pattern of genetic differentiation and connectivity among populations (Becheler et al. 2010), including the identification of putative source areas and pathways (Cowart et al. 2013; Rozenfeld et al. 2008). Some of the hypotheses that can be tested with these methods can be illustrated with the genetic network analysis of Posidonia oceanica meadows (Fig. 2) in the Mediterranean, based on microsatellite polymorphism (Rozenfeld et al. 2008). The matrix of microsatellite genotypes, processed through step (1) together with a set of n randomizations, yields pairwise differences between populations, which are used in step (2) to build the network (Fig. 2) and to compare its properties at the percolation threshold with their distributions obtained by randomization. The occurrence of two clusters of populations, in the Eastern and Western Mediterranean, is supported by a clustering value that departs from the range obtained through randomization; this allows rejecting the hypothesis of a lack of hierarchical differentiation at the scale of the Mediterranean and supports the existence of at least two clusters of populations. The high and significant betweenness centrality (Fig. 2) of meadows located in the Siculo-Tunisian Strait permits rejecting the hypothesis that all populations play an equivalent role in gene flow across the system, and shows that populations located in the Strait contribute more to facilitating or allowing connectivity across the Mediterranean.
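
The comparison of an observed network statistic with its distribution under randomization can be expressed as an empirical p-value. The sketch below is one conventional way to do this, not necessarily the calculation performed by EDENetworks; the function name is hypothetical, the numbers are made up for illustration, and the "+1" correction (which counts the observed value as one member of the null ensemble) is a choice assumed here.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical p-value: share of null values >= observed.

    The +1 in numerator and denominator counts the observed value itself
    as part of the ensemble, so the estimate is never exactly zero.
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Made-up numbers: an observed average clustering coefficient and the
# values obtained from four randomized reference networks.
observed_clustering = 0.8
null_clustering = [0.1, 0.2, 0.3, 0.9]
p = empirical_p_value(observed_clustering, null_clustering)
```

With only four reference networks the smallest attainable p-value is 1/5, which is why a large number of randomizations is needed before a departure from the null model can be called significant.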

Comparison to other software packages

Whereas some functions of EDENetworks have also been implemented in other software packages, it is at the moment the only software containing the entire pipeline from computation of distance matrices to network analysis and visualization and to statistical significance testing. For pure network visualization, the most widely used programs are Pajek, Gephi and Cytoscape; these also allow for computation of some network characteristics either directly or via plugins. For network analysis by command-line scripting (e.g. in R or Python), there are a number of options, such as NetworkX and igraph, which, however, require considerable programming expertise from their users. Additionally, interesting explorations may be envisaged using network analysis in conjunction with methods based on circuit theory to predict gene flow, such as Circuitscape (McRae 2006).

As mentioned above, a major difference between EDENetworks and existing packages is that it implements the whole analysis pipeline, eliminating the need to use multiple software packages and to transfer files between them. Additionally, instead of attempting to be a general-purpose tool for any network analysis and visualization, EDENetworks focuses on the functions required for analysing ecological networks, while allowing network data to be exported for further analysis, for example in igraph. Note that because EDENetworks has a GUI, no scripting or programming is required. Further, the analysis pipeline of EDENetworks has elements that are not covered by any existing software package. First, EDENetworks computes distance matrices and networks directly from raw molecular (genotypes or SNP presence/absence) and ecological (abundance or presence/absence) data, using appropriate metrics computed internally by the program itself. Second, because of this, EDENetworks can use random permutations and jackknifing of raw data for null hypothesis testing and inference of the statistical significance of network parameters (clustering, betweenness centrality). Besides these specific features, EDENetworks has built-in methods for thresholding distance matrices to networks and for computing spanning trees that do not require network theory expertise from the user. As discussed in the next section, and in detail in the manual, all computations carried out by EDENetworks for typical data sets are reasonably fast. Further, benchmarking presented in the manual shows that the computation times for the most important procedures scale optimally with the data size.
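
For readers unfamiliar with spanning trees, the sketch below builds a minimum spanning tree from a distance matrix with Kruskal's algorithm. It is a generic textbook construction given for illustration; which algorithm EDENetworks uses internally is not stated here, and the function name and toy distances are hypothetical.

```python
def minimum_spanning_tree(nodes, dist):
    """Kruskal's algorithm on a distance dictionary {(u, v): d}.

    Links are considered in order of increasing distance and kept only
    if they join two components that are not yet connected.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for (u, v), d in sorted(dist.items(), key=lambda item: item[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, d))
    return tree


# Toy distance matrix (made up for illustration).
dist = {
    ("a", "b"): 0.2, ("a", "c"): 0.5, ("a", "d"): 0.9,
    ("b", "c"): 0.3, ("b", "d"): 0.8, ("c", "d"): 0.4,
}
tree = minimum_spanning_tree(["a", "b", "c", "d"], dist)
```

The resulting tree connects all four nodes with three links (a-b, b-c and c-d), the smallest total distance that keeps the network in one component.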

Example data sets

A number of example data sets are distributed with the program. Their earlier analysis and interpretation are detailed in the references listed here. The distance-thresholding method has been used to address biogeography of communities, species hybridization (Moalic et al. 2011) and gene flow among populations (Rozenfeld et al. 2008), and also to study individual relatedness networks (Rozenfeld et al. 2007; Becheler et al. 2010; Moalic et al. 2011). All this research was performed with algorithms and methods similar to those implemented in EDENetworks, which was later used successfully to repeat all the relevant analyses in these articles. In addition, one data set on microbial diversity containing about 40 samples encompassing about 30 000 OTUs was successfully analysed (from Aires et al. 2013). The EDENetworks manual includes a detailed tutorial on the implemented genetic distances, guidelines on the flow of analysis with examples, and warnings about the interpretation of results.

Finally, detailed benchmarking is available in the manual, showing that the computation times for a typical data set are unnoticeable (milliseconds) for most operations and reasonable (a few seconds) for more demanding procedures such as distance matrix generation and permutations. As an example, one run of the population-level pipeline, similar to the one presented in Rozenfeld et al. (2008), takes only 50 ms.

Requirements

EDENetworks is freely available at http://www.becs.hut.fi/edenetworks/ with binaries for Windows and Linux systems together with full documentation and example input files. The Windows version comes with an installer and the Linux version is distributed as a .deb package. Source code is available at https://github.com/bolozna/EDENetworks. EDENetworks is an open-source program (licensed under GPL2) written entirely in Python, and as such, it can be installed on many other systems as long as Python and the third-party libraries it uses (Numpy, Matplotlib and Himmeli) are available.

Acknowledgements

We wish to thank all the participants of the EDEN project for great discussions and interesting suggestions. We would like to thank Frédérique Viard for her help on the beta version of the software. MK acknowledges that his contribution was mainly carried out when he was working at Aalto University.

Funding

This work was supported by the EDEN project [043251], funded by the European Commission through the NEST-PATHFINDER Call on 'Tackling Complexity in Science' of the Sixth Framework Program, and by the ANR project Clonix.

References

M.K. and J.S. wrote the code. S.A-H. and M.K. planned the analysis to be performed by the software. S.A-H. wrote the manual with the help of M.K.; J.S. edited the manual. M.K., S.A-H. and J.S. wrote the MS.