Volume 26, Issue 11
Invited Reviews and Syntheses

Application of network methods for understanding evolutionary dynamics in discrete habitats

Gili Greenbaum

Department of Solar Energy and Environmental Physics and Mitrani Department of Desert Ecology, The Jacob Blaustein Institutes for Desert Research, Ben‐Gurion University of the Negev, Midreshet Ben‐Gurion, 84990 Israel

Correspondence: Gili Greenbaum, Fax: +972 8 6596985; E‐mail: gili.greenbaum@gmail.com

Nina H. Fefferman

Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, 37996 TN, USA

First published: 16 February 2017

Abstract

In populations occupying discrete habitat patches, gene flow between habitat patches may form an intricate population structure. In such structures, the evolutionary dynamics resulting from interaction of gene‐flow patterns with other evolutionary forces may be exceedingly complex. Several models describing gene flow between discrete habitat patches have been presented in the population‐genetics literature; however, these models have usually addressed relatively simple settings of habitable patches and have stopped short of providing general methodologies for addressing nontrivial gene‐flow patterns. In recent decades, network theory – a branch of discrete mathematics concerned with complex interactions between discrete elements – has been applied to address several problems in population genetics by modelling gene flow between habitat patches using networks. Here, we present the idea and concepts of modelling complex gene flows in discrete habitats using networks. Our goal is to raise awareness of existing network theory applications in molecular ecology studies, as well as to outline the current and potential contribution of network methods to the understanding of evolutionary dynamics in discrete habitats. We review the main branches of network theory that have been, or that we believe potentially could be, applied to population genetics and molecular ecology research. We address applications to theoretical modelling and to empirical population‐genetic studies, and we highlight future directions for extending the integration of network science with molecular ecology.

Introduction

Many organisms in nature inhabit only discrete habitable patches within a continuous spatial matrix. This is mostly a result of physiological, behavioural and ecological constraints of the organism in question, and often also due to human‐induced fragmentation processes. In such population structures, gene flow, selection and genetic drift interact to affect important evolutionary processes such as local and global adaptation (Kawecki & Ebert 2004), migration load (García‐Ramos & Kirkpatrick 1997; Bolnick & Nosil 2007), gene swamping (Lenormand 2002) and genetic diversity loss (Templeton 2006; Allendorf et al. 2012). The evolutionary consequences of such dynamics in discrete population structures have been of great interest in the population‐genetic literature, and understanding the effect of gene flow between habitable patches has been the focus of many modelling efforts.

The classic continent–island model (Haldane 1930; Wright 1931) describes simple source‐sink dynamics, while the full‐island model (Wright 1931; Levene 1953) assumes discrete habitat patches where gene flow occurs simultaneously and equally between all patches (Fig. 1A). A more explicit spatial element was introduced by Kimura & Weiss (1964) with the stepping‐stone models, where patches are ordered on a one‐dimensional chain (Fig. 1B) or a two‐dimensional lattice (Fig. 1C), with gene flow occurring between adjacent patches. The stepping‐stone models can also include a long‐distance migration component (Kimura & Weiss 1964) with a designated migration parameter, usually much smaller than the migration rate between adjacent patches, describing migration between patches regardless of their spatial position, equally for all patches (Fig. 1B and C). The metapopulation dynamics framework has also been applied to study evolutionary dynamics in discrete habitats in a spatially explicit context, while relaxing the equidistance assumption of the stepping‐stone model (Heino & Hanski 2001; Hanski & Heino 2003; Hanski et al. 2011). These spatially realistic metapopulation models (SRMM) allow for arbitrary positioning of patches in space, with gene flow occurring at rates that decrease with distance (Fig. 1D). Although simple in their characterization of spatial relations between patches, these and similar models have proven extremely useful for evolutionary theory.

Fig. 1. Schematic depiction of discrete‐habitat gene‐flow models. Blue nodes represent habitat patches; black lines represent migration corridors, and line thickness represents gene‐flow rate. In B and C, the red nodes represent the point at infinity, through which long‐distance migration travels, and the grey lines represent the long‐distance migration rates. In A, gene‐flow rates are identical between all patches; in B and C, gene‐flow occurs between adjacent patches and long‐distance gene‐flow occurs through the point at infinity; in D, patches may be positioned arbitrarily in two‐dimensional space but gene‐flow is constrained to follow distance‐decreasing functions; in E, topology is unconstrained. Models A, B, C and D are special cases of the general network model, E.

However, interactions between patches in real populations are often nontrivial and cannot be simplified to such an extent, as in discrete habitats and fragmented landscapes migration often forms a complex gene‐flow pattern which could be uncorrelated, to some extent, with geographic distance. Can such complexities be incorporated in our models to better explain and predict evolutionary processes? Can we develop new methodologies for studying populations with complex gene‐flow patterns? Are there evolutionary phenomena that we are failing to explain by relying on simplistic models? While these questions pose difficult challenges, the extensive mathematical discipline of network theory may provide an appropriate framework for addressing some of these issues.

Network theory is a relatively new discipline which has seen application in diverse fields, such as sociology (Easley & Kleinberg 2010), ecology (Proulx et al. 2005; Bascompte 2007; Greenbaum et al. 2015), real estate markets (Seiler et al. 2014), epidemiology (Keeling & Eames 2005; Fefferman & Ng 2007), criminology (Calvo‐Armengol & Zenou 2004), animal behaviour (Croft et al. 2008; Hock & Fefferman 2011), evolutionary biology (Pickrell & Pritchard 2012; Greening & Fefferman 2014) and many others. This discipline, a branch of discrete mathematics, studies properties of mathematical constructs composed of discrete elements and connections of various types between them. The network framework allows exploration of complex topologies of interactions between discrete elements; methods and concepts for studying such topologies are continuously being developed (Wasserman & Faust 1994; Matousek & Nesetril 1998; Barabási & Albert 1999; Carrington et al. 2005; Hanneman & Riddle 2005; Bornholdt & Heinz 2006; Newman 2010; Sterbenz et al. 2011; Boccaletti et al. 2014; Lordan et al. 2014).

General migration‐matrix models have been suggested in the population‐genetics literature (Bodmer & Cavalli‐Sforza 1968), and recently population geneticists and ecologists have started to realize that networks, where nodes represent discrete habitable patches and edges represent gene flows, may be used to account for contextually responsive gene‐flow patterns between patches (Smouse 2000; Dyer & Nason 2004; Rozenfeld et al. 2008; Wagner & Fortin 2012; Dyer 2015). The previously described discrete models – such as the island model, stepping‐stone models and SRMM – are particular cases of networks, but their interpretation may be constrained by the specific topology they consider. It is when gene‐flow patterns cannot be assumed to be simple or ‘spatially realistic’ enough to be embedded into low‐dimensional spaces (see the Theoretical considerations in network modelling of discrete habitats section below) that network modelling is expected to be particularly important and provide novel insights. Network modelling can, in some sense, release population genetics from the necessity to devise specific models for specific topologies and allow us to move towards a more general framework.

In this paper, we will review the main branches of network theory that have been applied, or that we believe could potentially be applied, to population genetics and molecular ecology. The goal is to provide an overview of the implementation of these methodologies to population genetics in discrete habitats and to highlight the directions where network theory can further contribute to this research. We focus both on theoretical modelling and applied tools for empirical system‐specific analyses.

Discrete habitats as networks

Network science is concerned with the study of mathematical constructs called networks or graphs. Networks are composed of discrete elements, called nodes, and interactions or connections between these elements, called edges. It is often convenient to describe this construct using an adjacency matrix, where each matrix element describes the interaction between two nodes. Gene‐flow patterns in discrete patchy habitats can therefore be naturally described by a network with habitat patches as nodes and migration corridors or gene‐flow rates between the patches as edges (Fig. 2).
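
As a minimal sketch of this representation (an illustration, not taken from the cited literature), two of the classic models can be written as weighted adjacency matrices, that is, as special cases of a general patch network. The function names, the migration rate m and the way m is split among neighbours are assumptions made for this example only.

```python
def island_model(n, m):
    """Full-island model: every pair of distinct patches exchanges genes at
    the same rate. Assumption: the total rate m is split over n - 1 patches."""
    rate = m / (n - 1)
    return [[0.0 if i == j else rate for j in range(n)] for i in range(n)]


def stepping_stone_1d(n, m):
    """1D stepping-stone model without long-distance migration: only
    neighbouring patches on the chain are connected (assumed rate m / 2)."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = m / 2
    return adj


def node_degree(adj, i):
    """Weighted degree of node i: the sum of the weights of its edges."""
    return sum(adj[i])


island = island_model(4, 0.3)      # all off-diagonal entries equal
chain = stepping_stone_1d(4, 0.3)  # nonzero entries only next to the diagonal
```

Any other gene‐flow pattern is obtained by filling the matrix with arbitrary nonnegative values, which is what makes the adjacency matrix a convenient common format for all of the models in Fig. 1.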

Fig. 2. Network description of a discrete habitat. (A) Discrete‐habitat patches (green) in an uninhabitable but traversable matrix (yellow). Both the habitats and matrix are heterogeneous, shown by different colour shades. (B) Network representation of the population, with nodes as habitat patches and edges as migration rates/corridors. Selection pressures in the different patches are shown as colour shades, and gene‐flow rates are shown as edge thickness.

Theoretical considerations in network modelling of discrete habitats

Classic discrete models, such as the island and stepping‐stone models, have looked into simple interaction patterns between discrete habitat patches, and SRMM have looked into distance‐dependent interactions. Although these models address habitats as discrete elements, they are mostly studied using continuous mathematical methods (e.g. Kimura & Weiss 1964; Hanski et al. 2011). The use of continuous methods to analyse discrete models can be justified when the underlying structure of the models can be topologically embedded into simple, continuous spaces (Note: an embedding of a space into another space is akin to positioning one space into another while preserving the relations between the different points in the original space). For example, the 1D stepping‐stone model without long‐distance migration, while being a discrete model, can be embedded into the continuous real line by identifying each patch (and therefore its location) with an integer, while preserving the order of the patches; this is an embedding of a discrete space into a continuous space. A similar embedding can be done with the 2D stepping‐stone model without long‐distance migration into the Euclidean plane. However, the 1D stepping‐stone model with long‐distance migration cannot be embedded into the real line as this model contains a node which is equally distant to every other node (the red node in Fig. 1B), while no point in the continuous real line shares this property. This discrete topology can still be embedded into a different simple continuous space, namely the real line with a point at infinity added to it (in mathematical topology, this is the ‘one‐point compactification’ of the real line; Alexandroff 1924). In a similar manner, the 2D stepping‐stone model with long‐distance migration cannot be embedded into the Euclidean plane, but can be embedded into the Euclidean plane with a point at infinity. While the SRMM offer much more flexibility in the relative positioning of patches, the assumption that interpatch gene flows follow a distance‐decreasing function precisely means that these models can be embedded into a two‐dimensional Euclidean space; this is the ‘spatially realistic’ element of these models.

However, for many natural populations occupying discrete habitats, geographic distance is only one of several, often more significant, factors that shape gene‐flow patterns; therefore, complex gene flow should be expected to be relatively common in nature. The topologies of the stepping‐stone models and the SRMM can be described by networks (Fig. 1B–D) but, in general, networks (such as in Fig. 1E) cannot be embedded into simple continuous spaces. When such embeddings are not possible, it is crucial to account explicitly for spatial discreteness, as relying on continuous methods to describe spatial interactions may result in erroneous conclusions (Durrett & Levin 1994; Bascompte & Sole 1995; Shnerb et al. 2000). Therefore, it is becoming increasingly clear that we need to model discrete habitats with discrete methods (Urban et al. 2001, 2009; Butts 2009; Pocock et al. 2012; Cavanaugh et al. 2014). By framing evolutionary dynamics of discrete habitats in network theory terminology, both theoretical questions and empirical studies may benefit from new perspectives and existing tools.

Constructing habitat patch networks from empirical data

In network science, practical tools, concepts and methods have been developed to analyse complex empirical networks, for example linking patterns of neural processing across systems in the brain (Kinnison et al. 2012), or finding vulnerabilities that can lead to blackouts from failures in the power grid (Albert et al. 2004). The application of network theory to molecular ecology has, thus far, been predominantly to analyse empirical data, particularly in the context of landscape genetics (Dyer 2015; see also Table S1, Supporting information).

For such analyses, the first step is always to define the nodes and edges. Delineating nodes is often straightforward, as it should be congruent with the ecological definition of the habitat patches. Edges represent gene‐flow rates, and are more difficult to derive from empirical data. While edges could be approximated from nongenetic data, for example by combining mark–recapture methods, movement data, behavioural observations, life histories and landscape analyses to deduce migration and reproduction rates, gene‐flow rates are most often inferred from interpopulation genetic‐distance measures. Several measures have already been employed to construct networks, depending on the type of genetic data available (see Table S1, Supporting information): FST (Weir & Cockerham 1984), nucleotide distance (Tajima 1983), Goldstein distance (Goldstein et al. 1995), Jensen‐Shannon divergence (Masucci et al. 2012) and the squared Euclidean distance between centroids as used in AMOVA (Dyer & Nason 2004). Outputs of analyses such as principal component analysis (PCA) or ADMIXTURE (Alexander et al. 2009) have also been used to construct habitat patch networks (Paschou et al. 2014).

These procedures usually result in very dense pairwise matrices; that is, almost every pairwise relation between habitat patches is characterized by a nonzero value. To derive sparser, more workable networks, edge‐inclusion criteria (or ‘edge pruning’) are typically used, retaining only the stronger or more informative relations. While the problem of how to formulate such criteria is still being studied in general network science (Serrano et al. 2009; Radicchi et al. 2011; Dianati 2016), several criteria have been suggested and applied in the population‐genetic literature: conditional independence, where edges are removed if they represent partial correlation coefficients small enough to imply that the other edges in the network are sufficient to explain the total genetic covariance (implemented in the software package Popgraph; Dyer & Nason 2004); the percolation threshold, that is, the threshold below which the network becomes disconnected (Rozenfeld et al. 2008); and the threshold resulting in the network with the highest modularity value (see Communities in networks section below; Kininmonth et al. 2010). Of these, conditional independence is arguably the most appropriate for most applications, as the modularity criterion is suitable only for questions regarding the modular structure of the network, and the percolation threshold is relevant only if the question relates to the potential of an allele present in one network component to reach another component (see percolation and diffusion section below). However, as edge inclusion may affect inferences made from network analyses, particularly the scale at which structure is maintained in the network after edges are removed (Serrano et al. 2009), a more conservative approach would perhaps be to test several edge‐inclusion thresholds or criteria, and to develop methods to synthesize and interpret the results of the several analyses. Such multi‐threshold testing has been implemented in other network‐based population‐genetic analyses (Greenbaum et al. 2016), and the topological field of persistent homology (Edelsbrunner & Harer 2008) may provide the relevant mathematical framework for developing such an approach.
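
The percolation‐threshold criterion can be sketched in a few lines. The example below is only an illustration under stated assumptions: the pairwise genetic distances are invented, an edge is kept when its distance is at most the threshold, and the threshold is taken to be the smallest value that still leaves all patches connected. The function names are hypothetical and do not correspond to any published package.

```python
def percolation_threshold(dist):
    """Smallest distance threshold at which keeping all edges with
    distance <= threshold still leaves every patch connected."""
    n = len(dist)
    parent = list(range(n))  # union-find forest over patches

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    components = n
    for d, i, j in edges:  # add edges from most to least similar
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
            if components == 1:
                return d
    raise ValueError("need at least two patches")


def prune(dist, threshold):
    """Unweighted adjacency matrix keeping edges with distance <= threshold."""
    n = len(dist)
    return [[1 if i != j and dist[i][j] <= threshold else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical pairwise genetic distances (e.g. FST) between four patches.
dist = [
    [0.00, 0.02, 0.10, 0.12],
    [0.02, 0.00, 0.08, 0.11],
    [0.10, 0.08, 0.00, 0.03],
    [0.12, 0.11, 0.03, 0.00],
]
threshold = percolation_threshold(dist)
pruned = prune(dist, threshold)
```

Calling prune with several thresholds around this value gives a simple way to check how sensitive downstream analyses are to the edge‐inclusion choice.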

Addressing evolutionary dynamics in discrete habitats using networks

Networks may represent a variety of systems, from genes to populations, from brain neurons to the Internet, and network theory has therefore branched out, with different branches addressing different issues related to network topologies and processes. In this section, we will review those branches that have been or may be applied to population‐genetic questions in discrete habitats. Table 1 summarizes the main network terms in this review, along with their population‐genetic equivalents and key literature. Table S1 (Supporting information) lists studies that have developed or implemented network methodologies to study population genetics in the context of discrete habitat patches. Box 1 and Box 2 present a concrete example of a network formulation of evolutionary dynamics in habitat patch networks with heterogeneous selection.

Table 1. Main network terms mentioned in this paper
Network term | Network description | Example of ecological/evolutionary interpretation | Relevant papers, where appropriate (Network/Biology)
Node (vertex) | Discrete entity | Habitat patch/deme |
Edge (link) | Connections/interactions between nodes | Migration rate/migration corridor |
Network (graph) | A set of nodes with connecting edges | Patch networks/mosaics | Newman (2010)/Pascual‐Hortal & Saura (2006), Rozenfeld et al. (2008), Urban et al. (2009), Baranyi et al. (2011)
Weighted network | A network with numerical weights assigned to each of the edges | Patch networks with varying levels of gene flow between patches | Newman (2010)/this paper
Directed network | A network where edges have a direction from one node to the other | Patch networks with asymmetric gene flow between patches | Newman (2010)/Morrissey & de Kerckhove (2009)
Adjacency matrix | A matrix describing a network, in which unconnected nodes are set to 0 and connected nodes are set to 1 (or the weight of their connecting edge) | | Newman (2010)/—
Node degree | The number of edges (or sum of edge weights) connected to the node | Connectivity of a patch | Newman (2010)/Estrada & Bodin (2008)
Centrality | A class of measures used to assign values to nodes according to their position in the network structure. There are various types of centrality measures used to capture different types of values/structures | Patches of particular importance for particular processes, for example gene flow, disease spread, local adaptation | Landherr et al. (2010); Newman (2010)/Rozenfeld et al. (2008)
Community | A group of nodes that are densely connected within the group and sparsely connected to nodes outside the group | Subpopulation/region | Girvan & Newman (2002); Newman & Girvan (2004)/Fortuna et al. (2009), Fletcher et al. (2013)
Component (maximal) | A group of nodes connected between themselves and not to any other node in the network | A maximal connected set of patches | Newman (2010)/Holstein et al. (2014)
Diffusion process | A process that spreads between adjacent nodes | Spread of alleles or mutations through a patch network | Newman (2002a)/Thomas et al. (2012); Neuwald & Templeton (2013)
Percolation threshold | A threshold (e.g. for the mean degree) above which a diffusion process may cover a significant portion of the network | A theoretical threshold for patch connectivity above which alleles or mutations are expected to spread to many patches | Broadbent & Hammersley (1957); Cohen & Havlin (2010)/—
Multilayer | A generalization of a network which consists of several layers, each of which is itself a network of nodes and intralayer edges, with an additional set of interlayer edges connecting nodes in different layers | Hierarchical population structure | Kivelä et al. (2014)/—
Multiplex | A multilayer where all layers contain the same network | Populations with complex life histories, epistasis | Kivelä et al. (2014)/—
Hypergraph | A generalization of a network, where instead of edges connecting pairs of nodes, there are hyperedges connecting arbitrarily many nodes | Ecological genomics | Berge & Minieka (1973)/Weighill & Jacobson (2015)

Box 1. Modelling selection‐migration dynamics in networks

Here, we will consider discrete habitats with complex gene‐flow patterns and heterogeneous selection pressures in the different habitat patches (see Fig. 2). The representation of discrete habitats as network adjacency matrices is convenient for formulating evolutionary dynamics. We will present a simple model for selection‐migration dynamics using a network with self‐loops (edges that connect a node to itself).

The model describes $n$ habitats, where in each habitat the population experiences different selection pressures. Gene flow between patches $i$ and $j$ is given by migration rate $m_{ij}$, representing the proportion of the population at patch $i$ replaced by migrants arriving from patch $j$. Self‐loops (i.e. the diagonal of the migration matrix) represent the proportion of the populations that do not migrate, that is $m_{ii} = 1 - \sum_{j \neq i} m_{ij}$. The model assumes a haploid population and tracks the frequency of one specific allele, $f_{i,t}$, in patch $i$ at generation $t$. The effect of selection on a patch with selection coefficient $s$ and frequency $f_t$ is given by a function $g$ describing the change of frequency in one generation, $f_{t+1} = g(s, f_t)$ (e.g. for soft selection $g(s, f) = \frac{(1+s)f}{1+sf}$; Wallace 1975). Selection is assumed to occur immediately after dispersal. With these definitions, we can now specify the recursion describing our model:
$$f_{i,t+1} = \sum_{j=1}^{n} m_{ij}\, g(s_i, f_{j,t}) \qquad \text{(eqn 1)}$$

Each term in the summation represents the frequency change attributed to individuals arriving from patch $j$ (or individuals staying in patch $i$ for the case of $j = i$), with each migrant group arriving from patch $j$ weighted in proportion to $m_{ij}$. The network $m_{ij}$, along with the selection coefficients $s_i$, fully describes the model.

This model can be generalized to address temporal changes in selection and migration by having s and mij contain functions rather than fixed values. It can also address life‐history traits that affect migration and selection by having g depend on selection coefficients in both the source and the sink nodes of migration edges, in proportion to the time spent in each, as well as the cost of migrating between patches. Box 2 elaborates on the formulation of this model.
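
The recursion in this box is straightforward to iterate numerically. The sketch below is a hedged illustration, not the authors' implementation: it assumes the recursion takes the form f_i(t+1) = sum over j of m[i][j] * g(s[i], f_j(t)), with selection acting in the destination patch, and it assumes the standard haploid selection map g(s, f) = (1 + s) f / (1 + s f). The three‐patch migration matrix and the selection coefficients are invented for the example.

```python
def haploid_selection(s, f):
    """Assumed selection map: the tracked allele has relative fitness 1 + s."""
    return (1 + s) * f / (1 + s * f)


def step(m, s, f, g=haploid_selection):
    """One generation: migrants from patch j arrive in patch i in proportion
    m[i][j] and then experience selection with coefficient s[i]."""
    n = len(f)
    return [sum(m[i][j] * g(s[i], f[j]) for j in range(n)) for i in range(n)]


def iterate(m, s, f0, generations):
    """Apply the one-generation recursion repeatedly, starting from f0."""
    f = list(f0)
    for _ in range(generations):
        f = step(m, s, f)
    return f


# Hypothetical chain of three patches; each row sums to 1, so the diagonal
# (self-loops) holds the proportion of each patch that does not migrate.
m = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
s = [0.1, 0.0, -0.1]  # allele favoured in patch 0, disfavoured in patch 2
f_final = iterate(m, s, [0.5, 0.5, 0.5], generations=500)
```

With these values the allele remains polymorphic, with frequencies ordered along the chain from the patch where it is favoured to the patch where it is disfavoured, which is one simple way to visualize migration load.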

Box 2. Selection‐migration dynamics and walks on networks

The selection‐migration model in Box 1 describes the evolutionary dynamics in the patch network, but the recursive formulation does not provide much insight; moreover, it requires calculation of the dynamics in the entire system to follow dynamics in a single patch. A more tractable formulation can be achieved using the concept of random walks on networks. A walk on a graph (network) is a sequence of nodes such that each pair of adjacent nodes in the sequence is connected by an edge in the network (West 2001), thus capturing the notion of travelling on the network from node to node along the network's edges. The weight of a walk, in weighted networks, is defined as the product of the weights of the edges included in the walk (Newman 2004a).

One can partition the gene pool in node $i$ at time $t$ to genes arriving from different patches of origin (at time 0) and through different walks on the networks to node $i$. Suppose we follow a walk $w = (w_0, w_1, \ldots, w_t)$ of length $t$ from node $j = w_0$ to node $i = w_t$, where each edge describes the proportion of the population migrating along that edge or staying in the same patch in one generation. Each generation we will be following a smaller group of individuals as different parts of the group follow different edges, but the frequency of the allele in this small group we are following will be determined solely by the frequency at the patch of origin and by the selection experienced at habitat patches along the walks. The proportion of the gene pool determined by the genes ‘following’ this walk is the proportion of the original population that remains at the end of the walk, that is the weight of the walk, $m_w = \prod_{k=1}^{t} m_{w_k w_{k-1}}$. The frequency of this part of the population is determined by the selection coefficients along the walk, $g(s_{w_t}, g(s_{w_{t-1}}, \ldots g(s_{w_1}, f_{w_0,0}) \ldots))$, which we will denote as $g_w(f_{w_0,0})$. As each genealogy in each walk is independent, the frequency at node $i$ at time $t$ is the sum over all relevant walks:
$$f_{i,t} = \sum_{w \in W_{i,t}} m_w\, g_w(f_{w_0,0}) \qquad \text{(eqn 2)}$$

where $W_{i,t}$ is the set of all walks of length $t$ terminating in node $i$, and $f_{w_0,0}$ is the frequency at time 0 in the first patch in walk $w$.

The main benefit of this reformulation is that it is no longer recursive, making it more explicit. This also allows teasing apart the contribution of patches, connections of patches or regions of the patch network to the allele frequency in any particular patch in a given time. For example, considering the walks in $W_{i,t}$ that include a patch $j$ (and the selection along these walks) allows for the evaluation of the effect patch $j$ has on the allele frequency of another patch $i$, and this effect can be assessed for the equilibrium state by taking $t$ to be large. How a particular edge or group of patches or connections affect the allele frequency can be addressed in a similar manner. This could be useful for evaluating the influence of specific patches, or groups of patches, on the potential of local adaptation and migration load on patches of interest, or to identify patches or migration corridors with high influence on local adaptation of other patches.
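
The walk‐based sum can be evaluated by brute‐force enumeration on small networks. The sketch below is illustrative only and rests on assumptions: a walk is written as a tuple of patch indices from origin to destination, the step from patch a to patch b uses the entry m[b][a] (proportion of b replaced by migrants from a), and selection is applied in every patch visited after the origin. The function names and numerical values are hypothetical. Enumeration grows as n to the power t, so this is meant for intuition rather than for large networks.

```python
from itertools import product


def walk_weight(m, walk):
    """Product of the edge weights along a walk (w_0, ..., w_t)."""
    weight = 1.0
    for src, dst in zip(walk, walk[1:]):
        weight *= m[dst][src]
    return weight


def frequency_from_walks(m, s, f0, i, t, g):
    """Sum, over all walks of length t ending in patch i, of the walk weight
    times the origin frequency passed through selection along the walk."""
    n = len(m)
    total = 0.0
    for prefix in product(range(n), repeat=t):
        walk = prefix + (i,)
        weight = walk_weight(m, walk)
        if weight == 0.0:
            continue  # some step has no edge, so this is not a walk
        f = f0[walk[0]]
        for node in walk[1:]:
            f = g(s[node], f)
        total += weight * f
    return total


def haploid_selection(s, f):
    """Assumed selection map: the tracked allele has relative fitness 1 + s."""
    return (1 + s) * f / (1 + s * f)


m = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
s = [0.1, 0.0, -0.1]
f0 = [0.2, 0.5, 0.8]
f_patch0 = frequency_from_walks(m, s, f0, i=0, t=3, g=haploid_selection)
```

In the neutral case, where g leaves the frequency unchanged, the walk weights from patch j to patch i sum to the (i, j) entry of the t‐th power of the migration matrix, which gives a quick consistency check on the enumeration. Restricting the loop to walks that pass through a given patch or edge isolates that patch's or edge's contribution.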

Centrality measures

In many network applications, the major interest is in identifying nodes that are of importance for a particular function or process. A variety of centrality measures have been developed to evaluate different aspects of nodes' importance: degree centrality for local importance in the node's neighbourhood; betweenness, flow betweenness and random‐walk betweenness for centrality in flow or diffusion processes; closeness and random‐walk closeness for speed of information transfer; eigenvector centrality for the influence of a node on the network (see Newman 2010 for definitions of these and other measures, and Landherr et al. 2010 for a critical review). These measures assign a numeric value to each node, allowing ranking of nodes according to their centrality, making identification of the most central nodes, for a particular process, possible. The different measures not only address different functions of the network, but they employ different assumptions on the behaviour of the studied function.

In molecular ecology, centrality measures can be used, particularly in empirical population studies, to investigate the importance of certain habitat patches to different evolutionary processes. In many ecological studies, and particularly in conservation, it may be important to know which habitats and which migration corridors are essential for maintenance of genetic diversity in the population as a whole, which constitute major pathways for gene flow, and which affect potential for local adaptation. Network analysis has been applied to identify central habitat patches in many systems and taxa, including mammals (Garroway et al. 2008, 2011; Ball et al. 2010; Creech et al. 2014; Fiset et al. 2015), frogs (Munwes et al. 2010; Naujokaitis‐Lewis et al. 2013), invertebrates (Janes et al. 2014; Triponez et al. 2015), seagrass (Rozenfeld et al. 2008), trees (Richards et al. 2009; Herrera‐Arroyo et al. 2013) and annual plants (Sexton et al. 2016). In these studies, degree, betweenness and eigenvector centralities, and occasionally closeness and flow centralities, have been used to locate patches central to gene flow in the network, highlighting subpopulations of ecological or conservation interest.

Although many centrality measures have been applied in such studies, not much attention has been given to the different functions these measures address. The appropriate measure to be used in a particular study should depend mainly on the ecological or evolutionary question of interest. Degree centrality (at the level of the node) can be used as a measure of connectivity of a patch (the number of patches to which the patch is connected), but it only quantifies centrality at the local scale, and may be misleading for more global questions, such as conservation prioritization. A more global centrality measure is eigenvector centrality, which quantifies the influence of a node on the entire network by considering not only the connections of the node but also the centrality of the nodes to which it is connected (more precisely, it ranks nodes based on the attraction of random walks at the stationary state). This centrality measure may be useful for questions regarding local adaptation and migration load, particularly in theoretical and modelling studies which involve selection.
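
The difference between the local and the global view can be seen on a toy network. The following sketch is an illustration under stated assumptions: the network is undirected, connected and unweighted, eigenvector centrality is approximated by power iteration on the adjacency matrix plus the identity (a common shift that avoids oscillation on bipartite networks and does not change the leading eigenvector), and the six‐patch network is invented.

```python
def degree_centrality(adj):
    """Number of patches each patch is directly connected to."""
    return [sum(1 for w in row if w > 0) for row in adj]


def eigenvector_centrality(adj, iterations=1000, tol=1e-12):
    """Leading eigenvector of the adjacency matrix, scaled to sum to 1."""
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + I) and renormalize.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(y)
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical patch network: patch 0 is a hub with neighbours 1, 2 and 3,
# followed by a chain 3 - 4 - 5.
edges = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
n = 6
adj = [[0.0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0

deg = degree_centrality(adj)
eig = eigenvector_centrality(adj)
```

Patches 3 and 4 have the same degree, but patch 3 receives the higher eigenvector centrality because it is adjacent to the hub, so the two measures rank these patches differently.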

Most often, when applying centrality measures to empirical habitat patch networks, interest lies in identifying patches essential for gene flow in the network as a whole; the class of ‘flow measures’ is more likely to be correlated with gene flow than other classes. Many studies have applied betweenness, the most commonly used flow centrality measure, which counts the number of shortest paths connecting the rest of the network that pass through each given node. However, this measure is designed for studying information that is supervised and spreads only along shortest paths in the network, as is most often the case in social or computer networks where information is consciously directed. As gene flow in natural populations is not confined to shortest paths, nor is it consciously supervised, this measure, as well as other ‘supervised’ flow measures, is less than optimal. Perhaps the most appropriate centrality measure with respect to gene flow is the random‐walk betweenness (Newman 2005), which quantifies the number of random walks between nodes that pass through each node. This measure better reflects the unsupervised and stochastic nature of gene flow between habitat patches.
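
Random‐walk betweenness can be computed with the current‐flow formulation described by Newman (2005): invert the network Laplacian with one row and column removed, and treat each source‐target pair as a unit current through a resistor network. The sketch below follows that recipe for small, connected, undirected networks using only the standard library; the helper names and the four‐patch example are hypothetical, and the code has not been checked against any published implementation.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: is the network connected?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_walk_betweenness(adj):
    """Average current through each node over all source-target pairs;
    source and target nodes count as carrying the full unit current."""
    n = len(adj)
    lap = [[(sum(adj[i]) - adj[i][i]) if i == j else -adj[i][j]
            for j in range(n)] for i in range(n)]
    inv = invert([row[:n - 1] for row in lap[:n - 1]])
    T = [row + [0.0] for row in inv] + [[0.0] * n]
    between = [0.0] * n
    for s in range(n):
        for t in range(s + 1, n):
            V = [T[i][s] - T[i][t] for i in range(n)]  # node potentials
            for i in range(n):
                if i == s or i == t:
                    between[i] += 1.0
                else:
                    between[i] += 0.5 * sum(adj[i][j] * abs(V[i] - V[j])
                                            for j in range(n))
    pairs = n * (n - 1) / 2
    return [b / pairs for b in between]


# Hypothetical network: a triangle of patches 0, 1, 2 with patch 3 attached to 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
rwb = random_walk_betweenness(adj)
```

In this example patch 0 lies on no shortest path between other patches, yet it scores higher than the pendant patch 3, because some random walks between patches 1 and 2 detour through it.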

Communities in networks

Many networks in nature have a modular structure, where certain groups of nodes are more connected among themselves than to other nodes in the network. Such substructures within a network are known as modules or communities (the name stems from human social networks, where such substructures are interpreted as social communities). While there is no single definition for a community, it is roughly thought of as a dense subnetwork (subgraph) within a network. The study of communities in networks has gained considerable momentum with the development of the modularity measure (Girvan & Newman 2002; Newman 2004b; Newman & Girvan 2004), which assigns quality values to community partitions of the networks. The basic idea for modularity evaluation is to compare the intracommunity densities (number and weights of edges) with what would be expected in a random network with the same node degrees. This allows quantification of how much more (or less) modular the network is than a similarly structured random network. The measure of modularity, as well as other techniques that have since been developed, allows for computationally efficient detection of communities even in large and complex networks (reviewed in Fortunato 2010).
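
For a given partition, the modularity comparison described above reduces to a short double sum. The sketch below assumes an undirected network given as a symmetric adjacency matrix and uses the standard form Q = (1/2m) * sum over same-community pairs of [A_ij - k_i k_j / 2m], where k_i is the degree of node i and m the total edge weight; the example network of two triangles joined by one edge is invented.

```python
def modularity(adj, communities):
    """Modularity Q of a partition of an undirected network.
    communities[i] is the community label of node i."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # observed weight minus the expectation given the degrees
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Hypothetical network: patches 0-2 and 3-5 form two triangles, linked by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
adj = [[0.0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0

q_two_groups = modularity(adj, [0, 0, 0, 1, 1, 1])
q_one_group = modularity(adj, [0, 0, 0, 0, 0, 0])
```

Splitting the network into its two triangles yields a positive Q, whereas placing all patches in one community yields zero, which is the baseline against which candidate partitions are compared.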

Many discrete habitats in nature are also modular, and gene‐flow patterns in such habitats often have hierarchical and modular genetic structures (e.g. McCauley & Eanes 1987; Fletcher et al. 2013; Viricel & Rosel 2014; Pisa et al. 2015). Such hierarchical structures are often difficult to observe, especially when the underlying migration patterns are complex. Traditionally, such structures are detected using the F‐statistics framework (Wright 1950), but here only a priori, putative, structures can be tested, and complexity of migration is not accounted for. Therefore, some studies have adopted a network framework and utilized community detection procedures to detect structure at various hierarchical levels, whether by looking at networks of individuals (Cohen et al. 2013; Greenbaum et al. 2016) or networks of habitat patches (Fortuna et al. 2009; Kininmonth et al. 2010; Munwes et al. 2010; Albert et al. 2013; Fletcher et al. 2013; Peterman et al. 2016). Detected communities are interpreted as clusters of habitat patches that are more genetically similar between themselves than to other habitat patches. The detection of these structures is done without a priori partitioning or knowledge of the number of clusters present in the population, which is often advantageous in molecular ecology studies, where such information is unavailable, or when we do not want to confine the results to presumed subpopulations.

While the rather limited utilization of community procedures in molecular ecology has so far been mostly focused on revealing hierarchical structures in empirically sampled populations, theoretical studies aimed at understanding evolutionary dynamics in discrete habitat patches may benefit from adopting a network approach as well. There is currently an effort among network theorists to provide a more general framework for studying modular networks (Fortunato 2010), particularly using generalized community models (Newman & Peixoto 2015; Zhang et al. 2016). Such models, adapted to the population‐genetic context, may be used in the future to ask questions regarding the formation of complex modular structures through evolutionary dynamics, or about processes in modular habitat patch networks, such as spread of alleles or local adaptation.

Percolation and diffusion

Percolation theory in networks is concerned with the ability to find an available path through a network (i.e. if introduced on one side, what is the probability of being able to follow a path through the network to reach the other side successfully; Broadbent & Hammersley 1957), which relies on understanding the formation of disconnected components in networks by different processes. Both the formulation of this question and the techniques used to address it have been approached in many ways.

One formulation of a percolation problem is concerned with the ability of networks to remain connected when nodes or edges are removed from the network. It has been observed that in many networks there is a sharp threshold, termed the percolation threshold, above which the network remains relatively intact, with one component containing a significant proportion of the nodes in the network (this component is known as a giant component), and below which the network breaks down into many small disconnected components. This behaviour of a sharp transition between two regimes means that a system that depends on connectivity between different parts of the network may collapse rapidly from a functional state to a dysfunctional state once the percolation threshold is breached. Such behaviour can be of severe consequence to populations with complex gene‐flow patterns between habitat patches that depend on adequate gene flow; understanding where the percolation threshold lies may be crucial for conservation and management of such populations (Bascompte & Sole 1995; Cumming et al. 2010). The tailoring of this question to the narrower formulation of ‘invasion percolation’, which considers ‘paths of least resistance in traversing the network’ rather than the probability of an available path simply existing, may also be of use (Wilkinson & Willemsen 1983; Furuberg et al. 1988), as networks that are connected but are characterized by very high resistance between regions may also be of conservation concern. This has recently been addressed in landscape genetics (McRae et al. 2008; Schwartz et al. 2009; Kershenbaum et al. 2014), but seemingly not in the context of patch networks.
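
The fragmentation formulation can be illustrated by tracking the relative size of the largest connected component as patches are lost. The following Python sketch is a hypothetical illustration rather than a method taken from the cited studies: the function name and input format are assumptions, edges are treated as unweighted and undirected, and the component size is expressed as a fraction of the original number of patches.

```python
from collections import deque


def largest_component_fraction(n_patches, edges, removed=()):
    """Fraction of the original patches lying in the largest connected
    component once the patches in `removed` (and their edges) are deleted."""
    removed = set(removed)
    neighbours = {i: set() for i in range(n_patches) if i not in removed}
    for a, b in edges:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen = set()
    largest = 0
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search over one connected component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)
    return largest / n_patches


# Hypothetical example: six patches arranged in a chain.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
lost = []
for patch in [2, 4, 0]:          # an assumed order of patch loss
    lost.append(patch)
    print(lost, largest_component_fraction(6, chain, lost))
```

Repeating such a calculation over many random or targeted removal sequences in a larger network gives an empirical picture of where the largest component collapses, which is the quantity the percolation threshold describes.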

A second formulation of a percolation problem that may prove useful in the study of evolutionary dynamics is concerned with the extent to which a diffusive process would spread through a network. This has been well studied in the context of epidemiology, where the probability and size of an outbreak are studied using percolation theory (Newman 2002a,b). Here, the percolation threshold delineates scenarios where an outbreak, infecting a large part of the population, may occur, and the probability and size of such occurrences are addressed. This epidemiological problem resembles the problem of spread of alleles, particularly novel mutations, in patch networks. In patch networks, the probability and extent of reach of new mutations depend on the structure of the network, the levels of gene flow and the selection pressures at the different habitat patches. While this problem is yet to be addressed using network terminology, existing formulations in network epidemiology (a network describing epidemiologically relevant contacts between individuals) may prove extremely useful.

Although percolation addresses the probability of there being a path through a network and the emergence of a giant component, it does not address more nuanced questions of network‐based flow that can be of significant interest when trying to anticipate dynamics in a network over time. Traditionally, continuous diffusion models are used to explore such dynamics (Crank 1975), and similar discrete network diffusion models have also been developed (e.g. Leskovec et al. 2007; López‐Pintado 2008; Kasprzyk 2012). While our understanding of the general behaviour of diffusion processes in networks is still limited, models for studying these behaviours might prove useful to answer questions such as: What is the expected time for a successful, locally adaptive mutation to reach another environmentally similar habitat? If a mutation is introduced at a particular patch, after a set duration, what is its likely distribution throughout the entire network (i.e. how far will it have permeated)? What are the transient spread dynamics and the ultimate stationary distributions (if any) for a mutation spreading to new patches and continuing to circulate among patches where it is already present (see, for example, Neuwald & Templeton 2013, where the temporal diffusion of alleles in a patchy habitat of collared lizards is tracked)? If the new mutation is beneficial only in certain patches, how will the diffusion dynamics affect the potential for local adaptation in these patches?
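
Some of these questions can be explored with very simple discrete‐time models. The sketch below is a deliberately simplified, hypothetical illustration and not one of the cited diffusion models: it assumes a neutral allele, deterministic dynamics without drift or selection, and a backward migration matrix in which entry (i, j) is the proportion of patch i drawn from patch j each generation (each row sums to 1). All names and parameter values are assumptions.

```python
def step(freqs, migration):
    """One generation of migration: new frequency in patch i is the
    migration-weighted average of the current frequencies."""
    return [sum(m_ij * p_j for m_ij, p_j in zip(row, freqs))
            for row in migration]


def spread(freqs, migration, generations):
    """Allele frequencies in every patch for generations 0..generations."""
    history = [list(freqs)]
    for _ in range(generations):
        history.append(step(history[-1], migration))
    return history


def first_arrival(history, patch, threshold=0.0):
    """First generation in which the frequency in `patch` exceeds
    `threshold`, or None if this never happens within the history."""
    for generation, freqs in enumerate(history):
        if freqs[patch] > threshold:
            return generation
    return None


# Hypothetical example: four patches in a chain, exchanging 10% of their
# gene pool with each adjacent patch per generation.
migration = [[0.9, 0.1, 0.0, 0.0],
             [0.1, 0.8, 0.1, 0.0],
             [0.0, 0.1, 0.8, 0.1],
             [0.0, 0.0, 0.1, 0.9]]
start = [1.0, 0.0, 0.0, 0.0]     # allele initially fixed in patch 0 only
history = spread(start, migration, 50)
print(history[10])                # distribution after a set duration
print(first_arrival(history, 3))  # generations needed to reach patch 3
```

Because the assumed migration matrix is symmetric, the mean allele frequency across patches is conserved and the frequencies converge to a uniform stationary distribution; asymmetric gene flow, drift or patch‐specific selection would change both the transient and the stationary behaviour.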

Multilayers, multiplexes and hypergraphs

While the above techniques may already provide a diversity of relevant tools for exploring discrete systems, they have all been limited in their representations of those systems to traditional networks, that is, a set of nodes and the edges that connect pairs of them. Many useful generalizations of this concept have also been developed, most falling under two frameworks: multilayer networks (Kivelä et al. 2014) and hypergraphs (Berge & Minieka 1973). Multilayer networks consist of several interconnected layers, each containing a regular network (nodes within the same layer may be connected by intralayer edges, while nodes in different layers may be connected by interlayer edges). A multilayer network where the nodes in the different layers are the same entities is called a multiplex network (Lee et al. 2012). This generalization extends the applicability of network theory to systems that contain more than one network, such as in explorations of competition among viral strains transmitted in networks where the nodes are hosts (Darabi Sahneh & Scoglio 2014), or the resilience of passenger rerouting in air travel with multiple carrier networks and random system failures (Cardillo et al. 2013).

Multilayer networks can be applied to the study of evolutionary dynamics in discrete habitats in several ways. First, multilayer networks can be used to model hierarchical population structures, where gene‐flow patterns at local and regional scales are governed by different dynamics. In this formulation, a local network, labelled as a layer, describes the gene‐flow patterns between patches in one region and consists of patches and intralayer edges, and inter‐regional dynamics are described by adding interlayer edges describing gene flow between patches from different regions. Different dynamics at different spatial scales may lead to different patterns in inter‐ vs. intra‐edges, which will result in different patterns of degree distributions.

Multiplex networks may be used to explicitly model selection and migration in organisms whose migration and selection patterns differ among life stages and/or sexes. A multiplex of discrete habitats would consist of several layers, where in each layer, each habitat patch is represented by a node. The layers represent the different life stages (and/or different sexes) of the organism, and life stage‐specific migration patterns and selection pressures may result in different networks at the different layers. The edges that connect the nodes between layers (only nodes representing the same patch may be connected between layers in a multiplex network) describe the demographic rates of flow from one life stage to another. The allele frequencies at each layer of the multiplex indicate the allele frequency of a given stage in a given patch. Such multiplex networks may be used to address questions of centrality, percolation, and diffusion while taking into account all components of the population.
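
One common way of representing such a multiplex for computation is a supra‐adjacency matrix, in which each layer occupies a diagonal block and interlayer edges fill the off‐diagonal blocks. The Python sketch below builds this matrix for a hypothetical two‐stage organism; the function name, the input format and all rates are illustrative assumptions. Interlayer edges are restricted to replicas of the same patch, as required in a multiplex network.

```python
def supra_adjacency(layers, transitions):
    """Build the supra-adjacency matrix of a multiplex patch network.

    layers: list of n x n matrices, one per life stage, holding the
        stage-specific gene-flow weights between patches.
    transitions: dict mapping (from_layer, to_layer) to a list of n
        per-patch rates of flow between the two stages.
    Node (layer l, patch i) is stored at index l * n + i.
    """
    n = len(layers[0])
    size = n * len(layers)
    supra = [[0.0] * size for _ in range(size)]
    # Intralayer edges: each layer fills one diagonal block.
    for l, layer in enumerate(layers):
        for i in range(n):
            for j in range(n):
                supra[l * n + i][l * n + j] = float(layer[i][j])
    # Interlayer edges: only between replicas of the same patch.
    for (a, b), rates in transitions.items():
        for i, rate in enumerate(rates):
            supra[a * n + i][b * n + i] = float(rate)
    return supra


# Hypothetical example: three patches, juvenile (layer 0) and adult (layer 1).
juvenile = [[0.0, 0.2, 0.0],
            [0.2, 0.0, 0.1],
            [0.0, 0.1, 0.0]]
adult = [[0.00, 0.05, 0.05],
         [0.05, 0.00, 0.00],
         [0.05, 0.00, 0.00]]
maturation = {(0, 1): [0.5, 0.4, 0.3]}   # assumed juvenile-to-adult rates
for row in supra_adjacency([juvenile, adult], maturation):
    print(row)
```

With the multiplex stored in this form, measures defined on ordinary adjacency matrices can be applied to the whole system, although their interpretation across layers needs care.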

Multiplex networks can also be used to model selection acting on a suite of interdependent alleles. The gene‐flow dynamics at a given locus can be described by the dynamics on one layer, while the selection coefficients for each layer are described by a different vector (for example as shown in Box 1). Allele frequencies are determined in each time step by evaluating the interaction of alleles in all layers (loci), taking into account the allele frequencies to determine the probability of co‐occurrence of the interacting alleles (selection pressures may differ given presence of other alleles in the population). If gene‐flow patterns are identical for all alleles, then the different layers may be identical; however, in the case of alleles affecting, for example, dispersal, gene‐flow patterns may be different (e.g. an allele affecting a long‐range dispersal trait should be formulated as a different patch network than an allele not affecting dispersal).

While a multiplex network with several layers may be sufficient to model several interacting alleles, this interaction can often be very complex and involve many alleles. Hypergraphs are mathematical constructs similar to networks; they consist of nodes and hyperedges, where hyperedges may connect more than a pair of nodes (i.e. a hyperedge may connect triplets or quadruples of nodes; Berge & Minieka 1973). Hypergraphs are often used in the field of genomics to represent interactions among several alleles for particular functions (Sole & Pastor‐Satorras 2006; Tian et al. 2009; Weighill & Jacobson 2015). Thus, a system composed of a multiplex network, describing gene‐flow dynamics in different alleles, coupled with a hypergraph describing the interaction of the different alleles, provides a general network model for addressing questions in ecological genomics (Savolainen et al. 2013; Landry & Aubin‐Horth 2014).

Work on these network extensions is by no means exhausted, and novel mathematical techniques in this area are a focus of many recent publications. Although the extensions are more complicated in their mathematical characterization, each of the measures of interest mentioned above (e.g. centrality measures, communities, random walks, and percolation and diffusion) has already been extended to capture analogous features in hypergraphs, multiplex networks and multilayer networks (Bonacich et al. 2004; Bradde & Bianconi 2009; Gao et al. 2011; Cellai et al. 2013; De Domenico et al. 2013; Gomez et al. 2013; Lu & Peng 2013; Battiston et al. 2014; Boccaletti et al. 2014; Solé‐Ribalta et al. 2014).

Conclusion

Network methods have so far been primarily applied to detect and analyse characteristics of gene‐flow patterns in discrete habitats. This significantly expands our molecular ecology toolkit for examining population structure, for example by allowing us to identify central patches, central corridors and hierarchical structures. However, one of the main challenges of molecular ecology is to connect these patterns to evolutionary and ecological processes, not merely to observe them. Molecular ecology, and conservation genetics in particular, will be able to move forward considerably when we are able to make predictions regarding the evolutionary consequences of real‐world population structures. In this respect, network theory is yet to be exploited, and we believe it has much more to offer in terms of theoretical modelling. Using a network framework to design population‐genetic models (for example as in Boxes 1 and 2), we might be able to get a better handle on evolutionary dynamics.

To understand the role of patch networks in evolutionary dynamics, we would first need to evaluate the typical gene‐flow topologies found in natural systems. These topologies may, in some circumstances, be approximated fairly well by models such as the stepping‐stone or SRMM, but when geographically‐independent factors influence migration and gene flow (e.g. heterogeneous matrix, as in Fig. 2; migration corridors; anthropogenic influences, when organisms ‘hitchhike’ on human transport networks), neglecting the complexity of gene flow may be problematic. In this context, it is important to remember that the evolutionary effect of gene flow is related to both distance and quantity – low amounts of long‐distance gene flow can have a much larger effect than large amounts of short‐distance migration, as demonstrated by the stepping‐stone model (Kimura & Weiss 1964). Therefore, even if geographic distance is the main factor influencing gene flow, other factors inducing long‐distance migration cannot always be neglected when considering evolution, and a complex topology may need to be assumed even in such cases.

In many fields, important insights have been gained by examining the peculiar characteristics of the network description of the systems in those fields, and in molecular ecology, we are only beginning to describe network topologies. As more and more natural discrete‐habitat gene‐flow networks are described and analysed, we will be better able to characterize gene‐flow topologies and get a clearer idea of the underlying processes forming and acting upon such networks. This will be an important step in understanding the role of gene‐flow complexity in evolutionary dynamics, and network modelling will be crucial at this stage. Particularly, as was pointed out earlier, knowing to what extent natural gene‐flow networks are simple or ‘spatially realistic’ would help us understand whether, and in what circumstances, previous modelling efforts were realistic, and how important it will be to further develop the integration of network theory into population genetics and molecular ecology.

Acknowledgements

The authors would like to thank Alan R. Templeton, Shirli Bar‐David, Dina M. Fonseca and anonymous reviewers for their insightful comments. The collaboration between the co‐authors was made possible thanks to the Aharon and Ephraim Katzir Fellowship granted to GG by the Batsheva de Rothschild Fund.

G.G. and N.H.F. both developed the concept and ideas in this review. G.G. primarily wrote the paper, and significant contributions were made by N.H.F. Both authors gave final approval for publication.

