TempNet: a method to display statistical parsimony networks for heterochronous DNA sequence data


  • Stefan Prost,

    Corresponding author
    1. Department of Anatomy and Structural Biology, Allan Wilson Centre for Molecular Ecology and Evolution, University of Otago, Dunedin 9054, New Zealand
    2. Department of Integrative Biology, University of California, Berkeley, CA 94720-3140, USA
      Corresponding author. E-mail: stefan.prost@anatomy.otago.ac.nz
  • Christian N. K. Anderson

    1. Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, USA



1. Heterochronous data have been used to study demographic changes in epidemiology and ancient DNA studies, revolutionizing our understanding of complex evolutionary processes such as invasions, migrations and responses to drugs or climate change. While there are sophisticated applications based on Markov-Chain Monte Carlo or Approximate Bayesian Computation to study these processes through time, summarizing the raw genetic data in an intuitively meaningful graphic can be challenging, most notably if identical haplotypes are present at different points in time.

2. We present temporal networks, an attractive way to display and summarize relationships within the heterochronous data so commonly used in ancient DNA or epidemiological research. TempNet is a user-friendly R script that creates journal-quality figures from genetic data in standard formats (FASTA, CLUSTAL, etc.). These figures are customizable and interactive within the R graphics window. Using three examples, we demonstrate that TempNet can deal with standard-sized datasets, as well as datasets of hundreds of sequences from fast-evolving organisms.

3. Temporal networks are flexible ways to illustrate genetic relationships through time. Furthermore, this approach is not limited to time-stamped data, but can also be used for different data partitioning strategies, such as spatial or phenotypic groupings. The R script presented here will be useful in illustrating complex genetic relationships between groups.


Heterochronous DNA data consist of sequences of different ages. Such data are often used in epidemiology and ancient DNA subdisciplines to study demographic changes over time (Drummond et al. 2003). In epidemiology, they can provide substantial insights into the spread and evolution of infectious diseases and therefore help to control possible future outbreaks (Pybus & Rambaut 2009). In ancient DNA research, DNA extracted from samples up to hundreds of thousands of years old is often used to investigate demographic changes in mammal populations in response to climate and habitat change or the spread of modern humans (e.g. Hadly et al. 2004; Shapiro et al. 2004; Campos et al. 2010; Prost et al. 2010).

The inferences drawn from such data when analysed with the powerful tools of the field [such as temporal approaches based on Markov chain Monte Carlo sampling (Drummond et al. 2005) and Approximate Bayesian Computation (ABC; Beaumont, Zhang & Balding 2002)] lend themselves well to graphical expression (such as skyline plots and joint posterior distributions). However, depicting the raw data themselves is problematic. The familiar haplotype network is a two-dimensional, intuitively appealing summary of genetic diversity within a single group, in which the size of each node represents the frequency of a haplotype, and the length of (or number of tick-marks on) the links represents the amount of genetic divergence (Posada & Crandall 2001). To display information from more than one group, researchers must resort to replacing the nodes of the haplotype network with pie charts. The results are generally difficult to interpret; as one prominent graphic designer has said, ‘the only worse design than a pie chart is several of them’ (Tufte 2001, p. 178). A more elegant and accessible way to explore temporal coherence is a three-dimensional figure in which the networks from each sampled timepoint are arranged in distinct levels and haplogroups shared between levels are connected by vertical columns. The first example of such a design was recently published in Prost et al. (2010). In their study, two-dimensional networks were constructed using TCS software (Clement, Posada & Crandall 2000) and subsequently combined into a three-dimensional structure by hand using standard graphical tools. However, constructing three-dimensional networks by hand is difficult and time-consuming. Here, we present an R script to automatically produce three-dimensional statistical parsimony networks, substantially alleviating both problems.


TempNet is written for the open-source statistical environment R (http://www.r-project.org/) and uses the freely available ‘ape’ (Paradis, Claude & Strimmer 2004), ‘pegas’ (Paradis 2010) and ‘tcltk’ libraries (available from http://cran.r-project.org/web/packages/). The open-source script is available at http://www.stanford.edu/group/hadlylab/tempnet/. Each layer in the network is constructed using statistical parsimony (Templeton, Crandall & Sing 1992), which connects the most closely related nodes first, until the parsimony limit is reached (the expected 97·5th percentile of the maximum number of substitutions; Templeton, Crandall & Sing 1992; Posada & Crandall 2001). TempNet uses the read.dna() function from the ‘ape’ package to import DNA sequences in standard formats (FASTA, CLUSTAL, etc.), so the user only has to specify the location of the data file and assign each sequence to a layer. This can be done either by appending a $[layer#] tag to sequence names in the data file or by specifying a vector of layer numbers in R. The script will produce a two-dimensional statistical parsimony network if neither labels nor a vector is provided. Users can customize the relative size of the circles for haplotypes that are present in a layer and for those that are absent from it, and can assign a scale length corresponding to one mutation on the links. Users can specify a label for each layer as well. An example data file and the R code can be found in the Supporting Information online as well as on the project homepage (http://www.stanford.edu/group/hadlylab/tempnet/).
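The layer-tagging convention described above can be illustrated with a few lines of base R. This is only a sketch of how a trailing $[layer#] tag might be split from a sequence name; it is not taken from the TempNet script, and the sequence names and the variable names (seq_names, layers, clean_names) are hypothetical.

```r
# Hypothetical sequence names; the trailing "$<number>" marks the layer.
seq_names <- c("seqA$1", "seqB$1", "seqC$2", "seqD$3", "seqE")

# Identify names that end in a "$<digits>" tag.
has_tag <- grepl("\\$[0-9]+$", seq_names)

# Extract the layer number where a tag exists; untagged names stay NA.
layers <- rep(NA_integer_, length(seq_names))
layers[has_tag] <- as.integer(sub("^.*\\$([0-9]+)$", "\\1", seq_names[has_tag]))

# Sequence names with the tag stripped, e.g. for use as plot labels.
clean_names <- sub("\\$[0-9]+$", "", seq_names)

print(data.frame(name = clean_names, layer = layers))
```

A vector such as layers corresponds to the second option mentioned in the text, in which layer numbers are supplied directly in R rather than embedded in the data file.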

Network appearance

Haplotypes are represented by ellipses. Each ellipse is scaled, so that its area corresponds to the number of sequences represented by that haplotype. A haplotype that is not found in a particular layer (but is found elsewhere in the network) appears as a white ellipse. The size of these ellipses can easily be changed to zero if their presence is not desired. Extant haplotypes are connected by solid lines, whereas lines connecting at least one unsampled haplotype are dotted. Haplotypes separated by more than one mutation are indicated by one small black circle for each additional mutation. Haplotypes present in consecutive layers are connected by vertical lines. By default, ellipses are arranged to minimize the distortion of genetic distance between haplogroups and to provide maximum geometric separation between nodes radiating from the same hub. Once constructed, the network can be rearranged with a point-and-click interface in the R graphics window for a cleaner layout or to emphasize certain elements of the data. If desired, the layers can be outlined by a transparent grey plane.
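The scaling rule for ellipses can be made concrete with a small base R sketch. It assumes a fixed aspect ratio, so that making the area proportional to the haplotype count means scaling linear dimensions by the square root of the count. The data and names (haplotype, layer, unit_radius) are hypothetical, and the code is not the TempNet implementation.

```r
# Hypothetical assignment of six sequences to haplotypes and layers.
haplotype <- c("H1", "H1", "H2", "H1", "H3", "H3")
layer     <- c(1, 1, 1, 2, 2, 2)

# Rows are haplotypes, columns are layers; a zero means the haplotype
# was not sampled in that layer.
counts <- unclass(table(haplotype, layer))

# Area proportional to count implies radius proportional to sqrt(count).
unit_radius <- 1  # radius for a haplotype sampled once (arbitrary choice)
radius <- unit_radius * sqrt(counts)

# Haplotypes absent from a layer (drawn as white ellipses in the text).
absent <- counts == 0

print(counts)
print(round(radius, 2))
print(absent)
```

With this rule, a haplotype sampled four times is drawn with twice the radius, and therefore four times the area, of a haplotype sampled once.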

Results and discussion

We present three examples to illustrate the use and the strength of our R script and temporal network representations. First, we use partially unpublished data from Chan as an example of a standard ancient DNA dataset. In example two, we use ancient DNA from Caramelli et al. (2007) as well as a large modern-day sampling from GenBank to demonstrate the capability of our approach to display large datasets. In the last example, we use the data from Bennett et al. (2003) to show the R script’s applicability in epidemiological research. Viruses are rapidly evolving pathogens that often exhibit much higher mutation rates than mammals (Drummond et al. 2003; Duffy, Shackelton & Holmes 2008); their genetic data are consequently challenging to represent clearly. All data files used to construct the examples below can be accessed via the TempNet homepage (http://www.stanford.edu/group/hadlylab/tempnet/). Please note that the partially unpublished tuco–tuco dataset cannot be uploaded until the publication of the sequences.

Example 1

The social tuco–tuco (Ctenomys sociabilis) is an endemic species inhabiting the arid, steppe grassland in the Neuquén Province in Argentina (Bidau, Lessa & Ojeda 2008). It is listed as critically endangered in the IUCN database based on its modern-day confinement to an area of <100 km² (Bidau, Lessa & Ojeda 2008). Chan, Anderson & Hadly (2006) used serial sampling of ancient and modern DNA to infer demographic changes through time and inform modern-day conservation efforts. The dataset used here includes sequences from the Chan, Anderson & Hadly (2006) paper and also unpublished shorter fragments provided by Chan. The dataset will be available for download on the TempNet homepage upon publication of the data.

In general, standard haplotype networks are useful tools for understanding the number, relative frequency and dissimilarity of haplotypes within a single population. Even without stacking, it would have been clear that modern tuco–tucos (Ctenomys sociabilis) are dominated by just one haplogroup and are less diverse than their ancestors. However, the temporal network emphasizes both the reduction in diversity and the strong directionality of their evolution (see Fig. 1).

Figure 1.

 A typical ancient DNA dataset (Ctenomys sociabilis). Example 1 shows a typical ancient DNA dataset (Y. Chan, unpublished data). As with the other figures, the network’s three different time layers are in stratigraphic order with the oldest sequences at the bottom (red) and the youngest (blue) at the top. It clearly depicts the extinction of the clades on the right-hand side of the figure and the genesis of clades on the left-hand side.

Example 2

Ancient DNA datasets often include far more modern DNA sequences than actual ancient ones, as modern DNA sequencing is much cheaper and faster. In addition, ancient DNA preservation relies on many different factors and therefore is the exception rather than the rule (Hofreiter et al. 2001). We used the ancient human DNA dataset from Caramelli et al. (2007) (n = 23) along with 234 modern Sardinian mitochondrial control region DNA sequences from GenBank (DQ067827–DQ067877, DQ081420–DQ081607 and DQ081669–DQ081715) to show that temporal network reconstructions are suitable for summarizing relationships within large datasets (see Fig. 2).

Figure 2.

 Large DNA dataset (Modern humans from Sardinia). Example 2 demonstrates the ability of the temporal network approach to clearly illustrate genetic relationships even when a large modern DNA sampling is accompanied by a small ancient DNA sampling. Here, we show genetic diversity through time in the human Sardinian population (data from Caramelli et al. 2007 and GenBank), clearly showing that ancient haplotypes are nested within the modern-day diversity.

The Sardinian prehistory is of special interest for evolutionary biologists as its modern-day population strongly differs genetically from its Italian and other European neighbours (Barbujani & Sokal 1990; Barbujani et al. 1995). For example, many studies show over-representation of rare European mtDNA and Y-chromosome haplotypes in the Sardinian population (Morelli et al. 2000; Semino et al. 2000; Quintana-Murci et al. 2003). Caramelli et al. (2007) used ancient DNA, along with published modern-day data, to infer Sardinia’s genetic prehistory. As already noted by Caramelli et al. (2007), most ancient haplotypes are still present in modern-day Sardinian populations, a fact emphasized in the temporal network by the vertical lines and the central position of the red circles. The temporal network’s numerous white ellipses in the ancient layer clearly show the lower diversity in the ancient samples.

Example 3

Heterochronous sampling is a common and powerful practice in epidemiological studies. Viruses have short generation times and high mutation rates, which make them perfect model organisms to study temporal demographic changes (Drummond et al. 2003). To show the applicability of our approach in epidemiological studies, we used the data from Bennett et al. (2003), which consist of 75 sequences from the dengue virus (DENV-4 isolates) randomly sampled from Puerto Rico in the years 1982 (n = 14), 1986/1987 (n = 19), 1992 (n = 15), 1994 (n = 14) and 1998 (n = 13; see Fig. 3).
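As a hedged illustration, the sample sizes listed above can be turned into a vector of layer numbers of the kind described in the implementation section. The sketch assumes, purely for illustration, that the sequences in the data file are ordered chronologically; the variable names are hypothetical.

```r
# Sampling years and sample sizes as reported for the dengue dataset.
years <- c("1982", "1986/1987", "1992", "1994", "1998")
n     <- c(14, 19, 15, 14, 13)

# One layer number per sequence, ASSUMING the file lists sequences
# in chronological order (oldest sampling year first).
layers <- rep(seq_along(n), times = n)

# The vector should cover all 75 sequences.
stopifnot(length(layers) == sum(n))

# Tabulate sequences per sampling year as a sanity check.
per_year <- table(factor(layers, labels = years))
print(per_year)
```

If the file is not ordered by sampling year, the layer vector has to be built from the sample metadata instead.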

Figure 3.

 Dozens of sequences from fast-evolving viruses (Dengue). Example 3 consists of temporally sampled dengue virus DNA data (from Bennett et al. 2003). Viruses usually exhibit short generation times and high mutation rates. Although these characteristics present challenges for clear data representations, we demonstrate the strong ability of temporal networks and the TempNet script to provide easy and clear illustrations.

Dengue is a mosquito-borne RNA virus, which causes dengue fever, the more acute dengue haemorrhagic fever, and dengue shock syndrome (Bennett et al. 2003). It has been transmitted at least three independent times from wild primates to humans in Africa and Southeast Asia (Wang et al. 2000). Bennett et al. (2003) used serial DNA sampling to study evolutionary change in recent outbreaks. One of the major findings was the discovery of marked evolutionary shifts in the viral population. Within each sampled outbreak, the virus’s haplotypes were in strong clusters distinct from the previous outbreak. Bennett et al. (2003) used a phylogenetic tree to illustrate this finding. However, phylogenetic trees do not clearly communicate genetic relationships through time when haplotypes are shared between timepoints. Reconfiguring their data as a temporal network shows how clearly their findings can be visualized using our approach. It also provides us with a better idea of the diversity missing from certain time layers. For example, the most common haplotype in 1987 was not sampled in 1982, although a closely related haplotype was. Connecting this related haplotype to 1982’s most common haplotype requires passing through the unsampled haplotype (1987’s most common), indicating that the unsampled haplotype likely existed in 1982. Another interesting finding is that certain haplotypes from all clusters were present in 1994, but only haplotypes from one cluster were still present in 1998, indicating a drastic demographic and evolutionary change between the two years.

There is a prominent limitation to the current temporal network approach we present in this paper. Many published studies comprise serial DNA data drawn across a continuous time-scale (e.g. Shapiro et al. 2004; Campos et al. 2010) rather than a discrete time-point sampling. In its current version, data displayed with TempNet need to be summarized into statistical groups (ideally using the ecology or biology of the species) before construction of the network. Other analytical methods, such as Approximate Bayesian Computation, also require binning samples into statistical groups. The same grouping can also be used for the temporal networks, which makes comparisons of the results between the two approaches more practical.
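The binning step can be sketched in base R with cut(). The sample ages and bin boundaries below are hypothetical and chosen only for illustration; in practice, the boundaries would follow from the ecology or biology of the species, as noted above.

```r
# Hypothetical sample ages in years before present.
ages <- c(150, 900, 2300, 4800, 5200, 9700, 11800)

# Hypothetical bin boundaries (an assumption, not a recommendation).
breaks <- c(0, 1000, 6000, 12000)

# cut() assigns each age to an interval; as.integer() returns the bin index.
layer <- as.integer(cut(ages, breaks = breaks, include.lowest = TRUE))

# Number of samples falling into each bin.
bin_sizes <- table(layer)

print(data.frame(age = ages, layer = layer))
print(bin_sizes)
```

The resulting layer vector assigns one group to each sequence, and the same grouping could be reused for an ABC analysis so that results from the two approaches remain comparable.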


We demonstrate that temporal networks are an attractive way to display and summarize relationships within the heterochronous data so commonly found in ancient DNA or epidemiological research. Complex evolutionary changes can be easily seen in the temporal network. These graphics may also be used to illustrate the differences between contemporaneous populations (for spatially sampled data, etc.) using a space-as-time approach. The presented R script is user-friendly and will likely be useful for illustrating complex relationships in research areas dealing with temporally sampled DNA data. Using three examples, we showed that TempNet can deal with standard-sized datasets, as well as large datasets from fast-evolving organisms (such as viruses).

Availability and requirements

Authors’ contributions

S.P. and C.N.K.A. programmed the R script and wrote the manuscript.


The authors thank Elizabeth Hadly for the use of her laboratory webspace to host the programme and thank Yvonne L. Chan for the use of her partly unpublished data in Fig. 1. This work was supported in part by NSF grant DEB-0743616 to Scott V. Edwards and Dennis Pearl. Stefan Prost is funded by the Allan Wilson Centre for Molecular Ecology and Evolution. Special thanks to M. N. Capella, L. F. Anderson, M. Knapp and K. A. Horsburgh for their editorial suggestions. We are also grateful to O. G. Pybus and two anonymous reviewers for very helpful suggestions to improve the quality of our manuscript.