Bayesian phylogenetic inference from animal mitochondrial genome arrangements

Authors


Address for correspondence: Bret Larget, Department of Statistics, University of Wisconsin–Madison, Madison, WI 53706-1685, USA.
E-mail: larget@mathcs.duq.edu

Abstract

Summary. The determination of evolutionary relationships is a fundamental problem in evolutionary biology. Genome arrangement data are potentially more informative than deoxyribonucleic acid sequence data for inferring evolutionary relationships between distantly related taxa. We describe a Bayesian framework for phylogenetic inference from mitochondrial genome arrangement data using Markov chain Monte Carlo methods. We apply the method to assess evolutionary relationships between eight animal phyla.

1. Introduction

The idea that all living organisms are related through common descent is one of the fundamental organizing principles of modern biology. Consequently, the determination of evolutionary relationships is one of the most important activities that evolutionary biologists carry out. Before the widespread availability of massive collections of deoxyribonucleic acid (DNA) sequence data freely available via computer networks and desktop computers with specialized software to analyse these data, most phylogenies, branching tree diagrams that display evolutionary relationships, were inferred by biologists on the basis of morphological data and characteristics. It is fairly common for a phylogeny that is strongly supported through an analysis of molecular data to be inconsistent with the traditional phylogeny based on morphological data. To complicate matters, biologists have developed a large number of methods for producing phylogenies from DNA sequence data, the results of which frequently conflict. Each method has its strong supporters and there is a lively debate in the biological literature arguing the relative merits of various methods of phylogenetic inference.

Very few methods for producing phylogenies from DNA sequence data have a statistical foundation that provides a framework for the assessment of uncertainty (Felsenstein, 1983). The maximum likelihood approach to phylogenetic inference (Felsenstein, 1981) is one notable exception. Swofford et al. (1996) have provided an excellent overview of many commonly used methods for phylogenetic analysis from aligned DNA sequence data. More recently, several researchers have developed Bayesian approaches to phylogenetic inference from DNA sequence data (Rannala and Yang, 1996; Yang and Rannala, 1997; Mau et al., 1999; Newton et al., 1999; Larget and Simon, 1999; Li et al., 2000). Huelsenbeck et al. (2001) address the recent effect of Bayesian methods on evolutionary biology.

There are, however, limitations to the usefulness of DNA sequence data to infer evolutionary relationships. Boore and Brown (1998) have suggested several: selection, rapid rates of evolution and ambiguities of alignment. Under selection, nucleotide substitutions at homologous sites in different lineages could have different probabilities of propagating throughout a population. If the rate of evolution is very rapid, sequences may diverge so quickly that very little phylogenetic information may remain. If a large number of small scale deletion and insertion events occur, there can be tremendous uncertainty in any attempt to align DNA sequences by homologous sites. Boore and Brown (1998) wrote that

‘… a single, completely resolved, unambiguous tree of life based on sequence comparisons seems unlikely to be realized’.

Boore and Brown (1998) went on to suggest that gene order comparisons have several advantages, and that mitochondrial genomes are especially useful in inferring phylogeny between distantly related animals.

1.1. What is mitochondrial DNA?

Mitochondria are small organelles found outside the cell nucleus in animals, plants, fungi and protists. Whereas most DNA in animals is located in chromosomes in the cell nucleus, the mitochondria contain a relatively small circular ring of DNA. Mitochondrial DNA is doubly stranded and the genes may be on either strand, although for some animals all the genes are on the same strand. Animal mitochondrial DNA has several characteristics that are highly conserved. Most animal mitochondrial genomes contain about 16 000 nucleotide bases and contain the same 37 genes: 22 for transfer ribonucleic acids (tRNAs), two for ribosomal RNAs (large and small unit ribosomal RNA, rrnL and rrnS respectively) and 13 for proteins (nicotinamide adenine dinucleotide dehydrogenase subunits 1–6 and 4L (nad1–nad6 and nad4L), cytochrome oxidase subunits I–III (cox1–cox3), adenosine triphosphate synthase subunits 6 and 8 (atp6 and atp8), and cytochrome b (cob)). There are a few known exceptions. Several nematodes and flatworms are missing the gene atp8 in the mitochondria and have only 36 genes. The brown sea-anemone and other individuals in the phylum Cnidaria have very unusual mitochondrial genomes, missing most of the tRNAs whereas some of the other genes are not contiguous. These exceptions aside, it is interesting and potentially informative that, although the gene content is highly conserved, the order in which the mitochondrial genes are arranged can vary between different animal species. Unlike in nuclear DNA, the genes are tightly packed, with only very small regions of non-coding DNA between them. All animal mitochondrial genomes contain one or more larger areas of non-coding DNA that are thought to be involved in the regulation of replication and transcription.

1.2. Why are mitochondrial genome arrangements useful for phylogenetic inference?

Boore and Brown (1998) listed several reasons why mitochondrial genome arrangement data have many advantages over other types of genetic data for phylogenetic inference among animals. These reasons include the following:

  • (a) mitochondrial gene content among all animals is nearly invariant;
  • (b) there are a very large number of possible arrangements, so animals with shared arrangements are very likely to have common ancestry;
  • (c) there is near certainty that homologous genes can be identified despite the substantial differences in the DNA sequences among homologous mitochondrial genes in different animal species;
  • (d) mitochondrial genome arrangement probably does not affect selection;
  • (e) genome rearrangements are rare, even over long periods of evolutionary time.

In the early 1990s, complete mitochondrial genome arrangements were known for only about a dozen different species. Boore and Brown (1998) listed 70 known arrangements in 1998. Helfenbein et al. (2001) reported 127 known sequences in 2001. The most recent version of the Mitochondrial Gene Arrangement Source Guide (Boore, 2001) contains the complete mitochondrial genome arrangements of the 231 different species for which this was known and published by October 31st, 2001. The rate at which new data are being collected is increasing rapidly.

1.3. What are the mechanisms of mitochondrial genome rearrangement?

Boore and Brown (1998) described several mechanisms of genome rearrangement. One mechanism is gene inversion. In a single gene inversion event, a sequence of consecutive genes is inverted which changes both the order of the genes and the strands on which the coding portions are located. Gene inversion is, perhaps, the primary mechanism by which the large non-tRNA coding genes rearrange with each other. A second mechanism for which there is evidence is a duplication–deletion sequence of events where several consecutive genes are duplicated followed by a loss of function and subsequent deletion of a randomly chosen copy from each pair which may or may not lead to a different arrangement. This type of rearrangement may occur predominantly with tRNAs—genome arrangement differences between marsupials and other mammals can be explained by one such event. Although there may be other mechanisms that move tRNAs to distant positions, these rearrangement mechanisms are not well understood.

1.4. A mathematical representation of a genome arrangement

We can mathematically represent a mitochondrial genome arrangement of n+1 genes as a signed permutation of size n. To do so, we select a reference gene and an arbitrary relabelling of the genes with the integers from 0 to n in which the reference gene is labelled 0. Beginning at the reference gene in the direction of its transcription, the gene arrangement corresponds to a signed permutation of the integers from 1 to n. The sign of an element in the permutation is positive if the gene is located on the same strand as the reference gene and negative if it is located on the other strand.

Because we do not understand or know how to model effectively all the possible mechanisms that rearrange tRNAs, for the present study we consider only the mitochondrial genome arrangements of the non-tRNA genes and we assume that gene inversion is the sole mechanism by which these genes rearrange. A gene inversion manifests as a reversal in the signed permutation. Reversals change both the order and the sign of the affected elements. For example, reversing the third to the eighth elements of the signed permutation (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) results in the signed permutation (1, 2, −8, −7, −6, −5, −4, −3, 9, 10, 11, 12, 13, 14).
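The reversal operation can be illustrated with a minimal Python sketch. The function name apply_reversal and the 1-based, inclusive positions are assumptions made for illustration and are not taken from the software used for the analysis.

```python
def apply_reversal(perm, a, b):
    """Reverse elements a..b (1-based, inclusive) of a signed permutation.

    The affected elements change both their order and their sign.
    """
    if not 1 <= a <= b <= len(perm):
        raise ValueError("need 1 <= a <= b <= len(perm)")
    flipped = [-gene for gene in reversed(perm[a - 1:b])]
    return perm[:a - 1] + flipped + perm[b:]


identity = list(range(1, 15))
reversed_3_to_8 = apply_reversal(identity, 3, 8)
print(reversed_3_to_8)
# [1, 2, -8, -7, -6, -5, -4, -3, 9, 10, 11, 12, 13, 14]
```

Applying the same reversal a second time restores the original arrangement, so every reversal is its own inverse.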

Complete mitochondrial genome arrangements are known from individuals from nine separate phyla (Boore, 2001). Within each phylum, we selected an individual for each unique arrangement of non-tRNA coding genes that included the full complement of 15 genes. The remaining data set, shown in Table 1, contains 19 species from eight phyla. 18 of the genome arrangements are unique. One arrangement is common for birds and acorn worms. Table 2 displays the inversion distance between each pair of species. The phyla in this data set are Chordata (vertebrates), Hemichordata (acorn worms), Echinodermata (sea-stars, brittle-stars, sand-dollars, sea-urchins, crinoids and sea-cucumbers), Brachiopoda (lamp-shells), Mollusca (clams, snails, squids and chitons), Annelida (segmented worms), Arthropoda (arachnids, crustaceans and insects) and Nematoda (round-worms).

Table 1.  Mitochondrial genome arrangements of non-tRNA coding genes

Phylum         Species                    Permutation†
Chordata       Human                      1→14
Chordata       Domestic chicken           1→8, 10, 9, 11→14
Hemichordata   Acorn worm                 1→8, 10, 9, 11→14
Echinodermata  Sea-star                   6, 1→5, 7→11, −12, −14→−13
Echinodermata  Sea-urchin                 6, 1→5, 7→11, 13→14, 12
Echinodermata  Crinoid                    6, 1→5, 7→10, −11, −12, −14→−13
Brachiopoda    Laqueus rubellus           10, 3, 8, −9, 5→6, 4, 2, 14, 1, 12, 11, 13, 7
Brachiopoda    Terebratalia transversa    10, 2, 4, 3, 8, 11, −9, 12→13, 7, 6, 1, 5, 14
Brachiopoda    Terebratulina retusa       1→3, 11→13, −9, 10, 6→8, 4→5, 14
Annelida       Common earthworm           1→2, 4, −9, 10, 3, 8, 6→7, 11→13, 5, 14
Arthropoda     Cattle tick                1→5, −13→−11, −8→−6, −9, 10, 14
Arthropoda     Fruit-fly                  1→5, −8→−6, −9, 10, −13→−11, 14
Arthropoda     Hermit-crab                1, 5, 14, 2→4, −8→−6, −9, 10, −13→−11
Arthropoda     Wallaby louse              4, 13, 10, −7→−6, 14, −8, 1, −5, −12→−11, −3→−2, −9
Mollusca       Squid                      1, −8→−6, 2→5, −10, 9, −13→−11, 14
Mollusca       Black chiton               1→3, −8→−6, −10, 9, −13→−11, 4→5, 14
Mollusca       Land snail                 12, −9, 8, 13, 6, 10, 1, −2, −3, −11, −5→−4, 7, 14
Mollusca       Sea-slug                   12, −9, 8, 13, 6, 10, 1, −2, −3, −11, −5, 7, −4, 14
Nematoda       Trichinella spiralis       1, 13, −14, −8→−6, −9, 10→12, 3→4, 2, 5

†Each mitochondrial genome is recorded as a permutation relative to the gene order in humans beginning after the gene cox1 in the direction of its transcription. The coding to signed permutations uses the following translation: cox2, 1; atp8, 2; atp6, 3; cox3, 4; nad3, 5; nad4L, 6; nad4, 7; nad5, 8; nad6, −9; cob, 10; rrnS, 11; rrnL, 12; nad1, 13; nad2, 14. Consecutive genes in the same order as in humans are listed as a range. For example, 1→3 means 1, 2, 3 and −8→−6 means −8, −7, −6. The tick species here is Rhipicephalus sanguineus. All other known tick species non-tRNA coding gene arrangements are identical with that in fruit-flies. The land snail is Cepaea nemoralis. The other known land snail non-tRNA coding gene arrangement is identical with that in the sea-slug.

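The range shorthand used in Table 1 can be expanded mechanically. The following Python sketch is only an illustration; the function name expand_arrangement is hypothetical, and the code assumes that every range ascends numerically, as all ranges in Table 1 do.

```python
def expand_arrangement(text):
    """Expand Table 1 shorthand such as '1→5, −8→−6' into a list of signed integers."""
    genes = []
    # Replace the typographic minus sign so that int() can parse the tokens.
    for token in text.replace("−", "-").split(","):
        token = token.strip()
        if "→" in token:
            first, last = (int(part) for part in token.split("→"))
            if first > last:
                raise ValueError(f"range {token!r} does not ascend")
            genes.extend(range(first, last + 1))
        else:
            genes.append(int(token))
    return genes


fruit_fly = expand_arrangement("1→5, −8→−6, −9, 10, −13→−11, 14")
print(fruit_fly)
# [1, 2, 3, 4, 5, -8, -7, -6, -9, 10, -13, -12, -11, 14]
```

A quick consistency check on any row is that the absolute values of the expanded list are exactly the integers from 1 to 14.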
Table 2.  Inversion distances†

    Species                    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
 1  Human                      0
 2  Chicken                    3  0
 3  Acorn worm                 3  0  0
 4  Sea-star                   5  8  8  0
 5  Sea-urchin                 6  9  9  1  0
 6  Crinoid                    6  8  8  1  2  0
 7  Laqueus rubellus          13 13 13 14 13 13  0
 8  Terebratalia transversa   12 12 12 12 12 11  9  0
 9  Terebratulina retusa       5  5  5  9  9 10 13 11  0
10  Earthworm                  7  8  8 11 11 12 13  9  5  0
11  Cattle tick                4  4  4  8  8  9 13 11  3  7  0
12  Fruit-fly                  3  4  4  7  7  8 13 11  3  7  3  0
13  Hermit-crab                7  7  7 11 11 10 14 11  5  7  5  5  0
14  Wallaby louse              8 10 10 12 12 13 13 14  9 11 10 10 10  0
15  Squid                      5  4  4  9  9 10 14 11  4  6  4  3  5 10  0
16  Black chiton               5  5  5  9  9 10 13 10  1  5  3  4  6 10  5  0
17  Land snail                12 12 12 12 13 12 13 12 12 10 11 12 13 14 12 11  0
18  Sea-slug                  12 13 13 14 14 14 13 12 12 12 11 12 13 13 12 11  3  0
19  Trichinella spiralis       7  8  8 11 11 12 11 14  8 10  5  6  6 10  8  8 13 13  0

†For each pair of taxa, the displayed count is the smallest number of gene inversions that are necessary to change the mitochondrial genome arrangement from one taxon to the other. Columns are numbered in the same order as the rows.


2. A model of genome rearrangement

We assume a very simple model of mitochondrial genome rearrangement, with gene inversion as the sole mechanism. We assume that the evolutionary relationships between the taxa in our analysis are described by a phylogeny in which each speciation event results in two lineages. We do not assume a molecular clock, so the overall rate of gene inversion may be different for different lineages. Our prior distribution is that all unrooted tree topologies are equally likely. Branches of the unrooted tree have independent lengths selected from a gamma distribution. Given a branch length, a Poisson number of gene inversions with this mean is realized. Given that a gene inversion occurs, we assume that all possible gene inversions are equally likely. There are 1×3×…×(2s−5) possible binary unrooted tree topologies (Felsenstein, 1978) to relate s taxa. In our present study for the 19 taxa that we consider, this count is over 6.3×10^18. Our inferences are based on the combination of 10 independent samples of the posterior. Further details of this model that are important to understand the details of our computation are contained in Appendix A.
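The size of the tree space can be checked directly from the product formula. The short Python sketch below is an illustration only; the function name count_unrooted_topologies is hypothetical.

```python
def count_unrooted_topologies(s):
    """Number of binary unrooted tree topologies for s labelled taxa: 1 x 3 x ... x (2s - 5)."""
    if s < 3:
        raise ValueError("the formula is used here for s >= 3 taxa")
    count = 1
    for odd in range(3, 2 * s - 4, 2):  # odd factors 3, 5, ..., 2s - 5
        count *= odd
    return count


for taxa in (4, 5, 19):
    print(taxa, count_unrooted_topologies(taxa))
```

For 19 taxa the product has the factors 1, 3, ..., 33, which gives the figure of over 6.3×10^18 quoted above.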

3. Example

The correct evolutionary relationships between several animal phyla are still unresolved. Several previous papers have used mitochondrial genome arrangement data to draw conclusions about evolutionary relationships between animal phyla that differ from previous conclusions based on shared morphological characteristics. Fig. 1 is adapted from De Rosa (2001) and shows two competing versions of the evolutionary relationships between animal phyla. The left-hand tree is a traditional viewpoint supported by shared morphological characteristics. The right-hand tree has been hypothesized more recently and is supported by molecular evidence.

Figure 1.

Two competing simplified animal phylogenies: that on the left is a traditional phylogeny based on morphological characteristics; that on the right has been proposed more recently on the basis of molecular data; the branching point that divides into three lineages (Annelida, Mollusca and Brachiopoda) indicates that the three possible binary trees relating these three taxa are unresolved

We focus our attention on two conflicting aspects of these trees. The traditional phylogeny places brachiopods closer to deuterostomes (echinoderms, hemichordates and vertebrates) than to protostomes (arthropods, annelids and molluscs) and places molluscs as an outgroup to arthropods and annelids (Hyman, 1940). In contrast, the new tree has brachiopods closer to the annelids and molluscs than to the deuterostomes and places arthropods as more distantly related than annelids and molluscs (Halanych et al., 1995; Aguinaldo et al., 1997). We use the model described above to assess these aspects in conflict between these two trees. We do so by examining posterior probabilities of predicted clades. A set of taxa form a clade in a tree if they comprise a complete subtree.
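The posterior probability of a clade is estimated by the proportion of sampled trees in which that clade appears. The Python sketch below illustrates this calculation under stated assumptions: trees are written as nested tuples of taxon names, a set of taxa is counted as a clade of an unrooted tree if either the set or its complement is the leaf set of a subtree, and the four toy trees are invented for illustration and are not drawn from our posterior sample. All function names are hypothetical.

```python
def leaf_sets(tree, found):
    """Return the leaves under `tree`, recording the leaf set of every subtree in `found`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, found) for child in tree))
    found.add(leaves)
    return leaves


def has_clade(tree, taxa):
    """True if `taxa` (or its complement, since the tree is unrooted) is a complete subtree."""
    found = set()
    all_leaves = leaf_sets(tree, found)
    taxa = frozenset(taxa)
    return taxa in found or (all_leaves - taxa) in found


def clade_probability(sampled_trees, taxa):
    """Proportion of sampled trees that contain the clade."""
    return sum(has_clade(tree, taxa) for tree in sampled_trees) / len(sampled_trees)


# Invented toy sample of four trees on taxa A-E (not data from this study).
toy_sample = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "D"), ("B", ("C", "E"))),
]
print(clade_probability(toy_sample, {"D", "E"}))  # 0.75
print(clade_probability(toy_sample, {"A", "B"}))  # 0.5
```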

3.1. Are brachiopods more closely related to deuterostomes or protostomes?

De Rosa (2001) found that a close relationship between brachiopods and protostomes is ‘most probable’, but ‘not definitely conclusive’. In our analysis, the deuterostomes appear together as a clade 35% of the time and the echinoderms appear together 100% of the time. We tend to put humans closer to echinoderms (60%) than to the domestic chicken and acorn worm (2%), but there is enough evidence in the data to change a tiny prior probability that deuterostomes are a clade substantially. However, we find no clades with a posterior probability of at least 1% that include any single brachiopod species with some subset of the deuterostomes.

In contrast, we find many clades that include the brachiopods with the annelid and several molluscs. All these taxa except for the squid appear together 87% of the time. (We have difficulty placing the squid. We place it with the fruit-fly 16% of the time, with the hermit-crab 33% of the time and with the acorn worm and chicken 10% of the time.)

None of these posterior probabilities are large, but this is because the completely uninformative prior that we place on the tree topology includes very small prior probabilities that species from within the same phylum would form a clade. This being said, we find it to be very probable that brachiopods are more closely related to protostomes than to deuterostomes, adding evidence in favour of the new phylogeny.

3.2. Are annelids and molluscs more closely related than arthropods?

Boore and Brown (2000) used mitochondrial genome arrangement data along with other evidence to conclude that molluscs and annelids are sister taxa with arthropods as an outgroup. In the part of their analysis that is based solely on gene arrangement data, they analysed a single mollusc (Katharina tunicata), two annelids (the common earthworm and another for which the non-tRNA arrangement is identical), as well as single inferred ancestral sequences for chordates and arthropods. Using the minimum break point (Sankoff and Blanchette, 1998; Blanchette et al., 1999), they found a best tree with 76 break points consistent with annelids and arthropods being most closely related. The next best tree has 80 break points.

We do find several clades with relatively high posterior probability that include our single annelid with one or more molluscs, usually with brachiopods present as well. The clade of the brachiopods, the annelids and the molluscs except for the squid appears 87% of the time. Common clades that appear at least 5% of the time that include arthropods along with the annelid invariably have brachiopods and one or more molluscs present as well. We find that the right-hand tree in Fig. 1 with a clade of brachiopods, annelids and molluscs separate from arthropods is more likely than one with annelids and arthropods as sister taxa, but that this conclusion is not as firm as the previous conclusion about the placement of brachiopods.

4. Discussion

4.1. Comparisons with other methods

Statistical methods for phylogenetic inference from genome arrangement data are in their infancy. The principle of parsimony says that the best tree is the tree that requires the smallest number of genome rearrangement events. Most papers that include phylogenetic inferences from genome arrangement data use this principle in an informal manner, drawing conclusions on the basis of shared arrangements that are evaluated by eye.

Other methods are more formal. By using the fast algorithms for computing pairwise reversal distances, it is possible to feed these genome-arrangement-based distance matrices into other methods that produce phylogenies from distance matrices to infer trees. Pevzner (2000) and the references within describe this approach. Sankoff and Blanchette (1998) described a method that estimates phylogeny by searching for arrangements at internal nodes that minimize the changes in break point distance, and Sankoff and Blanchette (1999) described a method based on invariants of frequencies of site patterns. The last two methods are not based on any mechanism of gene rearrangement. Mechanisms such as gene inversion, gene duplication–deletion and gene transposition affect the break point distance in different ways.

None of the alternative methods discussed here provide a framework for assessing uncertainty. The best tree is simply accepted as being the best. Clustering methods are prone to poor inferences because the sequence data are discarded: when two groups are joined, distances to other groups are not based on the likely gene arrangement at the ancestor of the new group. The method described in Sankoff and Blanchette (1999) uses all 37 genes but is limited to five or fewer taxa, which greatly limits its usefulness. Table 2 shows how far apart individual species from the same phylum can be. Presumably the decision on which taxon to use to represent a particular phylum could greatly affect the inference.

To the best of our knowledge, the present work is the first to make phylogenetic inferences on the basis of gene arrangements that also provides an assessment of uncertainty. Our earlier work on this problem (Simon and Larget, 2001) was limited to small artificial data sets. The computational approach described in this paper is not limited by the number of genes or taxa.

The Bayesian approach is very useful in this application, especially since the most likely tree is not very likely at all. A sample of trees drawn from the posterior distribution permits an examination of which parts of the tree are well established, and which parts are more uncertain. It also permits a calculation of probabilities of biological hypotheses, such as those above.

4.2. Directions for further work

From a modelling perspective, the first extension of this work that we would make is to include duplication–deletion and transposition as well as inversion. These additional mechanisms of rearrangement would require additional parameters for the relative speeds at which each occurs, leading to an interesting extension of this work. It would also be useful to use the tRNAs as well. A second modelling advance, to allow unequal probabilities for gene inversions of different lengths, must await further understanding of how gene inversion occurs at a molecular level to guide the development of a more realistic model.

This work may also be advanced by incorporating additional information. We could do this by jointly modelling gene arrangement processes with changes at the sequence level. We could also elicit more informative priors from experts in evolutionary biology.

Finally, we believe that advances in visualizing and summarizing samples of trees would help in this work. We should be able to infer ancestral genome arrangements, for example. This area is just beginning; there are many contributions that statisticians can make.

Acknowledgements

The authors thank the referees, whose comments led to substantial improvements in the paper, and Jeff Boore for providing several references. BL was partially supported in this work by National Science Foundation grant DEB-0075406; JBK by National Science Foundation grant DMS-9801401.

Appendices

Appendix A: Calculation details

This appendix contains the mathematical description of the model that we use, a derivation of the posterior distribution, descriptions of the Markov chain Monte Carlo proposals and a discussion of the mixing properties of the method.

A.1. A mathematical description of the model

The mathematical representation of an unrooted phylogeny for l taxa includes an unrooted tree topology τ and a vector of branch lengths β={β_i}, for i=1,…,2l−3. The unrooted tree topology is a connected acyclic graph with l labelled leaf nodes (each of which is adjacent to one other node in the tree), l−2 unlabelled internal nodes (each of which is adjacent to three other nodes) and a total of 2l−3 edges (branches). This type of tree results when the root is removed from a rooted binary tree. We let T_l represent the set of all such possible unrooted tree topologies with l leaves. Our prior is that the tree topology τ is chosen uniformly at random from T_l and that the branch lengths β are independent and identically distributed from a gamma(α,λ) distribution.

Each branch of the tree contains a list of reversals and their positions. The counts of reversals on the branches, x={x_i}, i=1,…,2l−3, are independent Poisson random variables with means equal to the respective branch lengths. The jth reversal on the ith branch, r_ij, is located a distance u_ij from the beginning node, chosen uniformly at random along the branch, and results in the reversal of the interval from elements a_ij to b_ij in the signed permutation, where 1≤a_ij≤b_ij≤n. The set M_n of possible reversals that act on permutations of size n has n(n+1)/2 elements. A reversal (a,b) ∈ M_n acts as below:

$$(\pi_1,\ldots,\pi_{a-1},\pi_a,\pi_{a+1},\ldots,\pi_b,\pi_{b+1},\ldots,\pi_n) \mapsto (\pi_1,\ldots,\pi_{a-1},-\pi_b,-\pi_{b-1},\ldots,-\pi_a,\pi_{b+1},\ldots,\pi_n) \tag{1}$$

Given (τ,x,r) and the permutation at one leaf of the tree, the remaining observable leaf permutations are determined. In fact, the permutations are determined at every point of the tree. Let D represent the observable data, an array of permutations indexed by the leaf nodes.

The prior distribution of these parameters is summarized here:

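$$\tau \sim \mathrm{uniform}(T_l) \tag{2}$$
$$\beta_i \mid \tau \sim \mathrm{gamma}(\alpha,\lambda), \quad \text{independently for } i=1,\ldots,2l-3 \tag{3}$$
$$x_i \mid \beta_i \sim \mathrm{Poisson}(\beta_i), \quad \text{independently for } i=1,\ldots,2l-3 \tag{4}$$
$$r_{ij} \mid x_i \sim \mathrm{uniform}(M_n), \quad \text{independently for } j=1,\ldots,x_i \tag{5}$$
$$u_{ij} \mid x_i,\beta_i \sim \mathrm{uniform}(0,\beta_i), \quad \text{independently for } j=1,\ldots,x_i \tag{6}$$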
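The prior on a single branch can be illustrated by simulation. The Python sketch below is not the code used for the analysis and rests on several assumptions: λ is treated as a rate parameter (so the scale passed to the standard library is 1/λ), the values of α and λ in the usage line are hypothetical, and the Poisson count with uniform positions is generated through the equivalent construction of a unit-rate Poisson process along the branch. The function name simulate_branch is hypothetical.

```python
import random


def simulate_branch(perm, alpha, lam, rng):
    """Draw one branch from the prior and apply its reversals to `perm`.

    Returns the branch length, the list of (position, a, b) events and the
    signed permutation at the far end of the branch.
    """
    n = len(perm)
    length = rng.gammavariate(alpha, 1.0 / lam)  # assumption: lam is a rate
    events = []
    position = rng.expovariate(1.0)  # unit-rate process: count is Poisson(length)
    while position < length:
        # Two distinct cut points among 0..n give a uniform draw from the
        # n(n+1)/2 intervals with 1 <= a <= b <= n.
        low, high = sorted(rng.sample(range(n + 1), 2))
        a, b = low + 1, high
        perm = perm[:a - 1] + [-g for g in reversed(perm[a - 1:b])] + perm[b:]
        events.append((position, a, b))
        position += rng.expovariate(1.0)
    return length, events, perm


rng = random.Random(2002)
length, events, far_end = simulate_branch(list(range(1, 15)), alpha=2.0, lam=1.0, rng=rng)
print(round(length, 3), len(events), far_end)
```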

The joint prior on these parameters is

image(7)

where

image(8)

and

image(9)

The likelihood for D is an indicator of whether the observed data are consistent with the parameters and unobservable variables, p(D|τ,β,x,r,u)=1{(τ,x,r)↪D}.

A.2. Derivation of the posterior distribution

We are primarily interested in evaluating the posterior distribution of the tree topology, p(τ|D). We begin by expressing the unnormalized joint posterior distribution of all the parameters:

image(10)
image(11)

To simplify this, we integrate out β and u analytically, suppressing most of the derivation. The remaining parameters are the tree topology and the ordered list of reversals on each branch:

image(12)
image(13)

Finally, we ignore some factors that do not depend on τ, x or r:

image(14)

We sample from this unnormalized posterior p(τ,x,r|D) using Markov chain Monte Carlo sampling (Metropolis et al., 1953; Hastings, 1970) to calculate p(τ|D).

A.3. Markov chain Monte Carlo updates

We cycle through three updates, the first two of which leave the tree topology unchanged but modify the reversal histories, whereas the third changes the tree topology and modifies the reversal histories to remain consistent. Updates 1 and 3 are displayed in Fig. 2 which also defines the nomenclature used in the following description.

Figure 2.

(a) Graph related to update 1 (it is a subtree and node O has been randomly chosen; the other nodes may be either leaf nodes or internal nodes; •, reversals) and (b) graph used in the explanation of update 3 (update 2 is not pictured because it is trivial)

  • (a) Update 1 begins by randomly picking an internal node of the tree and then randomly assigning labels to the three edges. If there are r reversals on the path from node A to node B, there are r+1 ways to partition these reversals on edges 1 and 2 without changing their relative order. One of these partitions is chosen at random, which may change the induced signed permutation at node O. Then, any reversals on edge 3 are deleted and a new sequence of reversals is generated from node O to node C in the manner described below.
  • (b) Update 2 begins by randomly picking any edge from the tree. The reversals on that edge are deleted and a new sequence of random reversals is generated for the edge in a randomly chosen direction as described below.
  • (c) Update 3 begins by choosing an internal branch (edge 3) uniformly at random. Each adjacent node then picks at random one of its other edges (edge 1 and edge 4). These edges and any subtree extending beyond nodes A and C are then swapped, resulting in a new tree topology. The reversals on edges 2, 3 and 5 remain the same. Reversals on the two swapped edges are deleted. New reversal sequences are generated for edge 4 from the signed permutation at node E to node C and for edge 1 from the signed permutation at node F to node A.
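The repartitioning step of update 1 can be sketched as follows. This is an illustration only: the function names are hypothetical, reversals are stored as (a, b) pairs with 1-based inclusive positions, and the partition is assumed to be chosen uniformly from the r+1 possibilities.

```python
import random


def apply_reversal(perm, a, b):
    """Reverse and negate elements a..b (1-based, inclusive)."""
    return perm[:a - 1] + [-g for g in reversed(perm[a - 1:b])] + perm[b:]


def repartition_path(perm_at_a, path_reversals, rng):
    """Split the ordered reversals on the path A -> O -> B at a random point.

    Returns the reversals assigned to edge 1 (A to O), those assigned to
    edge 2 (O to B) and the signed permutation induced at node O.
    """
    split = rng.randint(0, len(path_reversals))  # r + 1 equally likely partitions
    edge1, edge2 = path_reversals[:split], path_reversals[split:]
    perm_at_o = perm_at_a
    for a, b in edge1:
        perm_at_o = apply_reversal(perm_at_o, a, b)
    return edge1, edge2, perm_at_o


rng = random.Random(1)
path = [(1, 3), (2, 5), (4, 4)]
edge1, edge2, perm_at_o = repartition_path([1, 2, 3, 4, 5, 6], path, rng)
print(edge1, edge2, perm_at_o)
```

Whatever partition is chosen, the permutations at nodes A and B are unchanged; only the permutation at node O, and hence the target for the new reversals on edge 3, can differ.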

A.4. The break point graph

Our mechanism for proposing a sequence of reversals that change the source signed permutation s to the target t uses the break point graph. We note that a set of reversals that acts on a signed permutation s to produce t will also act on st^{−1} to produce the identity permutation, so, without loss of generality, we can consider the problem of finding a sequence of reversals to sort a signed permutation. See Pevzner (2000) and Kaplan et al. (1999) and references therein for a more detailed description of the break point graph and an algorithm to find a single minimal sequence of reversals to sort a signed permutation. We wish to be able to propose any sequence that sorts the signed permutation with minimal sequences being relatively likely.
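The reduction to a sorting problem amounts to relabelling the genes so that the target becomes the identity. The Python sketch below illustrates this under the assumption that the relative permutation is obtained by such a sign-respecting relabelling; the function name relative_permutation is hypothetical.

```python
def relative_permutation(s, t):
    """Relabel s so that t becomes the identity permutation.

    Gene t[k] is relabelled k + 1 and its negative is relabelled -(k + 1).
    Reversals act on positions, so any sequence of reversals that sorts the
    result also changes s into t.
    """
    label = {}
    for position, gene in enumerate(t, start=1):
        label[gene] = position
        label[-gene] = -position
    return [label[gene] for gene in s]


source = [3, -1, 2]
target = [-1, 3, 2]
print(relative_permutation(source, target))
# [2, 1, 3]
```

In this small example the reversals (1,2), (1,1) and (2,2) sort the relabelled permutation (2, 1, 3), and the same three reversals change the source (3, −1, 2) into the target (−1, 3, 2).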

Fig. 3 shows an example of a break point graph. We find it useful to represent a break point graph as a circle. Pevzner (2000) drew the break point graph with the nodes along a line. The definitions below are equivalent to those in Pevzner (2000) but are rephrased with the intention to add clarity.

Figure 3.

Example break point graph used to help to explain the method for proposing reversals

The outer circle of numbers is a signed permutation where an element 0 has been added to connect the beginning to the end of the permutation. This mirrors a mitochondrial genome arrangement where 0 represents the reference gene. Fig. 3 represents the arrangement of crinoids relative to humans. The inner circle of numbers is an unsigned circular permutation determined by the outer signed permutation. The element 0 is represented by 2n+1,0 where there are n+1 elements in the outer circle. For the rest, the label i corresponds to 2i−1, 2i if it has a positive sign and corresponds to 2|i|,2|i|−1 if it has a negative sign. Each element of the inner circle is a node in the break point graph. In the example, the inner circle is an unsigned permutation of size 30 (of the integers from 0 to 29), twice as large as the number of genes that we consider.

Break points are represented by black edges along the inner circle that connect adjacent nodes that are out of sequence, and so differ in absolute value by more than 1. The example has seven break points. The grey edges in the interior of the break point graph connect nodes with even values to nodes with values 1 larger when these nodes are not adjacent. The lines are oriented (full) when they are separated along the inner circle by an even number of positions and are unoriented (broken) otherwise. There are always equal numbers of black and grey edges. All nodes are either isolated or part of a cycle of alternating black and grey edges.
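The construction of the break point graph can be made concrete with a short Python sketch applied to the crinoid arrangement from Table 1. This is an illustration written from the definitions above and is not the implementation used for the analysis; the function name breakpoint_graph is hypothetical. It does not identify connected components, hurdles or fortresses.

```python
def breakpoint_graph(perm):
    """Return (break point count, cycles, oriented grey edges) for a signed permutation."""
    n = len(perm)
    nodes = [2 * n + 1, 0]  # element 0 is represented by 2n+1, 0
    for gene in perm:
        if gene > 0:
            nodes += [2 * gene - 1, 2 * gene]
        else:
            nodes += [-2 * gene, -2 * gene - 1]
    size = len(nodes)
    position = {node: i for i, node in enumerate(nodes)}

    # Black edges join neighbouring nodes of different elements that are out of sequence.
    black = {}
    for i in range(1, size, 2):
        x, y = nodes[i], nodes[(i + 1) % size]
        if abs(x - y) != 1:
            black[x], black[y] = y, x

    # Grey edges join 2k and 2k+1 whenever they are not adjacent on the circle;
    # a grey edge is oriented if its ends are an even number of positions apart.
    grey, oriented = {}, set()
    for even in range(0, size, 2):
        if even in black:
            grey[even], grey[even + 1] = even + 1, even
            if (position[even] - position[even + 1]) % 2 == 0:
                oriented.add((even, even + 1))

    # Follow alternating black and grey edges to collect the cycles.
    cycles, seen = [], set()
    for start in black:
        if start in seen:
            continue
        cycle, node = [], start
        while node not in seen:
            partner = black[node]
            seen.update((node, partner))
            cycle.append((node, partner))
            node = grey[partner]
        cycles.append(cycle)
    return len(black) // 2, cycles, oriented


crinoid = [6, 1, 2, 3, 4, 5, 7, 8, 9, 10, -11, -12, -14, -13]
b, cycles, oriented = breakpoint_graph(crinoid)
print(b, len(cycles), sorted(oriented))
# 7 2 [(20, 21), (28, 29)]
```

For the crinoid arrangement the sketch finds seven black edges, two cycles (one with three black edges and one with four) and two oriented grey edges, both in the four-edge cycle.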

Two cycles are connected if a grey edge from one crosses a grey edge from the other. The cycles of the break point graph are partitioned into connected components. A connected component is unoriented if all edges of all its cycles are unoriented and is oriented otherwise. Each cycle must have an even number of oriented edges.

A hurdle is an unoriented connected component that does not separate two other unoriented connected components along the inner circle. A break point graph contains a fortress if it has a special configuration of hurdles and other unoriented components that only arises in larger permutations than those in the present study. In the example, there are two connected components, one of which is unoriented. Each component consists of a single cycle. The cycle on the right is

0:11 -u- 10:13 -u- 12:1 -u- 0   (15)

and the cycle on the left is

20:22 -u- 23:28 -o- 29:25 -u- 24:21 -o- 20   (16)

where black edges are represented by colons and oriented and unoriented grey edges are represented by the symbols -o- and -u- respectively. The connected component on the right is a hurdle, whereas that on the left is not.

The minimal number of reversals to sort a signed permutation is a function of the number of break points b, the number of cycles c, the number of hurdles h and an indicator of a fortress f:

d = b − c + h + f,

where d denotes the minimal number of reversals.

The example could be sorted by 7−2+1+0=6 reversals.

A reversal is called proper if it reduces b−c by 1. However, not all proper reversals reduce the distance by 1, because a proper reversal could introduce a hurdle (or a fortress). A reversal will be proper if its end points are two break points on the same cycle and these two break points divide the oriented grey edges of the cycle so that there is an odd number in each semicycle. In the example, there are no proper reversals that act on the right cycle because any two break points divide the cycle into two semicycles with 0 oriented edges and 0 is not odd. In the left cycle, there are six ways to choose two of the four break points. Of these six reversals, the four proper reversals change: −11 to 11; −12 to 12; −14,−13 to 13,14; −11,−12,−14,−13 to 13,14,12,11. The first three of these proper moves actually decrease the distance by 1. The last adds a cycle but also adds a hurdle because the remaining grey edges all become unoriented, so the distance remains unchanged.
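The effect of a reversal on a signed permutation, and the number of distinct reversals available, can be illustrated with a short Python sketch. The function names `apply_reversal` and `all_reversals` are hypothetical, and a reversal is identified here by a pair of cut points (i, j), which is an assumed representation. Only the four-element segment quoted in the text is used; the full arrangement is not reproduced.

```python
from itertools import combinations


def apply_reversal(signed_perm, i, j):
    """Reverse the segment signed_perm[i:j] and flip the sign of each element."""
    segment = [-g for g in reversed(signed_perm[i:j])]
    return signed_perm[:i] + segment + signed_perm[j:]


def all_reversals(n):
    """All cut-point pairs (i, j) with 0 <= i < j <= n."""
    return list(combinations(range(n + 1), 2))


segment = [-11, -12, -14, -13]
print(apply_reversal(segment, 2, 4))  # [-11, -12, 13, 14]
print(apply_reversal(segment, 0, 4))  # [13, 14, 12, 11]
print(len(all_reversals(14)))         # 105
```

The count of 105 for 14 non-reference elements agrees with the total number of possible reversals in the present study.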

A.5. Proposing a sequence of reversals

Our basic approach is to add reversals iteratively until we have a list that changes the source to the target and we decide to stop. If the signed permutation s_{t−1} is the identity permutation, we stop and end the sequence of reversals with probability q=0.99. Otherwise (with probability 1−q), we propose a random reversal from all possible reversals. When we do not have the identity, we use the break point graph of the signed permutation s_{t−1} to partition the set of all possible reversals into three groups: proper reversals, improper reversals between break points in the same cycle and others. If there is at least one proper reversal, we choose one uniformly at random with probability p=0.99. If there are no proper reversals or we have decided not to select one and there is at least one improper reversal within a cycle, we choose one of these with probability p. If we have not yet selected a reversal, we choose one of the others at random. We then iterate, adding another reversal to the sequence at each step, until we stop. Table 3 shows the probabilities that a reversal of a given type is the next one proposed. These probabilities are used in calculating acceptance ratios for the updates.

Table 3.  Reversal proposal probabilities†

                     Probabilities for the following types of reversal:
Case                 Proper    Improper within a cycle    Other
a>0, b>0, c>0        p/a       (1−p)p/b                   (1−p)²/c
a>0, b>0, c=0        p/a       (1−p)/b                    -
a>0, b=0, c>0        p/a       -                          (1−p)/c
a>0, b=0, c=0        1/a       -                          -
a=0, b>0, c>0        -         p/b                        (1−p)/c
a=0, b>0, c=0        -         1/b                        -
a=0, b=0, c>0        -         -                          1/c

†a is the number of proper reversals, b is the number of improper reversals between break points of the same cycle and c is the number of others. The sum of these three is |M_n|=105 in the present study. The parameter p is set to be 0.99. The expression in each cell is the probability that a specific reversal of the given type is proposed.
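The entries of Table 3 follow from the sequential choice described in Section A.5, and a minimal Python sketch can reproduce them. The function names `group_weights` and `proposal_probability` and the string labels for the three types are hypothetical. The counts a=4, b=5 and c=96 used below are inferred from the worked example in Section A.4 (four proper reversals, two improper ones in the left cycle and three in the right) and should be read as an assumption.

```python
def group_weights(a, b, c, p=0.99):
    """Probability that each group (proper, improper, other) supplies the reversal."""
    # Proper reversals are tried first; they are chosen with probability p,
    # or with certainty when no other type exists.
    w_proper = 0.0 if a == 0 else (p if b + c > 0 else 1.0)
    rest = 1.0 - w_proper
    # Improper reversals within a cycle are tried next in the same way.
    w_improper = 0.0 if b == 0 else rest * (p if c > 0 else 1.0)
    # Whatever probability remains goes to the other reversals.
    w_other = rest - w_improper if c > 0 else 0.0
    return w_proper, w_improper, w_other


def proposal_probability(kind, a, b, c, p=0.99):
    """Probability that one specific reversal of type `kind` is proposed."""
    counts = {"proper": a, "improper": b, "other": c}
    weights = dict(zip(counts, group_weights(a, b, c, p)))
    if counts[kind] == 0:
        raise ValueError("no reversal of type %r is available" % kind)
    return weights[kind] / counts[kind]


a, b, c = 4, 5, 96  # inferred counts, see the assumption stated above
total = sum(n * proposal_probability(k, a, b, c)
            for k, n in (("proper", a), ("improper", b), ("other", c)))
print(proposal_probability("proper", a, b, c))  # 0.2475
print(abs(total - 1.0) < 1e-12)                 # True
```

Within each group every reversal is equally likely, so the probability of a specific reversal is the group weight divided by the group size, which is the form that enters the acceptance ratios.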

A.6. Computation details

We completed 10 separate runs of 100 million cycles through updates 1–3. The runs used different streams of pseudorandom numbers and began at different trees. In each run, we sampled every 500th tree topology, retaining 200 000 tree topologies. A single run required about 9 h of central processing unit time on a machine with a 933 MHz Pentium III processor. We set the parameters α=0.5 and λ=0.25 so that our prior had a mean of two gene inversions per branch with sufficient variance that branches with 10 or more gene inversions were not too unlikely.
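The figures in this paragraph can be checked with a few lines of Python. The variable names are hypothetical, and the prior mean is computed as α/λ, which assumes that α and λ act as a shape and a rate parameter; this reading is consistent with the stated mean of two but is an assumption of the sketch.

```python
cycles_per_run = 100_000_000
thinning = 500

# Every 500th tree topology is kept.
samples_per_run = cycles_per_run // thinning

# Assumed reading: alpha is a shape and lam a rate, so the mean is alpha / lam.
alpha, lam = 0.5, 0.25
prior_mean = alpha / lam

print(samples_per_run)  # 200000
print(prior_mean)       # 2.0
```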

Trace plots of the log-likelihoods indicated that the burn-in was rapid in all runs. We discarded the initial 25% of each run and retained a combined total of 1.5 million tree topologies. Clade frequencies for most clades that appear relatively frequently were similar from run to run. Almost all clades had estimated Monte Carlo standard errors that were substantially less than 1%. There were a few exceptions. We expect that proposed changes in the tree topology that include branches with longer reversal lists mix more slowly. Estimates of clade probabilities for clades that include taxa that are a long distance from others had larger Monte Carlo errors.
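One simple way to summarize agreement between runs is sketched below in Python. It is an assumption of this sketch, not a description of how the standard errors above were computed, that the Monte Carlo standard error of a clade probability is estimated from the spread of the per-run clade frequencies. The function name `between_run_summary` is hypothetical and the ten frequencies are made-up values for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def between_run_summary(run_frequencies):
    """Pooled estimate and between-run standard error of a clade frequency."""
    k = len(run_frequencies)
    return mean(run_frequencies), stdev(run_frequencies) / sqrt(k)


# Sample sizes after discarding the first 25% of each of the 10 runs.
retained_per_run = int(200_000 * (1 - 0.25))
retained_total = 10 * retained_per_run
print(retained_total)  # 1500000

# Made-up per-run frequencies of a single clade (illustration only).
freqs = [0.81, 0.80, 0.82, 0.81, 0.79, 0.80, 0.81, 0.82, 0.80, 0.81]
estimate, mcse = between_run_summary(freqs)
print(round(estimate, 3), round(mcse, 3))  # 0.807 0.003
```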
