• molecular clock;
  • Bayes factors;
  • calibration hypotheses;
  • Bayes Factor Cluster Analysis;
  • substitution rate


  1. Top of page
  2. Summary
  3. Introduction
  4. Materials and methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References
  9. Supporting Information
  1. Molecular clock calibration is a crucial step for placing phylogenetic trees in the temporal framework required to test evolutionary hypotheses and estimate evolutionary rates. In general, most authors agree that the best approach is to incorporate multiple calibrations to avoid the risk of bias associated with a single dating source. However, the indiscriminate inclusion of as many calibration points as possible can lead to tree shape distortion and an overestimation of the variation in evolutionary rates among branches due to errors in the geological, paleontological or paleogeographic information used for dating.
  2. We present a test of congruence among calibration hypotheses to assist their filtering prior to molecular clock analysis, which we have called Bayes Factor Cluster Analysis (BFCA). This is a heuristic method based on the pairwise comparison of calibration hypotheses by Bayes factors, which allows sets of congruent calibrations to be identified.
  3. We have tested BFCA through simulation using beast and mcmctree programs and analysed a real case of multiple calibration hypotheses to date the evolution of the genus Carabus (Coleoptera: Carabidae).
  4. The analyses of simulated data showed the predictability of change in Bayes factors when comparing alternative calibration hypotheses on a particular tree topology, and thus the suitability of BFCA in identifying unreliable calibrations, especially in cases with limited variation in evolutionary rates among branches. The exclusion of inconsistent calibrations as identified by BFCA produced significant changes in the estimation of divergence times and evolutionary rates in the genus Carabus, illustrating the importance of filtering calibrations before analyses.
  5. The method has been implemented in an open-source R package called bfca to simplify its application.


Introduction

The molecular clock hypothesis states that molecular changes occur at an approximately constant rate over time, and therefore, molecular divergence between species is proportional to time from their evolutionary separation (Zuckerkandl & Pauling 1965). In phylogenetics, this rate of change is typically inferred by extrapolating from the known age of a particular node, in a process known as calibration. Nonetheless, heterogeneity in rates of molecular evolution is common among genes (Wolfe, Sharp & Li 1989; Aguileta, Bielawski & Yang 2006; Pons et al. 2010) and species (Bousquet et al. 1992; Thomas et al. 2006; Bromham 2009; but see criticism by Schwartz & Mueller 2010). Thus, the need to accommodate such rate heterogeneity has stimulated the development of a wealth of relaxed clock methods, including maximum-likelihood smoothing (Sanderson 1997, 2003) and Bayesian estimates based on calibration priors (Thorne & Kishino 2002; Yang & Rannala 2006; Drummond & Rambaut 2007). Among these, the latter are perceived as superior for their ability to realistically account for uncertainties in calibrations by using statistical distributions with soft bounds (Inoue, Donoghue & Yang 2010). However, besides biological sources of heterogeneity in molecular evolutionary rates and regardless of the approach employed, the application of the molecular clock can still be flawed because of methodological hindrance in accurate branch length estimation (Hedges & Kumar 2003) and, especially, of the unreliability of selected calibration ages and their correct application to the appropriate cladogenetic events (Smith & Peterson 2002; Graur & Martin 2004; Near, Meylan & Shaffer 2005; Ho & Phillips 2009).

Calibration of phylogenetic trees with paleontological or geological data is a process subject to multiple difficulties and potential sources of error. These include incompleteness of the fossil record, taxonomic misidentification, dating errors either of fossil strata or specific geologic events, suitability of specific biogeographic hypotheses, incomplete taxon sampling or even the assignment of calibrations to the appropriate nodes in the phylogeny (Benton & Ayala 2003; Bromham & Penny 2003; Near & Sanderson 2004; Heads 2005, 2010; Ho & Phillips 2009). The concurrence of so many sources of uncertainty has favoured the idea of simultaneously using as many calibration points as possible to overcome the potential biases associated with individual calibrations (Smith & Peterson 2002; Soltis et al. 2002; Conroy & van Tuinen 2003; Graur & Martin 2004). However, these approaches produce compromise solutions that may artificially distort the shape of the tree (branch lengths and topology), thus calling for methods to evaluate the incongruence of calibrations.

There is indeed a growing interest in the development of procedures to assess congruence of calibration hypotheses when multiple calibrations are available. An example of a simple approach, which requires clock-like behaviour of the data, is the regression of calibration hypotheses to a linear equation whose slope estimates the substitution rate, with outliers removed a posteriori (e.g., Gómez-Zurita 2004). Another methodology, which allows for relaxed clocks, is the cross-validation method described by Near, Meylan & Shaffer (2005), extended by Noonan & Chippindale (2006), Burbrink & Lawson (2007) and Clarke, Warnock & Donoghue (2011). This method, mostly applied to fossil calibrations, has been criticized for producing biased results due to the taphonomic bias in the fossil record (i.e. the increasing probability of fossil preservation towards the present; Marshall 2008; Dornburg et al. 2011). To deal with this problem, Marshall (2008) proposed a novel way to assess potential calibrations by taking an inverse approach to the cross-validation, based on the selection of a single fossil providing the oldest evolutionary time-scale. Marshall's method relies on calculations of an empirical scaling factor for each fossil using relative branch lengths of a given ultrametric, non-calibrated tree. Recently, Dornburg et al. (2011) proposed a Bayesian extension of this method by calculating distributions of scaling factors over the credible interval of branch length and topological estimates, and by assessing the overlap of the 95% highest posterior density (HPD) intervals of available potential calibrations to the one selected by the scaling factors of Marshall (2008). Finally, other methods, like those proposed by Sanders & Lee (2007) or Pyron (2010), evaluate the accuracy of a calibration relative to others assumed as reliable.

In this study, we present a novel approach, which we have called Bayes Factor Cluster Analysis (BFCA), to objectively select a group of congruent calibration hypotheses from a set of potential calibration scenarios. This methodology relies on the use of Bayes Factor (BF) comparisons (Kass & Raftery 1995; Suchard, Weiss & Sinsheimer 2001) and allows for the evaluation of both paleontological and geological calibration hypotheses introduced in Bayesian phylogenetic analyses in the form of probability density functions, as implemented in beast (Drummond & Rambaut 2007) and mcmctree (Yang & Rannala 2006). We have explored the effect of incongruence among node age calibrations on BF values and evaluated the performance of the method using simulated data. Finally, we have applied BFCA to investigate several alternative calibration scenarios to date a phylogenetic tree of the beetle genus Carabus inferred from the widely used nd5 gene, providing a real example of the effect of incorporating inappropriate calibrations in the estimation of divergence times and evolutionary rates. The method has been implemented in the open-source R package bfca. We have also developed a set of tools in the form of Perl scripts to assist in the generation of input files required for the multiple runs of beast and mcmctree, as well as the posterior processing of log files. This software is freely available from and

Materials and methods


Bayes Factor Cluster Analysis Procedure

Probabilistic phylogenetic methods aim at finding the tree topology that maximizes a likelihood function for a particular data set under an evolutionary model (Felsenstein 1981). Choices affecting this model of evolution and topological constraints have an effect on likelihood estimates (Sullivan & Joyce 2005). This is the basis of different phylogenetic tests routinely conducted to compare competing hypotheses, such as the likelihood ratio test or BFs (Huelsenbeck, Hillis & Nielsen 1996; Suchard, Weiss & Sinsheimer 2001). The latter are defined in a Bayesian framework as the ratio of marginal likelihoods (i.e. the likelihood of the data under a particular model after integrating across all possible parameter values) from two competing hypotheses and have been interpreted as the relative success of each hypothesis at predicting the data (Kass & Raftery 1995; Brown & Lemmon 2007).

In principle, if several calibrations are applied simultaneously in an analysis, the more reciprocally inconsistent they are among each other and, critically, with the underlying genetic variation accounting for branch lengths, the higher will be their effect on the optimal tree shape, and consequently on the tree likelihood. Thus, BF comparisons could be potentially applied to all possible combinations of available age calibration hypotheses to investigate their mutual consistency and their fit with the data. Such an exhaustive procedure would allow identifying the most inclusive combination of calibrations resulting in differences in marginal likelihood within an acceptance threshold defined a priori by the researcher, typically with a value of 2lnBF > 2 (Kass & Raftery 1995). However, the number of combinations increases exponentially with the number of calibration hypotheses, and this procedure quickly becomes computationally intractable. To overcome this limitation, we propose a method based on analyses of pairwise calibrations and the posterior selection of the largest subset of calibrations without positive evidence of incongruence affecting any of its paired elements. The method involves four well-defined steps:
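The scaling argument can be made concrete with a short Python sketch that compares the number of calibration subsets an exhaustive search would have to score with the number of pairwise analyses. The function names are hypothetical and introduced only for illustration, and counting only subsets with at least two calibrations is an assumption about what "all possible combinations" covers.

```python
from math import comb


def exhaustive_combinations(n: int) -> int:
    """Number of subsets that contain at least two of n calibrations."""
    return 2 ** n - n - 1


def pairwise_combinations(n: int) -> int:
    """Number of unordered pairs that can be formed from n calibrations."""
    return comb(n, 2)


# Compare both counts for a few numbers of calibration hypotheses.
for n in (5, 10, 16):
    print(n, exhaustive_combinations(n), pairwise_combinations(n))
```

With 16 hypotheses, as in the Carabus example, the pairwise design needs at most 120 analyses, whereas the exhaustive design would need tens of thousands.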

  1. Estimation of rate variation. The model of rate variation applying to the data – strict clock (SC) or uncorrelated log-normal relaxed clock (ULN) – is selected previously based on BFs from analyses without calibration information. If a ULN clock model is favoured, the standard deviation of the log-normal distribution of branch rates (ULNSD; ucld.stdev in beast; sigma2_gamma in mcmctree) is fixed in all subsequent pairwise analyses to a value estimated previously using a single arbitrary calibration for the root (in our case, a normal distribution with mean = 100, SD = 0·1). This value, based exclusively on sequence information, provides a limit to rate variation, which may otherwise compensate potential discordance among calibration priors by introducing extra variation of among-branch rates.
  2. Pairwise analyses of calibration points. All possible pairwise combinations from a collection of available calibration hypotheses are successively used in independent calibration Bayesian analyses.
  3. Estimation of Bayes factors. Marginal likelihoods for each one of the previous analyses are estimated. Here, we used the stabilized harmonic mean estimator described by Newton and Raftery (1994), with the modifications proposed by Suchard, Weiss and Sinsheimer (2001) as implemented in tracer 1.5 (Rambaut & Drummond 2007). Bayes factors for each pairwise combination of calibrations are subsequently obtained as the difference in marginal likelihood (lnBF) between the calibration pair showing the best score and each of the remaining calibration pairs.
  4. Clustering and selection of hypotheses. Bayes factors indicating positive evidence in favour of one pairwise calibration hypothesis over another are used to identify which calibrations are incongruent. Only combinations of calibrations without any such conflicting pairs are considered, and the most inclusive subset of mutually concordant hypotheses is selected for calibration analyses. Here, we considered a 2lnBF > 2 as an initial reference threshold for positive evidence as proposed by Kass and Raftery (1995), although it can be modified in bfca for flexibility in the analyses. We explored the effect of varying this threshold with simulated data (see below).
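
Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the bfca package: the function name, the input format (a dictionary of ln marginal likelihoods keyed by calibration pair) and the example values are all hypothetical, and pairs that were never analysed are treated as incongruent. The subset search is exhaustive, which is only practical for small numbers of calibrations.

```python
from itertools import combinations


def bfca_select(calibrations, log_ml, threshold=1.0):
    """Return the largest calibration subsets free of incongruent pairs.

    calibrations: list of calibration names.
    log_ml: dict mapping frozenset({a, b}) to the ln marginal likelihood
            of the analysis calibrated with the pair (a, b).
    threshold: lnBF above which a pair counts as incongruent
               (lnBF > 1 is equivalent to 2lnBF > 2).
    """
    best = max(log_ml.values())
    # lnBF of every pair relative to the best-scoring pair (step 3).
    ln_bf = {pair: best - ml for pair, ml in log_ml.items()}
    # Pairs without positive evidence against them are kept as congruent.
    # Assumption: a pair missing from log_ml is treated as incongruent.
    congruent = {pair for pair, bf in ln_bf.items() if bf <= threshold}

    # Step 4: search from the largest subset size downwards.
    for size in range(len(calibrations), 1, -1):
        hits = [set(subset) for subset in combinations(calibrations, size)
                if all(frozenset(p) in congruent
                       for p in combinations(subset, 2))]
        if hits:
            return hits
    return []


# Hypothetical example: A, B and C agree with each other; D and E do not.
names = ["A", "B", "C", "D", "E"]
example_ml = {frozenset(p): -1005.0 for p in combinations(names, 2)}
example_ml[frozenset(("A", "B"))] = -1000.0
example_ml[frozenset(("A", "C"))] = -1000.3
example_ml[frozenset(("B", "C"))] = -1000.5
example_ml[frozenset(("D", "E"))] = -1000.2
print(bfca_select(names, example_ml))
```

Returning a list allows for ties, since more than one subset of the same maximal size may be free of conflicting pairs.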


We used rateevolver 1.1 (Ho 2005) to generate an ultrametric tree of nine taxa spanning 20 time units (Fig. 1a; from Drummond et al. 2006), and from this tree, we simulated a 1000 nucleotide-long alignment under a GTR model using seq-gen 1.3.2 (Rambaut & Grassly 1997). We used these simulated ultrametric data (i) to demonstrate our fundamental assumption that congruence of the tree calibration constraints quantitatively affects marginal-likelihood scores and (ii) to evaluate approximately the minimum degree of incongruence required between two calibration hypotheses to reflect positive evidence using BF criteria (i.e. 2lnBF > 2). Bayesian calibration analyses with varying calibration age densities were carried out both with beast 1.5.4 and mcmctree under three different clock relaxation scenarios: strict clock, ULN allowing moderate clock relaxation (ULNSD fixed to 0·3) and ULN allowing high clock relaxation (ULNSD fixed to 0·6). We kept constant the age distribution of one node (A, E or G; see Fig. 1a) and varied the calibration density of another node, which was contemporaneous, younger or older than the corresponding constant-calibrated node. Depending on the statistical distribution used for calibration, the mean or the offset assigned to the variable prior distribution varied across a range of 23 values around the real age of the node (Tables S1–S4, Supporting information). Calibration age density distributions for both constant and varying nodes were implemented in three different fashions – including normal, exponential and log-normal distributions in beast, and normal, narrow and wide skew-normal distributions in mcmctree – in order to explore their effect on the analyses. beast analyses consisted of two MCMC runs of 30 million generations, sampling every 1000 generations and were conducted using a GTR substitution model, a Yule tree prior, and constraining the ucld.stdev parameter to 0·3 or 0·6 for tests using relaxed clocks.
mcmctree analyses included two MCMC runs of 100 000 generations sampling every ten generations and were carried out using an HKY substitution model – the most complex model allowed by the software for exact likelihood calculation – a Yule tree prior and constraining the sigma2_gamma parameter to 0·09 or 0·36 for the ULN clock cases (other details on priors in Tables S5 and S6, Supporting information). We discarded the initial 10% of samples as burn-in, and checked that effective sampling size for the posterior likelihood was always above 200, to ensure that analyses ran long enough to obtain stable estimations. Additionally, we compared the results of the two independent MCMC runs for each analysis, obtaining differences in marginal likelihood below 0·5 in 99·9% of cases (Fig. S1, Supporting information). For each data set, marginal likelihoods were computed, and lnBF estimated relative to the best marginal-likelihood value obtained for any of the 23 ages assigned to the variable node. LnBF values were plotted against variable node age to confirm their increase with escalation of incongruence between node ages, as well as to recognize the range of overlap below the threshold for rejection of hypothesis congruence.
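
The lnBF profile described above can be sketched as follows. This is an illustrative Python example with invented marginal likelihoods, not output from beast or mcmctree; the function name is hypothetical and the threshold of lnBF > 1 corresponds to 2lnBF > 2.

```python
def congruence_range(ages, log_ml, threshold=1.0):
    """Compute lnBF per tested age and the age range that is not rejected.

    ages: ages assigned to the variable node.
    log_ml: ln marginal likelihood obtained for each of those ages.
    threshold: lnBF above which incongruence is considered supported.
    """
    best = max(log_ml)
    # lnBF relative to the best marginal likelihood across all tested ages.
    ln_bf = [best - ml for ml in log_ml]
    accepted = [age for age, bf in zip(ages, ln_bf) if bf <= threshold]
    return ln_bf, (min(accepted), max(accepted))


# Hypothetical values for five of the tested ages of the variable node.
tested_ages = [3, 4, 5, 6, 7]
tested_ml = [-2004.1, -2001.2, -2000.0, -2000.6, -2002.9]
profile, accepted_range = congruence_range(tested_ages, tested_ml)
print(accepted_range)
```

Plotting `profile` against `tested_ages` gives the kind of curve used here to confirm that lnBF rises as the two calibrations become more incongruent.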


Figure 1. (a) Rooted binary tree used for simulating sequence evolution. Time-scale represents arbitrary time units. (b) Hypothetical scenarios explored to test the performance of Bayes Factor Cluster Analysis (BFCA). Calibration priors were modelled as normal (Nr), exponential (Ex) or log-normal (Ln) distributions for beast analyses and using the skew-normal (SN) with varying asymmetry for mcmctree. Thick dashed lines represent calibrations congruent with real node ages and thin dashed lines those modeled as incongruent. Parameters defining distributions are indicated on each node. μ: mean, σ: standard deviation, a: lower bound, b: upper bound, ofs: offset. Normal distributions marked with an asterisk were truncated to avoid analytical problems due to possible exploration of negative height values.


Validation of Bayes Factor Cluster Analysis Performance Through Simulation

We investigated the performance of BFCA over a range of analytical conditions. We produced nine calibration scenarios, each of them with five nodes calibrated using a variety of density functions as may be carried out in a real-tree calibration exercise (Fig. 1b). Within each of these scenarios, three nodes were constrained to be mutually concordant and two to be incongruent relative to their real node ages.

Bayes Factor Cluster Analyses were carried out for the nine calibration scenarios with 90 simulated data sets generated based on an ultrametric tree of nine taxa and 20 units of time span (Fig. 1a; from Drummond et al. 2006). We used rateevolver 1.1 (Ho 2005) using a ULN clock model to generate 30 trees with moderate among-branch rate variation (ULNSD = 0·3, ULNmod) and 30 trees with high among-branch rate variation (ULNSD = 0·6, ULNhig). From each of these trees, 1000 nucleotide-long alignments were simulated using seq-gen 1.3.2 (Rambaut & Grassly 1997) under a GTR model. Similarly, 30 additional data sets were simulated from the original ultrametric tree to investigate the performance of the method under strict clock (SC) evolution. In all cases, phylogenetic dating was carried out with beast and mcmctree as indicated above (involving a total of 32 400 runs). When using a ULN relaxed clock model, we first estimated the mean ULNSD using a normal distribution (μ = 20, σ = 0·1) as an arbitrary calibration prior on the root node, and thereafter, we fixed the estimated value in subsequent analyses using pairwise calibrations (see above). We quantified: (i) the number of times that the BFCA recovered the three calibrations defined a priori as congruent and rejected the incongruent information (positive result); (ii) the number of cases where some of the a priori congruent calibrations were rejected (negative, type ρ result); and (iii) the number of cases when one or more incongruent calibrations were not rejected and were thus included in the final set along with all the congruent calibrations (uninformative, type σ result). We examined both the effect of the clock model on the rate of recovery of positive results and also the effect of varying lnBF thresholds (ranging from 0·2 to 4).
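
The three outcome classes can be expressed as a small Python function, shown here as a sketch with hypothetical names. The run count that follows is one decomposition that reproduces the reported total, assuming ten calibration pairs per scenario (all pairs among five calibrated nodes) and two MCMC runs per pairwise analysis in each program.

```python
from math import comb


def classify_outcome(selected, congruent, incongruent):
    """Classify one BFCA test as 'positive', 'rho' or 'sigma'."""
    selected = set(selected)
    if not set(congruent) <= selected:
        return "rho"       # at least one congruent calibration was rejected
    if selected & set(incongruent):
        return "sigma"     # all congruent kept, but an incongruent one too
    return "positive"      # exactly the congruent calibrations were kept


print(classify_outcome({"A", "B", "C"}, {"A", "B", "C"}, {"D", "E"}))
print(classify_outcome({"A", "B", "D"}, {"A", "B", "C"}, {"D", "E"}))
print(classify_outcome({"A", "B", "C", "D"}, {"A", "B", "C"}, {"D", "E"}))

# Assumed decomposition of the total number of runs.
data_sets = 90
scenarios = 9
pairs_per_scenario = comb(5, 2)
programs = 2
runs_per_analysis = 2
total_runs = (data_sets * scenarios * pairs_per_scenario
              * programs * runs_per_analysis)
print(total_runs)
```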

Bayes Factor Cluster Analysis of a Real Calibration Example: The Beetle Genus Carabus

After the reliability of the BFCA method was assessed on simulated data, we explored its utility on real data investigating the evolutionary rate of the nd5 gene in the genus Carabus (Coleoptera: Carabidae). The nd5 gene of Carabus has more than 3000 entries in public sequence databases (February 2013) and has been used repeatedly to date the evolution of this genus under several available calibration scenarios. However, the use of calibrations based on fossil and/or biogeographic data has led to inconsistent results (Prüser & Mossakowski 1998; Su et al. 1998; Tominaga et al. 2000; Osawa, Su & Imura 2004; Sota et al. 2005; Andújar, Serrano & Gómez-Zurita 2012). Thus, calibration of the nd5 tree of Carabus represents an adequate test case to apply our BFCA approach.

We built a data set using 37 nd5 sequences from public nucleotide sequence databases (Benson et al. 2010) and 21 newly generated homologous sequences from specimens selected to allow for the inclusion of several calibration points. Beetle DNA was extracted, purified and sequenced as indicated in Andújar, Serrano & Gómez-Zurita (2012) (Data S1, Supporting information). Species data, GenBank accession numbers and source for each sequence are indicated in Table S7 (Supporting information).

Alignment of protein-coding nd5 sequences did not require gaps, and the resulting manually aligned 904 nucleotide-long matrix of nd5 sequences from 58 species was used for subsequent phylogenetic analyses and the application of the BFCA test. The evolutionary model best fitting the nd5 data matrix was calculated with jModelTest (Posada 2008) and selected under the Akaike information criterion (Posada & Buckley 2004). Maximum-likelihood (ML), parsimony (MP) and Bayesian inference (BI) analyses were conducted to check the reliability of nodes used in subsequent calibration tests (details in Data S2, Supporting information).

A suite of calibration hypotheses representing different taxonomic splits in Carabus was available from the literature. We selected 16 such hypotheses which span the breakup of Gondwana, the formation of the Canary Islands, the evolution of the Western Mediterranean and the tectonic separation of the Japanese archipelago from the mainland (with two alternative time frameworks), as well as one taxonomically reliable Tertiary fossil (Table 1; Data S3, Supporting information). Calibration density functions on the age of the corresponding cladogenetic events were modelled in all subsequent calibration analyses in beast as described in Table 1.

Table 1. Calibration hypotheses employed to test the BFCA approach for the nd5 gene in the genus Carabus

| Calibration hypothesis (a) | Evolutionary event | Calibration event | Age of event (Ma) | Priors on node ages | 95% probability density | 95% HPD effective prior age (Final analysis) | 95% HPD posterior age (Final analysis) |
|---|---|---|---|---|---|---|---|
| Ca | Split between two Canarian endemic species: Carabus (Nesaeocarabus) coarctatus and C. (N.) abbreviatus | Volcanic emergence of Gran Canaria | 14·5 | Uniform (a = 0, b = 14·5) | 0·03–14·14 | 3·44–14·50 | 5·29–10·48 |
| F2 | Carabus (Autocarabus) cancellatus fossil | Messinian deposit of Cantal (France) | 5 | Log-normal (μ = 25, σ = 1·5, offset = 5) | 5·4–158·5 | 5·06–14·18 | 9·93–19·02 |
| J1a | Radiation of Damaster | Final disconnection of Japan from mainland | 3·5 | Truncated Normal (μ = 3·5, σ = 1, a = 0·1, b = 1000) | 1·55–5·46 | 3·19–5·87 | 2·29–4·32 |
| J2a | Radiation of Leptocarabus | Final disconnection of Japan from mainland | 3·5 | Truncated Normal (μ = 3·5, σ = 1, a = 0·1, b = 1000) | 1·55–5·46 | 3·19–5·85 | 1·26–3·13 |
| J3a | Radiation of Ohomopterus | Final disconnection of Japan from mainland | 3·5 | Truncated Normal (μ = 3·5, σ = 1, a = 0·1, b = 1000) | 1·55–5·46 | 3·47–5·86 | 2·37–4·26 |
| J4a | Split between subgenus Isiocarabus and Ohomopterus | Final disconnection of Japan from mainland | 3·5 | Truncated Normal (μ = 3·5, σ = 1, a = 0·1, b = 1000) | 1·55–5·46 | | |
| J1b | Radiation of Damaster | Initial disconnection of Japan from mainland | 15 | Normal (μ = 15, σ = 1) | 13·04–16·96 | | |
| J2b | Radiation of Leptocarabus | Initial disconnection of Japan from mainland | 15 | Normal (μ = 15, σ = 1) | 13·04–16·96 | | |
| J3b | Radiation of Ohomopterus | Initial disconnection of Japan from mainland | 15 | Normal (μ = 15, σ = 1) | 13·04–16·96 | | |
| J4b | Split between subgenus Isiocarabus and Ohomopterus | Initial disconnection of Japan from mainland | 15 | Normal (μ = 15, σ = 1) | 13·04–16·96 | 12·09–16·04 | 12·76–16·50 |
| M1 | Split between Carabus (Mesocarabus) riffensis and Iberian Mesocarabus | Opening Gibraltar strait | 5·33 | Exponential (μ = 0·5, offset = 5·3) | 5·31–7·14 | | |
| M2 | Split between Carabus (Eurycarabus) genei from Corsica and North African Eurycarabus | Opening Gibraltar strait | 5·33 | Exponential (μ = 0·5, offset = 5·3) | 5·31–7·14 | 5·30–5·92 | 5·30–6·87 |
| M3 | Split between two Carabus (Rhabdotocarabus) melancholicus subspecies | Opening Gibraltar strait | 5·33 | Exponential (μ = 0·5, offset = 5·3) | 5·31–7·14 | 5·30–5·90 | 5·30–6·49 |
| M4 | Split between two Carabus (Macrothorax) morbillosus subspecies | Opening Gibraltar strait | 5·33 | Exponential (μ = 0·5, offset = 5·3) | 5·31–7·14 | | |
| M5 | Split between two Carabus (Macrothorax) rugosus subspecies | Opening Gibraltar strait | 5·33 | Exponential (μ = 0·5, offset = 5·3) | 5·31–7·14 | | |
| NZ | Split between Pamborus and Maoripamborus | Separation of New Zealand from Australia-Antarctica | 85 | Normal (μ = 85, σ = 3) | 79·12–90·88 | | |

(a) See Supporting information Data S1 for additional information about calibration hypotheses.
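
The 95% probability density intervals listed for the normal and exponential priors can be reproduced from the prior parameters alone. The Python sketch below uses only the standard library; the function names are hypothetical, and it assumes that μ for the exponential prior is its mean and that the offset simply shifts the distribution.

```python
from math import log
from statistics import NormalDist


def normal_interval(mu, sigma, mass=0.95):
    """Central interval holding `mass` of a normal prior."""
    tail = (1 - mass) / 2
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


def exponential_interval(mean, offset, mass=0.95):
    """Central interval holding `mass` of an offset exponential prior."""
    tail = (1 - mass) / 2
    # Exponential quantile at probability p is -mean * ln(1 - p).
    return offset - mean * log(1 - tail), offset - mean * log(tail)


japan_initial = normal_interval(15, 1)          # calibrations J1b to J4b
new_zealand = normal_interval(85, 3)            # calibration NZ
gibraltar = exponential_interval(0.5, 5.3)      # calibrations M1 to M5
for interval in (japan_initial, new_zealand, gibraltar):
    print(tuple(round(bound, 2) for bound in interval))
```

A check of this kind is a quick way to confirm that a prior entered in the dating software covers the intended age range before running the pairwise analyses.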

To select the optimal clock and nd5 data partition models for the data, we conducted BF comparisons evaluating six different schemes in beast 1.5.4: (i) SC without partitioning [NP], (ii) SC and two partitions considering first and second codon positions together [2P], (iii) SC and each codon position as a different partition [3P], (iv) ULN and NP, (v) ULN and 2P, and (vi) ULN and 3P. We used a GTR + I + G substitution model with a gamma distribution estimated using four rate categories. A speciation Yule process was used as tree prior, and a normal distribution (μ = 100, σ = 0·1) was set as an arbitrary calibration prior for the root. Two independent MCMC chains were run for 50 million generations, sampling every 2000th generation. Log files were combined after removing 20% of the initial values as burn-in, and marginal-likelihood scores were estimated as described before. The estimated difference in marginal likelihood (lnBF) was interpreted as requiring at least a ten units increase per additional free parameter (p) (ratio lnBF/Δp higher than 10) before accepting a more complex model (Pagel & Meade 2004; Miller, Bergsten & Whiting 2009). We assumed one extra parameter for the ULN clock relative to the SC as suggested by Drummond et al. (2006).
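
The acceptance rule for more complex models can be written as a one-line test. The Python sketch below is illustrative only: the function name is hypothetical and the marginal likelihoods in the example are invented, not values from the Carabus analyses.

```python
def prefer_complex(ml_simple, ml_complex, extra_params, min_gain=10.0):
    """Accept the complex model only if lnBF per extra parameter > min_gain.

    ml_simple, ml_complex: ln marginal likelihoods of the two models.
    extra_params: additional free parameters of the complex model (Δp).
    """
    ln_bf = ml_complex - ml_simple
    return ln_bf / extra_params > min_gain


# Hypothetical comparison: a gain of 50 lnBF units for one extra parameter.
print(prefer_complex(-9800.0, -9750.0, extra_params=1))
# Hypothetical comparison: a gain of 20 lnBF units for four extra parameters.
print(prefer_complex(-9800.0, -9780.0, extra_params=4))
```

Under this rule the first comparison favours the more complex model and the second does not, because the improvement per added parameter is only five units.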

The selected clock and partition models were used to perform a BFCA of the 16 calibration hypotheses (four analyses including pairs of calibrations affecting the same node were excluded because they represented alternative time frameworks for the evolution of the Japanese archipelago; Table 1). The obtained subset of congruent calibration hypotheses along with the preferred clock and partition models were used to infer the phylogeny of the genus Carabus and to estimate the molecular rate of evolution for the nd5 gene. In this case, we ran two independent runs in beast as above (details in Table S8, Supporting information), but increased the number of gamma categories to ten to improve the accuracy in the estimation of the evolutionary rate. Additionally, we evaluated the effect of different calibration strategies on rate estimation, using each calibration individually and also the combination of all calibrations simultaneously. Rate values are indicated as their mean and 95% HPD interval.


Results

Evaluation of Incongruence Using Bayes Factors

A simulated dataset representative of the tree in Fig. 1a was investigated to evaluate the interaction of calibrations with increasing incongruence. The eight combinations of constant vs. variable node age priors produced very similar results independently of the methodology used, generally representing an increasing difference in marginal-likelihood scores as the calibration for the age of the variable node departed from its true value, but also depending on the age interval allowed by the corresponding probability density functions (Figs S2 and S3, Supporting information). Figure 2 shows an example of results obtained in beast when the calibration prior of node A was kept constant and that of node B varied along the age gradient.


Figure 2. Results of the initial evaluation of incongruence limits for each of the nine paired combinations of prior distributions on nodes A and B from Fig. 1a (Nr: normal; Ex: exponential; Ln: log-normal) under SC, ULN with fixed ULNSD = 0·3 and ULN with fixed ULNSD = 0·6. Constraints on node A remain constant while the age constraint of node B varies along a range of values (X axes). Node A prior: Nr (μ = 5, σ = 0·5); Ex (μ = 0·5, offset = 5); Ln (μ = 12, σ = 1, offset = 5). Node B prior: Nr (μ = y, σ = 0·5); Ex (μ = 0·5, offset = y); Ln (μ = 12, σ = 1, offset = y), whereby y = 0·5, 1, 1·5, 2, 2·5, 3, 3·5, 4, 4·5, 5, 5·5, 6, 6·5, 7, 7·5, 8, 8·5, 9, 9·5, 10, 11, 13. Y axes: lnBF. Vertical line indicates the real age of the variable node (node B). The remaining examples are given in Supporting information Figs S2 and S3.


Any combination of normal (Nr) and exponential (Ex) or narrow skew-normal (SNn) calibrations produced curves with lnBF values increasing asymmetrically on both sides of a minimum lnBF value (≈0) centred on the real age of the variable node, as its constrained age slid towards either the past or the present. This trend represents the increase in lnBF value as the variable calibration prior departs from the real value until it eventually exceeds the threshold selected to indicate positive evidence in favour of one hypothesis. The usage of wide calibration distributions, as we did for log-normal (Ln) and wide skew-normal (SNw) distributions, resulted in higher chances of congruence between calibrations, as expected. These wide calibrations showed an asymmetrical effect on hypothesis concordance, because their shape includes a hard bound and a long tail. Consequently, in cases where both calibrations were modelled with wide distributions, no trend towards incongruence was observed for the tested range of age differences.

The main difference between clock models was the noticeable and expected decrease in difference between marginal-likelihood values when a relaxed clock was used. Relaxed clocks broadened the range for hypothesis congruence to the point that, when high rate variation (ULNSD = 0·6) was allowed, incongruence was not reached in most cases. Otherwise, SC and ULN with moderate among-branch rate variation (ULNSD = 0·3) produced similar results, but expectedly, the calibration incongruence for positive evidence favouring one hypothesis over another was clearly broadened with the relaxed clock (Fig. 2). In the analytical conditions used for this example, the range of age differences between nodes where the test did not discriminate a pair of calibrations as inconsistent varied between 1·7 and 7·7 time units (depending on the relative positions of nodes; see 'Discussion') in the SC case. This range of age differences varied between 4·5 and 22·4 time units in the ULN allowing moderate clock relaxation (Tables S9 and S10, Supporting information). Thus, clock relaxation accommodated part of the incongruence and, as a consequence, the difference in age between calibrations required to produce positive evidence for one hypothesis over another increased almost threefold. Finally, when high clock relaxation was allowed, the same trend was accentuated and much higher age differences between calibrations were required to produce positive evidence for a hypothesis based on lnBF values.

Bayes Factor Cluster Analysis on Simulated Data

The performance of the BFCA method was evaluated quantitatively based on the rate of correct selection of congruent calibrations and rejection of incongruent ones on a simulated phylogeny of known relationships and ages of branching events. Test performance was investigated under different calibration scenarios using different probability density functions as priors (Fig. 1b), as well as simulating strict or relaxed clocks, the latter with either moderate or high rate variation among branches (ULNmod and ULNhig, respectively).

Each of the nine calibration scenarios was investigated for 90 iterations of simulated data (thirty for each clock model; 810 BFCA tests in total). Results distinguishing between SC, ULNmod and ULNhig cases for a threshold of lnBF > 1 are summarized in Table 2 for beast and Table 3 for mcmctree. Calibration scenarios simulated under SC produced 95·2% of positive results in beast and 88·2% in mcmctree. The scenario that presented more difficulties for the correct identification of congruent calibrations was 3a, producing 87·0% and 66·6% of positive results in beast and mcmctree, respectively (the others ranged 93–100% in beast and 76·6–100% in mcmctree).

Table 2. Summary of the results for the 810 BFCA runs to test method performance through simulation in beast. Columns gather the results under a strict clock model (SC), and uncorrelated log-normal relaxed clock models with moderate (ULNmod) and high (ULNhig) among-branch rate variation. Columns 1a to 3c indicate the nine different calibration scenarios tested (as shown in Fig. 1b). Positive results are represented with a + sign, negative type ρ results with a – sign and uninformative type σ with a 0. The mean of the estimated standard deviation of the log-normal distribution of evolutionary branch rates obtained from beast analyses using a normal distribution (μ = 20, σ = 0·1) as an arbitrary prior for the age of the root is also shown (Mean ULNSD)
Positives | 1a | 1b | 1c | 2a | 2b | 2c | 3a | 3b | 3c | % Positives
SC | 28 | 29 | 30 | 29 | 28 | 30 | 26 | 29 | 28 | 95·2% (257/270)
ULNmod | 9 | 10 | 8 | 19 | 17 | 11 | 8 | 14 | 13 | 40·4% (109/270)
ULNhig | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 2·6% (7/270)
Table 3. Summary of the results for the 810 BFCA runs to test method performance through simulation in mcmctree. Columns gather the results under a strict clock model (SC), and uncorrelated log-normal relaxed clock models with moderate (ULNmod) and high (ULNhig) among-branch rate variation. Columns 1a to 3c indicate the nine different calibration scenarios tested (as shown in Fig. 1b). Positive results are represented with a + sign, negative type ρ results with a − sign and uninformative type σ with a 0. The mean of the estimated standard deviation of the log-normal distribution of evolutionary branch rates obtained from mcmctree analyses using a normal distribution – SN (0·2, 0·01, 0) – as an arbitrary prior for the age of the root is also shown (Mean ULNSD)
Positives | 1a | 1b | 1c | 2a | 2b | 2c | 3a | 3b | 3c | % Positives
SC | 23 | 29 | 30 | 29 | 27 | 24 | 19 | 27 | 30 | 88·2% (238/270)
ULNmod | 5 | 8 | 3 | 14 | 9 | 6 | 5 | 13 | 8 | 26·3% (71/270)
ULNhig | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0%

Simulations under the ULNmod clock model generated 40·4% of positive results, 55·6% of type σ uninformative results and 4% of type ρ negative results in beast. In the mcmctree analyses, we obtained 27·3% of positive results, and 72·3% and 0·4% of σ and ρ results, respectively. Interestingly, both programs identified the same most problematic scenarios (scenarios 3a, 1a and 1c). For both methodologies, there was a strong negative correlation (Pearson's r = −0·89, P < 0·001) between the proportion of positive results and the estimated mean of the ULNSD parameter. Thus, when the estimated ULNSD mean was below 0·35, the proportions of positive results were 75·7% and 81·5% in beast and mcmctree, respectively. The proportion of positive results fell to 39·2% and 17·1% for ULNSD values between 0·35 and 0·46. Finally, simulated ULNhig data sets produced a very low proportion of positive results (2·6% in beast, 0% in mcmctree), with a high proportion of type σ uninformative results (97·0% in beast and 99·3% in mcmctree). Following the same trend, positive results were only obtained in cases where the estimated mean of ULNSD was relatively low (<0·59).

Test stringency affected the recovery of positive results, although the effect depended on analytical conditions and on the program used. Lower lnBF thresholds yielded a lower proportion of positive results, whereas raising the threshold to lnBF > 1·5 already resulted in 99% of positive results when an SC was applied, for both beast and mcmctree analyses (Table 4). Increased clock relaxation progressively reduced the threshold maximizing the proportion of positive results, from lnBF = 1·0 (ULNSD < 0·35) to 0·3 (ULNSD > 0·46) in the case of beast analyses and from lnBF = 0·7 to 0·3 in those using mcmctree, the latter approach being more sensitive to small differences in likelihood values when discriminating among incongruent hypotheses (Table 4). The best result in this series of analyses using ULNmod relaxed clocks was 89% of positive results with a threshold lnBF = 0·7 in mcmctree analyses for low ULNSD values (Table 4). For ULNSD > 0·46, BFCA almost never produced positive results with lnBF ≥ 1 (6%), but it reached a percentage of correct answers above 50% for a threshold 0·3 < lnBF < 0·4 (Table 4). Finally, increasing the acceptance threshold clearly reduced the proportion of type ρ negative results, which fell below 5% depending on analytical conditions (e.g. lnBF < 1·0 for SC under beast).

Table 4. BFCA results obtained for simulated data using beast and mcmctree and different acceptance threshold values. The results are shown as (x/y/z) representing the percentages of positive (only congruent calibration points are included in the BFCA solution; x), of negative type ρ results (some congruent calibration is excluded from the BFCA solution; y) and uninformative type σ results (some incongruent calibration is not excluded from the BFCA solution; z), respectively
Threshold | beast: SC | beast: ULNSD < 0·35 | beast: 0·35 < ULNSD < 0·46 | beast: ULNSD > 0·46 | mcmctree: SC | mcmctree: ULNSD < 0·35 | mcmctree: 0·35 < ULNSD < 0·46 | mcmctree: ULNSD > 0·46
a Threshold maximizing the number of positive results.
b Threshold with type ρ negative results ≤ 5% (low probability of rejecting a congruent prior).

Bayes Factor Cluster Analysis for nd5 Data in Carabus

Phylogenetic analyses of the 904 nt nd5 matrix of Carabus using different phylogenetic inference methods resulted in very similar topologies, with differences only found at the deepest nodes, which were not resolved under MP. However, even in this case, no topological incongruence was observed for highly supported nodes (Fig. 3). Nodes used for calibration received posterior probabilities between 0·92 and 1·00. Only the split represented by J4 received low bootstrap support and moderate posterior probability (0·71), but this split between Isiocarabus and Ohomopterus was confirmed by phylogenetic analyses using multiple loci (Sota & Ishikawa 2004). The BF comparison of Bayesian analyses for six different partitioning schemes and strict vs. relaxed clock models identified the 2P data partitioning scheme and SC model as optimal for nd5 data in Carabus (Table 5). Under this partitioning scheme, relaxed clock analyses yielded a mean ucld.stdev value of 0·089 and a truncated distribution abutting 0 (95% HPD interval 3·65 × 10⁻⁶–0·20), which suggests a negligible departure of the data from clock-like behaviour.

Table 5. Bayes factor comparisons for selection of clock model and partitioning scheme based on beast analyses for the nd5 gene in the genus Carabus
Partition/clock scheme | p | Marginal likelihood | NP/SC | NP/ULN | 2P/SC | 2P/ULN | 3P/SC | 3P/ULN
p, total number of free parameters required for each model and partitioning scheme; SC, strict clock; ULN, uncorrelated log-normal clock; NP, no codon partitioning; 2P, two partitions considering first and second codon positions together; 3P, three partitions considering each codon position as a different partition.
a Above the diagonal: lnBF/Δp (where Δp: difference in total number of free parameters between two models).
b Below the diagonal: lnBF.


Figure 3. Ultrametric tree obtained with beast for the Carabus nd5 data set with the set of calibration hypotheses C, F, J1a, J2a, J3a, J4b, M2 and M3, selected using BFCA. Numbers on nodes indicate support with Bayesian/ML/MP analyses, respectively. Bars correspond to the 95% Highest Posterior Density intervals for node ages.


Thus, the BFCA procedure was applied to nd5 data under 2P/SC, considering the set of 16 potential node calibration hypotheses, including alternative age constraints for nodes related to the origin of the Japanese fauna. Table 6 summarizes the marginal-likelihood values for each of the resulting 116 pairwise combinations of calibration hypotheses and their lnBF value relative to the optimal result. BFCA identified a single group of eight calibration constraints, namely C, F, J1a, J2a, J3a, J4b, M2 and M3, whose pairwise analyses consistently yielded lnBF values below the threshold of 1. Indeed, any combination of nine or more calibration points (or any other combination of eight hypotheses) included pairwise analyses with lnBF > 2·6, strongly arguing against combinability (Kass & Raftery 1995). Thus, this group of eight dating node constraints was selected as the most inclusive set of concordant calibration hypotheses.
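As a minimal sketch of how the lnBF values relate to the marginal likelihoods, the Python snippet below subtracts the log marginal likelihood of a pairwise analysis from the best value found across all pairs and applies the lnBF > 1 threshold. The function name is hypothetical, and the inputs are the rounded marginal likelihoods reported in Table 6, so the results agree with the tabulated lnBF values only to about two decimals.

```python
def ln_bayes_factor(best_ln_ml: float, pair_ln_ml: float) -> float:
    """lnBF of the best-supported pairwise analysis over another pairwise analysis."""
    return best_ln_ml - pair_ln_ml


BEST = -9705.98   # J4b + M3, best marginal likelihood in Table 6
C_F = -9706.54    # C + F
C_J1B = -9714.49  # C + J1b

for label, ln_ml in (("C + F", C_F), ("C + J1b", C_J1B)):
    ln_bf = ln_bayes_factor(BEST, ln_ml)
    verdict = "incongruent" if ln_bf > 1.0 else "congruent"
    print(f"{label}: lnBF = {ln_bf:.2f} ({verdict})")
```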

Table 6. Values of marginal likelihood (above diagonal) and lnBF comparisons (below diagonal) relative to the best marginal-likelihood value for each pair of calibration hypotheses proposed for the Carabus nd5 data set. The pair of calibration hypotheses that result in the best marginal likelihood value are underlined. Best marginal likelihood value in bold. Codes for calibration hypotheses are as in Table 1. Hypotheses included in the most inclusive set of congruent calibrations selected by BFCA are marked with an asterisk
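Hypothesis | C | F | J1a | J1b | J2a | J2b | J3a | J3b | J4a | J4b | M1 | M2 | M3 | M4 | M5 |
C* |  | −9706·54 | −9706·46 | −9714·49 | −9706·38 | −9721·21 | −9706·58 | −9715·52 | −9706·56 | −9706·55 | −9706·18 | −9706·00 | −9706·25 | −9707·04 | −9707·16 | −9713·11
F* | 0·555 |  | −9706·67 | −9706·76 | −9706·75 | −9707·35 | −9706·82 | −9707·07 | −9706·80 | −9706·52 | −9706·18 | −9706·54 | −9706·61 | −9706·82 | −9707·29 | −9706·14
J1a* | 0·482 | 0·684 |  | n/a | −9706·80 | −9718·68 | −9706·49 | −9712·99 | −9706·78 | −9706·13 | −9706·69 | −9706·59 | −9706·56 | −9707·24 | −9707·67 | −9710·55
J1b | 8·513 | 0·781 | n/a |  | −9707·44 | −9707·52 | −9712·10 | −9706·24 | −9743·10 | −9722·17 | −9725·13 | −9717·94 | −9717·05 | −9709·42 | −9708·64 | −9706·78
J2a* | 0·394 | 0·769 | 0·815 | 1·456 |  | n/a | −9706·94 | −9707·65 | −9706·57 | −9706·59 | −9706·90 | −9706·42 | −9706·59 | −9706·20 | −9706·12 | −9706·24
J2b | 15·232 | 1·366 | 12·698 | 1·543 | n/a |  | −9718·56 | −9707·66 | −9751·53 | −9730·02 | −9731·15 | −9723·55 | −9722·59 | −9714·10 | −9712·97 | −9711·03
J3a* | 0·598 | 0·840 | 0·507 | 6·123 | 0·961 | 12·583 |  | n/a | −9706·87 | −9706·26 | −9706·60 | −9706·52 | −9706·19 | −9706·70 | −9707·11 | −9710·04
J3b | 9·539 | 1·090 | 7·014 | 0·258 | 1·667 | 1·677 | n/a |  | −9731·92 | −9725·47 | −9724·91 | −9718·06 | −9717·13 | −9709·66 | −9709·10 | −9706·88
J4a | 0·578 | 0·818 | 0·794 | 37·120 | 0·590 | 45·552 | 0·887 | 25·944 |  | n/a | −9706·68 | −9710·23 | −9711·72 | −9721·03 | −9725·66 | −9744·05
J4b* | 0·568 | 0·537 | 0·151 | 16·189 | 0·611 | 24·038 | 0·276 | 19·492 | n/a |  | −9708·76 | −9706·22 | −9705·98 | −9708·04 | −9708·95 | −9721·39
M1 | 0·196 | 0·199 | 0·710 | 19·152 | 0·916 | 25·168 | 0·622 | 18·925 | 0·697 | 2·780 |  | −9707·62 | −9708·73 | −9714·79 | −9716·39 | −9722·51
M2* | 0·015 | 0·560 | 0·605 | 11·963 | 0·441 | 17·574 | 0·534 | 12·080 | 4·248 | 0·238 | 1·642 |  | −9706·13 | −9709·49 | −9711·27 | −9716·29
M3* | 0·266 | 0·631 | 0·580 | 11·070 | 0·607 | 16·607 | 0·211 | 11·149 | 5·741 | 0·000 | 2·748 | 0·148 |  | −9708·61 | −9709·83 | −9715·02
M4 | 1·058 | 0·841 | 1·262 | 3·443 | 0·222 | 8·120 | 0·721 | 3·676 | 15·049 | 2·055 | 8·812 | 3·505 | 2·633 |  | −9706·08 | −9707·45
M5 | 1·177 | 1·311 | 1·689 | 2·657 | 0·136 | 6·985 | 1·132 | 3·120 | 19·676 | 2·971 | 10·409 | 5·291 | 3·850 | 0·102 |  | −9706·84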
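The selection of the most inclusive congruent set can be illustrated with a brute-force search for the largest group of hypotheses in which every pair has lnBF at or below the threshold. This is only a sketch on six of the hypotheses from Table 6 and is not the heuristic implemented in the R package bfca; the function name is hypothetical, and treating the J1a/J1b pair (n/a in the table) as incompatible is an assumption.

```python
from itertools import combinations

# lnBF values copied from Table 6 for six of the calibration hypotheses.
# J1a/J1b is "n/a" in the table (alternative ages for one node); leaving it
# out of the dictionary makes the pair count as incompatible (assumption).
PAIRWISE_LNBF = {
    ("C", "F"): 0.555, ("C", "J1a"): 0.482, ("C", "J1b"): 8.513,
    ("C", "J4b"): 0.568, ("C", "M3"): 0.266,
    ("F", "J1a"): 0.684, ("F", "J1b"): 0.781, ("F", "J4b"): 0.537,
    ("F", "M3"): 0.631,
    ("J1a", "J4b"): 0.151, ("J1a", "M3"): 0.580,
    ("J1b", "J4b"): 16.189, ("J1b", "M3"): 11.070,
    ("J4b", "M3"): 0.000,
}


def largest_congruent_sets(pairwise, threshold=1.0):
    """Return every largest set whose pairwise lnBF values are all <= threshold."""
    lnbf = {frozenset(pair): value for pair, value in pairwise.items()}
    names = sorted({name for pair in pairwise for name in pair})
    # Try the largest group size first and stop at the first size with a hit.
    for size in range(len(names), 1, -1):
        hits = [
            set(group)
            for group in combinations(names, size)
            if all(
                lnbf.get(frozenset(pair), float("inf")) <= threshold
                for pair in combinations(group, 2)
            )
        ]
        if hits:
            return hits
    return []


print(largest_congruent_sets(PAIRWISE_LNBF))
```

Exhaustive enumeration is only practical for a handful of hypotheses, but it makes the congruence criterion explicit: J1b is left out because it conflicts with C, J4b and M3.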

The calibrated Bayesian nd5 phylogeny of Carabus using the set of calibrations selected by BFCA resulted in a mean evolutionary rate for nd5 of 0·0154 (95% HPD: 0·0112–0·0198) substitutions per site per Ma per lineage (subs./s./Ma/l.), equivalent to 3·08% divergence between two lineages per Ma, and a time for the most recent common ancestor (TMRCA) of Carabus and Calosoma of 36·2 Ma (95% HPD: 27·64–45·72). For comparative purposes, the evolutionary rate and age were also estimated with the simultaneous use of all calibration constraints using alternative Pliocene or Miocene scenarios for Japanese taxa. The estimated mean rates were markedly lower than those obtained with the BFCA selection of calibration hypotheses, 0·0129 (95% HPD: 0·0095–0·0162) subs./s./Ma/l. and 0·0073 (95% HPD: 0·0056–0·0091) subs./s./Ma/l., respectively, and the TMRCA of Carabus and Calosoma was markedly older: 77·7 Ma (95% HPD: 71·94–83·65) and 84·7 Ma (95% HPD: 76·22–94·96), respectively. Calibration hypotheses analysed individually produced rate estimates ranging from 0·0021 subs./s./Ma/l. (95% HPD: 0·0012–0·0032; constraint J2b) to 0·8254 subs./s./Ma/l. (95% HPD: 0·1602–1·4033; constraint C) and TMRCA of Carabus and Calosoma ranging from 268·8 Ma (95% HPD: 150·60–399·27; constraint J2b) to 0·9 Ma (95% HPD: 0·33–2·12; constraint C) (Table S11, Supporting information).
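The equivalence between the per-lineage rate and the pairwise divergence quoted above is a factor of two, because both lineages accumulate substitutions after the split. A small Python sketch of the conversion follows; the function name is hypothetical.

```python
def pairwise_divergence_percent(rate_per_lineage: float) -> float:
    """Convert a per-lineage rate (subs./s./Ma/l.) into % pairwise divergence per Ma."""
    return 2.0 * rate_per_lineage * 100.0


# Mean rate and 95% HPD bounds estimated with the BFCA-selected calibrations.
for rate in (0.0154, 0.0112, 0.0198):
    print(f"{rate} subs./s./Ma/l. = {pairwise_divergence_percent(rate):.2f}% per Ma")
```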


Discussion

Importance of Using Reliable Calibrations

A major advance in the process of phylogenetic dating has been to include calibration data as probability distributions that account for the different sources of uncertainty in the age of calibrations (Inoue, Donoghue & Yang 2010). Yet, the selection and appropriate use of calibrations has been reported as probably the most persistent problem in dating analyses (Heads 2005) with a major effect on estimated ages (Inoue, Donoghue & Yang 2010). Because of potential bias of using single calibration points and favoured by the development of relaxed clock models, the simultaneous use of all information available to calibrate a tree has been recommended as a way to minimize these problems (Yang & Rannala 2006; Ho & Phillips 2009). However, the indiscriminate application of all available calibration data can lead to biased consensus solutions in which incongruence is accommodated at the cost of inferring highly artificial levels of rate variation, producing both incorrect local and global rates even when relaxed clock approaches are used. Although some recent work indicates that the incorporation of incorrect calibrations has a limited effect when relaxed clocks are used in combination with soft bounds distributions (Yang & Rannala 2006), systematic studies assessing the impact of the incongruent calibrations are still lacking. However, it is plausible that the inclusion of extremely incongruent calibrations will have an effect on the estimation of divergence times and substitution rates, especially when taxonomic sampling and genetic data are limited.

Thus, calibration analyses could, in principle, be improved by filtering the data to identify sets of calibration points that are consistent with one another, addressing possible dating errors by mutual corroboration of independent data. Our main contribution to the calibration problem through BFCA is the development of a heuristic test that enables the identification of these sets of congruent calibrations while taking into account the uncertainty associated with each of them, which is incorporated through a particular probability density function.

Bayes Factor Cluster Analysis Performance

We have explored the effects on marginal-likelihood scores of the simultaneous application of pairs of calibrations with varying degrees of conflict under different scenarios of clock relaxation. Our main observation is that, as predicted, conflicting calibrations affecting relative branch lengths produce a progressive decrease in marginal likelihood in comparison with unconstrained (or optimally constrained) analyses; that is, marginal likelihood is negatively correlated with the magnitude of the conflict.

Our simulations using pairs of calibration points showed some interesting trends relevant to understanding BFCA. One such trend is that the sensitivity of the test depends on the relative position of the calibrated nodes. Thus, for pairs of calibrations represented by normal, exponential or narrow skew-normal functions, deeper nodes seem to admit a wider departure from their real age before one hypothesis is favoured over another. Furthermore, we also observed that the more separated the nodes are on the tree, the harder it is to detect the incongruence. Nonetheless, the most important trend is related to the use of relaxed vs. strict clock models. In this regard, we found that the discrimination power of BFCA is highly dependent on the level of among-branch rate variation (i.e. the value of ULNSD). When moderate levels of rate variation were allowed (ULNSD = 0·3), the calibration needed a departure from the real value almost three times higher than in the case of SC to yield values of lnBF > 1. High levels of rate variation (ULNSD = 0·6) worsened the trend, allowing the accommodation of any calibration in the range of incongruence examined. We demonstrated that calibration conflicts are accommodated by increasing the variation of evolutionary rates among branches when using relaxed clocks, and therefore, such variation is inversely correlated with the discrimination power of the BFCA.

We also evaluated the performance of BFCA through simulation under different clock models and calibration scenarios. We distinguished two situations to describe BFCA departures from the expected result: negative (type ρ) and uninformative (type σ) results. In the first case, correct calibrations are recognized as conflictive, whereas in the second case, incongruent calibrations are not recognized as such and are included in the BFCA solution along with the correct calibrations. As expected from our simulations using pairs of calibrations, the clock model is the main factor conditioning the power of the method. Thus, BFCA performance is very high for data sets simulated under an SC, identifying the right set of congruent calibration hypotheses in 95·2% and 88·2% of cases in our tests with beast and mcmctree, respectively (threshold of lnBF > 1; Tables 2 and 3). For simulations with moderate rate variation, the proportion of positive results is maximized with lnBF thresholds between 0·3 and 1, depending on the estimated ULNSD of the data and the program used. The optimum threshold for the data sets showing ULNSD < 0·35 allows detecting the correct set of calibrations in 76% of cases in beast and 89% of cases in mcmctree, while maintaining a relatively low proportion of negative (type ρ) results (12% and 2% of cases, respectively). These observations highlight some of the limitations of BFCA when dealing with alignments that show high variation in substitution rates among taxa. The power of the BFCA is low under these conditions because the high rate variation among the branches of the tree, together with the limited information of the sequences (1000 bp in the simulated analyses), allows the accommodation of incongruent calibrations at a negligible cost in likelihood. Ongoing investigations show that this undesirable effect is highly dependent on the length of the sequences, in agreement with the findings of Yang & Rannala (2006). These preliminary analyses indicate that incongruent calibrations have a stronger effect on the likelihood of analyses when using longer, more informative sequences, consequently providing better performance of BFCA with relaxed clock models (C. Andújar, V. Soria-Carrasco & J. Gómez-Zurita, unpublished data).

As expected, the proportion of positive results deduced from simulated data depends on the lnBF threshold used (Table 4). In the case of our SC simulated data sets, increasing the threshold to 1·5 resulted in the recovery of 99% of correct answers, and therefore, thresholds between 1 and 1·5 (Kass & Raftery 1995) would be recommended for data sets under strict clock evolution, independently of the software used. In cases where a relaxed clock applies, there seems to be a compromise threshold between a situation where congruent priors are excluded with high probability (negative results) for low acceptance thresholds, and one where incongruent priors start to be included in the solution (uninformative results) for high acceptance thresholds. Thus, for simulated data sets with ULNSD > 0·35, the recovery of positive results with lnBF = 1 is very low, but it improves to 50–60% for 0·3 < lnBF < 0·5, while for ULNSD < 0·35, the situation does not seem to differ notably from the SC scenario except for better performance at slightly lower acceptance thresholds, around lnBF = 1·0. It might then be advisable to use lnBF thresholds as low as lnBF = 0·5 for data sets with high among-branch rate variation. As indicated above, higher thresholds reduce negative and increase uninformative results. The proportion of negative results can be interpreted as the assumed risk of excluding a correct calibration, and, consequently, higher thresholds will ensure the recovery of all the correct calibrations (but at the cost of increasing the probability of including incorrect ones). For example, in the case of ULNSD > 0·35 data sets, the threshold of 0·5 implies a proportion of negative results between 0% and 9% that needs to be taken into account. It is important to emphasize that threshold customization as implemented in the R package bfca (threshold parameter) is an important feature that allows exploring BFCA results under different analytical conditions, an advisable procedure when dealing with complicated cases where among-branch rate variation is significant.

Bayes Factor Cluster Analysis and nd5 Evolutionary Rate in Carabus

The study of the evolutionary rate of the nd5 gene in the genus Carabus illustrates some of the problems mentioned above and stresses the suitability of the BFCA to identify consistent calibration data. While the data fitted a clock-like evolution, available calibration hypotheses analysed individually produced evolutionary rate estimates ranging from 0·0021 to 0·8254 subs./s./Ma/l. and TMRCA of Carabus and Calosoma ranging from 268·8 to 0·9 Ma (Table S11, Supporting information). Facing this range of potential results spanning two to three orders of magnitude, one recurrent solution has been to average over all points (Yang & Rannala 2006; Ho & Phillips 2009), which would result in a somewhat intermediate value.

Among 16 available calibration hypotheses for Carabus, the use of BFCA identified eight as mutually congruent, including (i) the age F of the fossil C. cancellatus, (ii) J1a, J2a and J3a, placing the diversification of Japanese species of subgenera Damaster, Leptocarabus and Ohomopterus around 3·5 Ma, (iii) J4b, concordant with the Miocene split of mainland Carabus lineages and Japanese Ohomopterus, (iv) M2 and M3, corresponding to the split at the end of the Messinian of African and European species of Eurycarabus and subspecies of Carabus melancholicus and (v) C, assigning 14·5 Ma as the maximum age for the split of Nesaeocarabus within the Canary Islands. Fig. 3 shows an ultrametric tree calibrated simultaneously using the selected calibrations for these eight nodes.

Calibration hypotheses discarded by BFCA can be reinterpreted in the light of the dated phylogeny of the group. The allopatric distribution of European Mesocarabus lineages and North African C. (Mesocarabus) riffensis, as well as the allopatric ranges of European and North African subspecies of both C. (Macrothorax) morbillosus and C. (Macrothorax) rugosus, have been interpreted as the result of Messinian vicariance (e.g. Prüser & Mossakowski 1998). However, our results indicate that the split occurred earlier in Mesocarabus, in the Late Miocene (M1: ~10·5 Ma; 95% HPD: 7·6–13·5 Ma), consistent with a Betic-Riffian origin of the group (Andújar et al. 2012), and more recently in Macrothorax, which started to diverge at the Plio-Pleistocene border (M4: 3 Ma; 95% HPD: 1·8–4·3 Ma; and M5: 2·5 Ma; 95% HPD: 1·6–3·6 Ma), well after the refilling of the Mediterranean Sea. Particularly for the latter, post-Messinian transmarine dispersal is not at odds with the natural history of these large beetle species. In contrast to most other Carabus species, C. morbillosus has relatively long and innervated wings (Ortuño & Hernández 1992), probably enhancing its dispersal power. Supporting the idea of recent dispersal, highly similar mtDNA haplotypes were found between Tunisian and Sardinian individuals of this species (Prüser & Mossakowski 1998). For faunas in remote continental islands of the Southern Hemisphere, it is tempting to invoke continental drift as the explanation for allopatric distributions, and this was the case for Australian Pamborus and Maoripamborus from New Zealand (Sota et al. 2005). Nonetheless, dating this event with an Upper Cretaceous age interval as modelled here is highly inconsistent with most other calibration points, as revealed by BFCA, and would dramatically reduce the evolutionary rate for nd5 to 0·0044 subs./s./Ma/l. (95% HPD: 0·0031–0·0059), slower than the slowest mtDNA protein-coding genes in Coleoptera (nad4L; Pons et al. 2010). The effect of sequence saturation and limitations of the evolutionary model in estimating the age of this node (older than the others studied here) may affect BF estimates, favouring its exclusion from the optimum solution. It is worth noting that BFCA does not remove the need for caution when estimating the ages of nodes beyond the limit imposed by the saturation of markers.

The evolutionary rate for nd5 in Carabus estimated after BFCA testing, 0·0154 (95% HPD: 0·0112–0·0198) subs./s./Ma/l., is higher than values previously reported for this or other mtDNA genes of Carabus, but remarkably similar to the estimated rate for this gene in Coleoptera (0·0168 subs./s./Ma/l.; 95% HPD: 0·0086–0·0279; Pons et al. 2010). The discrepancies with previous attempts can be explained by the use of flawed calibration points (e.g. tectonic vicariance for Maoripamborus; Sota et al. 2005), by constraints applied to wrong nodes (e.g. Japanese radiations of Carabus defined by their sister relationship with continental relatives; Su et al. 1998; Tominaga et al. 2000; Osawa, Su & Imura 2004), or by inappropriate corrections of genetic distances among species (e.g. underestimated nd1 divergences for Rhabdotocarabus; Prüser & Mossakowski 1998).

BFCA As a Test for Selecting Congruent Calibrations

Our BFCA approach proved to be a useful tool to objectively select concordant calibration constraints, with evidence supplied by BF comparisons. BFCA exploits the analytical strengths of Bayesian methodologies, allowing the application of relaxed clock models and the incorporation of both fossil and geologic data in the form of calibrations defined by probability density functions. Thus, BFCA overcomes the limitations of methods that rely on point calibrations, such as the fossil cross-validation method of Near, Meylan & Shaffer (2005) or Marshall's (2008) method to deal with taphonomic biases in the fossil record. There are other parametric approaches, like the Bayesian extension of Marshall's method by Dornburg et al. (2011), based on the selection of the best fossil to compare with the complete pool of calibrations. However, these methods are limited to calibration exercises based on fossils. This limitation does not apply to BFCA, given that parametric prior distributions can represent any source of dating information, from specific fossils to a geographical event relevant in phylogenetic terms (e.g. vicariance). There are yet other methods that use probabilistic distributions as priors and evaluate the accuracy of a proposed calibration with respect to others, which are assumed to be reliable (Sanders & Lee 2007; Pyron 2010). Unfortunately, information about the reliability of calibration hypotheses, the crux of tree calibration practice, is often lacking. Instead, BFCA operates on the assumption that reliable independent calibrations must necessarily be mutually congruent and will tend to outnumber erroneous ones, especially if they have been selected following rigorous procedures (Parham et al. 2011). BFCA can be applied with both strict and relaxed clock models, and it is able to identify incongruent calibration hypotheses even in cases with moderate rate variation among branches. Besides, the application of BFCA with relaxed clocks can take advantage of the possibility of including priors on among-branch rate variation based on information from previous analyses of similar molecular markers, restricting the artefactual overestimation of rate variation required to accommodate incongruence among calibrations. Lastly, it is worth noting that, although we centred our analyses on certain models implemented in beast and mcmctree, the method is equally applicable to the results produced by any other Bayesian phylogenetic inference or dating method, as long as it is amenable to including calibration information in some way.

One practical drawback of BFCA is that it is computationally demanding because of the number of independent analyses required. However, the application of BFCA remains feasible even for problems involving many calibration points as long as parallel computing using small high-performance clusters is available. This can be achieved because the method is based on pairwise analyses instead of carrying out BF comparisons among all possible combinations of hypotheses, which quickly becomes computationally intractable. For example, the analysis of 16 calibration hypotheses in the case of Carabus required 120 independent beast runs with the BFCA approach, in contrast to the 65 519 runs that would be needed when considering all possible combinations of calibrations.
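The run counts quoted for the Carabus example follow from elementary combinatorics, as the Python sketch below shows for 16 hypotheses. The variable names are illustrative, and the subtraction of the four alternative-age pairs reflects the n/a entries of Table 6.

```python
from math import comb

n_hypotheses = 16

# One analysis per pair of calibration hypotheses.
pairwise_runs = comb(n_hypotheses, 2)

# Every subset containing at least two hypotheses: all subsets minus the
# empty set and the singletons.
all_combination_runs = 2**n_hypotheses - n_hypotheses - 1

# The alternative ages J1a/J1b to J4a/J4b apply to the same nodes, so these
# four pairs cannot be analysed together (n/a in Table 6).
alternative_age_pairs = 4
informative_pairs = pairwise_runs - alternative_age_pairs

print(pairwise_runs, informative_pairs, all_combination_runs)
```

The pairwise design grows quadratically with the number of hypotheses, whereas the exhaustive comparison grows exponentially.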

It should be noted that the BFCA is a congruence-based method that should be used with three or more calibrations, and evidence for congruence is only admitted if at least three calibrations are recovered as congruent. Another aspect that may potentially affect the BFCA is the way in which calibrations are incorporated by the Bayesian software as node prior ages into the analyses. For example, it is known that the interaction between the calibration age and the tree prior may result in effective joint priors different from the defined calibration densities (Heled & Drummond 2012). How this may affect the reliability of calibration analyses and the evaluation of calibration hypotheses with the BFCA approach should be addressed in the future. Lastly, accuracy in the estimation of marginal likelihoods could potentially have an influence on our methodology. Harmonic mean estimators such as the variant used here (Suchard, Weiss & Sinsheimer 2001) have been shown to lead to erroneous results when used to compare evolutionary models that differ in the number of their parameters (Lartillot & Philippe 2006; Baele et al. 2012a,b). However, BFCA results are based on the comparison of analyses where the number of parameters remains constant, that is, on the evaluation of identical models with different calibration priors, and therefore, this bias is not known to affect our approach. Particularly in the case of the simulations used here, HME estimates always reached good convergence, as reflected also in the negligible differences between pairs of runs for the same analysis (Fig. S1, Supporting information), and very convincingly in the clear trends observed in the simulations related to our proof of principle. In any case, it will be of particular interest to further investigate the behaviour of the BFCA by using more accurate methods for the estimation of marginal likelihoods, such as thermodynamic integration (Lartillot & Philippe 2006) or stepping-stone sampling (Xie et al. 2011).
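For readers unfamiliar with harmonic mean estimation, the plain estimator can be sketched in a few lines of Python working on the log scale. This is the textbook form, not the specific variant used in our analyses, and the function name and sample values are purely illustrative.

```python
import math


def log_harmonic_mean_estimate(log_likelihoods):
    """Log of the harmonic mean of likelihoods sampled from the posterior.

    Computes ln(n) - ln(sum(exp(-lnL_i))) with a log-sum-exp shift so that
    large negative log-likelihoods do not overflow.
    """
    n = len(log_likelihoods)
    negated = [-value for value in log_likelihoods]
    shift = max(negated)
    log_sum = shift + math.log(sum(math.exp(value - shift) for value in negated))
    return math.log(n) - log_sum


# Hypothetical posterior log-likelihood samples, for illustration only.
samples = [-9706.0, -9708.0, -9707.0]
print(log_harmonic_mean_estimate(samples))
```

Because the sum is dominated by the lowest sampled likelihoods, the estimate is sensitive to rare poorly fitting samples, which is one reason why more accurate estimators are of interest.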

Bayes Factor Cluster Analysis addresses several of the limitations affecting the selection of calibration hypotheses and clock calibration. However, it is important to keep in mind that the results of BFCA are highly dependent on the quality of calibration hypotheses and on decisions on prior modelling for the analyses, which rely on previous choices independent of the BFCA method. For example, uncertain calibrations modelled with wide prior distributions, such as the broad log-normal distribution that we used, have little effect on the shape of the tree and therefore have a high chance of being included in the final BFCA solution. Moreover, a biased selection of incorrect calibration constraints (e.g. due to taphonomic bias in the case of fossils) or their systematic incorrect placement in a phylogeny will produce erroneous divergence times and evolutionary rate estimates, independently of the method employed for filtering hypotheses (Parham et al. 2011). An example of the latter situation is shown by the effects of alternative ages applied to putative vicariance events in the case of the geologically complex history of the Japanese archipelago for the genus Carabus. Other important factors that will also require further scrutiny are the effect of saturation due to multiple substitutions, topological relationships among taxa, and the influence of different proportions of congruent and incongruent calibrations, but they are all amenable to systematic study using our new analytical BFCA tool and software.


Acknowledgements

This work was supported by the Spanish Ministry of Science and Innovation (CGL2006/06706 and CGL2009-10906 to C.A. and J.S., and CGL2008-00007 to J.G.-Z., the latter also with support of the European Regional Development Fund). C.A. received support from an FPU predoctoral studentship (Spanish Ministry of Education). Anna Papadopoulou and Ignacio Ribera (Institute of Evolutionary Biology, Barcelona, CSIC) discussed with us many of the ideas developed here, read a preliminary version of this work and provided very constructive criticism that significantly improved the study. Brent Emerson (IPNA, CSIC, Spain), Alfried Vogler (Natural History Museum, London), Carlos Ruiz (University of Murcia) and Paula Arribas (University of Murcia) also helped with their comments and advice. Achille Casale (University of Sassari) helped with the identification of some Carabus taxa. Thanks are due to Obdulia Sánchez, Ana Asensio, José Luis Lencina (University of Murcia) and Gwenaelle Genson (CBGP Montpellier) for their technical assistance. Most of the computational analyses were performed using the Ben Arabi supercomputer of the Fundación Parque Científico de Murcia (Murcia, Spain) and the Iceberg HPC cluster of the University of Sheffield (Sheffield, UK).


Supporting Information

Figure S1. Histogram of the lnBF values obtained for the comparisons of the two independent runs conducted on each of the 13 068 simulation analyses in beast (a) and mcmctree (b).

Figure S2. Results of the initial evaluation of incongruence limits in beast.

Figure S3. Results of the initial evaluation of incongruence limits in mcmctree.

Table S1. Prior distributions used in the initial evaluation of incongruence limits in beast.

Table S2. Range of age values used for the initial evaluation of incongruence limits in beast.

Table S3. Prior distributions used in the initial evaluation of incongruence limits in mcmctree.

Table S4. Range of age values used for the initial evaluation of incongruence limits in mcmctree.

Table S5. Data about parameters used in the simulated calibration analyses conducted in beast.

Table S6. Data about parameters used in the simulated calibration analyses conducted in mcmctree.

Table S7. Data about specimen and nd5 sequences employed to test the performance of BFCA on a real DNA data set in the genus Carabus.

Table S8. Data about prior parameters used in the BFCA calibration analyses of the nd5 Carabus data set conducted in beast.

Table S9. Results of the initial evaluation of incongruence limits in beast.

Table S10. Results of the initial evaluation of incongruence limits in mcmctree.

Table S11. Evolutionary rates and tree root ages estimated for the Carabus nd5 data set.

Data S1. DNA extraction, purification and PCR reaction.

Data S2. Information on phylogenetic analyses conducted for the genus Carabus nd5 data set.

Data S3. Calibration hypotheses used for dating of genus Carabus.

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.