Phylogenetics to help predict active metabolism

This paper shows how to build predictive models involving phylogenetic information to estimate metabolic traits such as active metabolic costs. Fish swimming cost is often estimated from body mass and swimming speed. The parameters of the relationships between these variables and swimming cost vary among species because each species has its own morphology and physiology. It is now widely recognized that traits are phylogenetically structured. Using new statistical approaches, it is possible to both correct swimming cost models for statistical phylogenetic non-independence and use the inherent phylogenetic signal to improve models. With these models one can extend, to a larger set of species, empirical knowledge about traits that are difficult to obtain; swimming cost is one such trait. Swimming cost accounts for a large and variable component of a fish energy budget, yet models have only been developed from observations performed on a few species, thereby constraining the scope of bioenergetic models. Here, we propose a method where body mass and swimming speed are used together with phylogeny to predict swimming cost. The resulting model explained a large proportion of the variation (90%) in the forced swimming cost of 16 fish species submitted to forced swimming experiments. We also compared phylogenetically-explicit predictions for forced swimming experiments with experimental results of routine swimming for five species, among which one was not used to build the model. Results confirmed that forced swimming underestimates the cost of unsteady swimming. The phylogenetic modeling could be used to estimate other variables of interest in bioenergetic studies.


INTRODUCTION
Active metabolism is an important and variable component of energy budgets (Boisclair and Leggett 1989, Boisclair and Rasmussen 1996, Aubin-Horth et al. 1999). The in situ estimation of fish swimming cost relies on models that circumvent the technical difficulty of measuring outcomes of metabolic processes (e.g., oxygen consumption) on individual organisms swimming in the wild. These models are generally loglinear functions of variables such as body mass and swimming speed, and fitting them requires data obtained from respirometry experiments. These experiments consist of measuring oxygen consumption by depletion inside a sealed, oxygen-tight vessel called a respirometer (Blazka 1958). To obtain a swimming cost model, series of such experimental assays have to be performed with specimens of various masses swimming at different speeds.
When swimming cost models are fitted using data from multiple species, observations cannot be considered independent because the species share different degrees of common ancestry with one another (Felsenstein 1985). Also, the parameters of swimming cost models may vary among species as a consequence of their particular morphological and physiological traits, and the differences among parameter values may depend on the phylogenetic distances among the species. Therefore, a swimming cost model estimated using data from one species may not give a good estimate of the swimming cost of another species, whereas a general swimming cost model estimated using multiple-species data will have biased parameter values unless the lack of statistical independence among observations is taken into account during estimation.
The effect of body mass and swimming speed on swimming cost is fairly simple to understand. A heavier specimen experiences larger hydrodynamic drag than a lighter one as a consequence of its larger cross-section and length. The consequences of swimming speed for a fish are the same as for any object moving in a viscous environment: the organism experiences more drag at high than at low speed. Greater drag implies that more work has to be done to travel a given distance at a given speed. While experimentally inducing variation in body mass is easily done by selecting individuals of various sizes, the representation of the effect of swimming speed involves many more assumptions regarding fish locomotory behavior. Under the assumption that the conditions in their habitat allow fish to swim steadily, the experiments exploit the fact that many fishes are positively rheotactic (i.e., they will turn to face into an oncoming current) and consist of making them hold a position against a steady flow of water. We will refer to the latter as "forced swimming" experiments. Whether or not (and under what circumstances, if any) steady swimming is a reasonable assumption for modeling real swimming cost has been questioned (Webb 1991). When swimming is unsteady, costs associated with accelerations and turns are substantial and have to be accounted for in experiments. One such approach consists of giving fish enough freedom of movement to adopt swimming patterns resembling those they would display under natural conditions (Tang and Boisclair 1993). We will refer to these as "routine swimming" experiments. Responses to experimental conditions designed to induce spontaneous swimming movements may vary among species as a consequence of their particular behavioral traits.
Species traits (e.g., morphological, physiological, behavioral) are the outcome of evolutionary processes and are therefore structured with respect to phylogeny (Felsenstein 1985). Methods now exist to use the among-species phylogenetic patterns of trait variation purposefully, to predict trait values (Phylogenetic Eigenvector Maps [PEM]; Guénard et al. 2013). Knowing the phylogeny and trait values for a set of species, one can quantify the manner in which a trait evolved using phylogenetic eigenfunctions. The eigenfunctions describe the candidate patterns of among-species trait variation across the phylogeny and are used to model that variation. The resulting model can be used to predict trait values for species where they are unknown. Phylogenetic modeling is especially useful when trait values are difficult, time-consuming, or expensive to obtain. When applied with proper care, the PEM method allows researchers and practitioners to maximize the use of information acquired at great expense.
Swimming cost measurements are expensive to obtain and fish form a very diverse group that has evolved over 530 million years. Hence, multiple-species phylogenetic modeling appears to be a pragmatic alternative to building separate swimming cost models for every species of interest. The method also helps solve the problem for rare or endangered species, on which experimentation would likely be forbidden, or species that are costly to obtain (e.g., abyssal species). It allows one to build synthetic models involving multiple species while implicitly correcting for the parameter estimation bias associated with using data from multiple species that have not evolved independently of one another.
The purpose of the present study is to illustrate how swimming cost information from published studies, combined with that obtained from DNA sequences and phylogenetic estimation methods, can be used to build multi-species, phylogenetically-explicit and unbiased swimming cost models.

Data sources
The data set used to develop the multi-species swimming cost model was collected by Boisclair and Tang (1993) from 24 published studies. In addition to forced and routine swimming experiments, the original data set includes a third type called "directed swimming" experiments, in which fish were trained to follow a target area moving above the respirometer (Muir et al. 1965, Wohlschlag and Cameron 1967, Muir and Niimi 1972). For the sake of simplicity, and since the directed swimming experiments were performed on only two species (Kuhlia sandvicensis: Aholehole; Muir et al. 1965, Muir and Niimi 1972; and Lagodon rhomboides: Pinfish; Wohlschlag and Cameron 1967), we decided to focus our analysis on the species with forced swimming cost data (i.e., the type of experiment with the most abundant data), excluding data from directed swimming experiments, and to use the routine swimming data as an illustrative application of phylogenetic swimming cost models. For the latter application, we also added data from two studies on the Brook trout (Salvelinus fontinalis; Tang and Boisclair 1995, Tang et al. 2000). Each entry of the data set synthesizes the result of a swimming cost experiment and features fish body mass (g fresh mass), swimming speed (cm s⁻¹), the type of experiment (forced or routine swimming), and the net cost of swimming in terms of oxygen consumption (mg O₂ h⁻¹). The latter was obtained by subtracting standard metabolic rates and osmoregulation costs from gross rates of oxygen consumption measured in respirometers (see Boisclair and Tang 1993 for details).
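The net cost described above is a plain subtraction, and a minimal Python sketch may help make the bookkeeping explicit. The function name, the argument names, and the numbers are illustrative assumptions; they are not taken from Boisclair and Tang (1993).

```python
def net_swimming_cost(gross_rate, standard_rate, osmoregulation_cost):
    """Net swimming cost from a respirometry measurement.

    All three arguments must share the same units (e.g., mg O2 per hour).
    The names are hypothetical; the text only states that standard
    metabolism and osmoregulation costs are subtracted from the gross rate.
    """
    net = gross_rate - standard_rate - osmoregulation_cost
    if net < 0:
        raise ValueError("net cost is negative; check the input rates")
    return net


# Hypothetical measurement: 12.0 gross, 4.5 standard, 0.5 osmoregulation
example_net = net_swimming_cost(12.0, 4.5, 0.5)
```

Keeping the three components as separate arguments makes it harder to mix gross and net rates when entries from different studies are pooled.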

Phylogenetic analysis
We used DNA sequences downloaded from the U.S. National Center for Biotechnology Information's GenBank website to estimate the fish phylogeny (http://www.ncbi.nlm.nih.gov/genbank/; database: Nucleotide). Sequences encompass the mitochondrial genomes and the nuclear 28S, 18S, and 5.8S rRNA transcripts and internal transcribed spacers (ITS 1 and ITS 2). We focused on complete sequences as much as possible and, otherwise, took the longest partial sequences available. When data on our species were not available, we borrowed sequences from related species in the same genus or family, unless other species in the same genus or family were present in our data set. Multiple DNA sequence alignments were performed separately for each gene using the program Muscle version 3.8.31 (Edgar 2004; http://www.drive5.com/muscle). We estimated phylogenies using a maximum likelihood method (Felsenstein 1981, Felsenstein and Churchill 1996; program: fdnaml from package EMBOSS version 6.3.1; Rice et al. 2000) on the super-alignment obtained by concatenating sequences for all genes. Species were entered in decreasing order of their number of aligned base pairs to help the tree construction algorithm, so that species with sparse DNA data were added to trees after the species for which more complete DNA data were available.
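The concatenation and ordering steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: per-gene alignments are held as dictionaries, species missing from a gene are padded with gap characters, and coverage is counted as the number of non-gap positions. The function names and the toy sequences are hypothetical and do not reproduce the Muscle or EMBOSS workflow.

```python
def concatenate_alignments(gene_alignments):
    """Build a super-alignment from per-gene alignments.

    gene_alignments: list of dicts mapping species name -> aligned sequence;
    within one gene all sequences must have the same length. A species
    missing from a gene is padded with gap characters ('-').
    """
    species = sorted({sp for gene in gene_alignments for sp in gene})
    combined = {sp: "" for sp in species}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must be aligned")
        width = lengths.pop()
        for sp in species:
            combined[sp] += gene.get(sp, "-" * width)
    return combined


def order_by_coverage(super_alignment):
    """Species sorted by decreasing number of non-gap positions."""
    def coverage(sp):
        return sum(base != "-" for base in super_alignment[sp])
    # Ties are broken alphabetically so the order is reproducible
    return sorted(super_alignment, key=lambda sp: (-coverage(sp), sp))


# Toy data: three hypothetical species, two genes
genes = [
    {"sp_a": "ACGT", "sp_b": "AC-T", "sp_c": "A--T"},
    {"sp_a": "GGCC", "sp_b": "GGCA"},
]
aligned = concatenate_alignments(genes)
input_order = order_by_coverage(aligned)
```

In this toy case the best-covered species comes first and the species lacking the second gene comes last, which mirrors the input order described in the text.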

Modeling approach
The modeling approach used was similar to that described in Guénard et al. (2011). It involved the computation of a multiple regression model in which swimming cost descriptors were used together with phylogenetic eigenvectors. Phylogenetic eigenvectors are linearly independent variables obtained from the structure of the phylogenetic tree (topology and branch lengths; Diniz-Filho et al. 1998, Desdevises et al. 2003, Guénard et al. 2013); they allow one to represent phylogenetic patterns of variation, for instance in multiple regression. We preferred to use phylogenetic eigenvectors rather than other alternatives (e.g., Martins and Hansen 1997, Garland and Ives 2000, Bruggeman et al. 2009) because of the relative simplicity of that approach and its ability to assess interactions between the phylogeny and swimming cost descriptors. Swimming cost descriptors include body mass (M) and swimming speed (V). The original phylogenetic tree had a single tip per species and therefore had to be expanded because multiple experiments had been performed for each species. We obtained an expanded tree having a tip for each experimental result by binding, to each tip of the species tree, the root of a star tree having branch lengths of 0 and as many tips as the number of experimental results for that species. This approach assumes that phylogenetic patterns of variation at the intra-specific level (i.e., among the different experimental specimens of the same species) are negligible.
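The tree expansion can be illustrated on a Newick string. The sketch below is an assumption-laden stand-in written in Python, not the R code used in the study: it replaces each species tip by a star of zero-length branches, one per experimental result. The function name, the tip-naming scheme (`A_1`, `A_2`, ...), and the toy tree are hypothetical.

```python
import re


def expand_tips(newick, n_experiments):
    """Replace every species tip by a zero-length star tree of experiments.

    newick: species tree with branch lengths, e.g. "((A:1,B:1):0.5,C:1.5);"
    n_experiments: dict mapping species label -> number of experimental results
    """
    def star(match):
        name = match.group(1)
        n = n_experiments[name]
        if n == 1:
            # A single result needs no star; the tip is only renamed
            return f"{name}_1"
        tips = ",".join(f"{name}_{i}:0" for i in range(1, n + 1))
        return f"({tips})"

    # A tip label follows "(" or "," and precedes its ":" branch length;
    # internal node labels follow ")" and are therefore left untouched.
    return re.sub(r"(?<=[(,])([^(),:;]+)(?=:)", star, newick)


species_tree = "((A:1,B:1):0.5,C:1.5);"
expanded = expand_tips(species_tree, {"A": 2, "B": 1, "C": 3})
```

Because the added branches have length 0, all experiments of a species sit at the same phylogenetic position, which is the stated assumption of negligible intra-specific phylogenetic variation.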
The swimming cost model was described by the following linear equation: where C_{i,type,M,V} is the swimming cost of specimen i with mass M (g wet mass) belonging to species sp, forced to swim at speed V (cm s⁻¹). Coefficients b_... are the regression parameters associated with the different components of the model (b_0 is the intercept of the model; × in subscripts denotes parameters associated with interactions), u_{sp,...} are the loadings of species sp on eigenvectors, j are the eigenvectors involved in modeling the main effect of the phylogeny, k are the eigenvectors involved in modeling the interaction between body mass and phylogeny, l are the eigenvectors involved in modeling the interaction between swimming speed and phylogeny, and e_i is a normal deviate with mean 0 and variance σ_e.
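Read together, the definitions in this paragraph suggest a model of the following form. It is offered as a reconstruction under assumptions rather than as the published equation: the log10 transformation of cost, mass, and speed is inferred from the loglinear form mentioned in the Introduction, the coefficient labels b_M, b_V, b_j, b_{M×k}, and b_{V×l} are illustrative, and the cost subscript is written with sp to match the species index used in the definitions.

```latex
\log_{10} C_{i,sp,M,V} = b_0 + b_M \log_{10} M + b_V \log_{10} V
  + \sum_{j} b_{j}\, u_{sp,j}
  + \sum_{k} b_{M \times k}\, u_{sp,k} \log_{10} M
  + \sum_{l} b_{V \times l}\, u_{sp,l} \log_{10} V
  + e_i
```

The three sums correspond, in order, to the main effect of the phylogeny, the interaction between body mass and phylogeny, and the interaction between swimming speed and phylogeny.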
To avoid overfitting and ensure that the resulting model is as general as possible, we regularized the multiple regression model using ridge regression (see Legendre and Legendre [2012] for a description of the method). We estimated the ridge regression penalty factor that gave the best out-of-sample predictions, which we assessed using jackknife cross-validation: we removed one species at a time, each time estimating the model with the remaining species and making predictions for the removed species using the resulting model. That procedure was repeated for all species in the data set. Comparison between the observed and predicted forced swimming costs was performed using the cross-validation R-squared calculated as follows:

$$R^2_{cv} = 1 - \frac{\sum_i (y_i - y_{pred,i})^2}{\sum_i (y_i - \bar{y})^2}$$

where y_i and y_pred,i are the observed and predicted values of swimming cost, respectively, for individual i and ȳ is the mean observed swimming cost. The R²cv takes the value 1 for a perfect prediction and the value 0 when the mean square prediction error equals the mean square deviation from the mean (i.e., the observed variance), in which case the model is no better than simply taking the mean. It is therefore interpreted in the same way as an adjusted coefficient of determination (R²adj) in regression analysis. The R²cv has no lower bound, as there is no theoretical limit to how badly a model may predict reality. The penalty factor was estimated as that maximizing R²cv, by gradient-based optimization using the bound-constrained (to allow only for positive penalties) L-BFGS-B algorithm (Byrd et al. 1995). In addition to the prediction coefficient presented above, we assessed the accuracy of the model for individual predictions i by computing the deviation factor, dev_i, obtained as follows:

$$dev_i = \begin{cases} y_{pred,i} / y_i - 1 & \text{if } y_{pred,i} \geq y_i \\ 1 - y_i / y_{pred,i} & \text{if } y_{pred,i} < y_i \end{cases}$$

That factor has no measurement units and indicates the number of times the model overestimates (dev_i > 0) or underestimates (dev_i < 0) the observed swimming cost. Values close to 0 indicate accurate predictions.
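The two accuracy measures and the species-wise jackknife can be sketched in Python as follows. This is a minimal illustration, not the R code used in the study. It assumes that the deviation factor is the predicted-to-observed ratio minus one for overestimates and the mirrored quantity for underestimates; the overestimation branch reproduces the two extreme deviations reported for the Goldfish and the Sockeye salmon. All function names are hypothetical.

```python
def r2_cv(observed, predicted):
    """Cross-validation R-squared: 1 - SS(prediction error) / SS(about the mean)."""
    mean_obs = sum(observed) / len(observed)
    ss_err = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_err / ss_tot


def deviation_factor(observed, predicted):
    """Signed number of times a prediction over- (+) or underestimates (-)."""
    if predicted >= observed:
        return predicted / observed - 1.0
    return 1.0 - observed / predicted


def leave_one_species_out(species_labels):
    """Yield (held_out, train_indices, test_indices), one split per species."""
    for held_out in sorted(set(species_labels)):
        train = [i for i, sp in enumerate(species_labels) if sp != held_out]
        test = [i for i, sp in enumerate(species_labels) if sp == held_out]
        yield held_out, train, test


# Observed and predicted costs of the two most extreme cases in the data set
goldfish_dev = round(deviation_factor(0.193, 4.90), 1)  # mg O2 per hour
sockeye_dev = round(deviation_factor(2.7, 61.5), 1)     # micrograms O2 per hour

# Toy labels: two experiments on species "a", one on species "b"
splits = list(leave_one_species_out(["a", "a", "b"]))
```

Holding out whole species, rather than single experiments, is what makes the resulting R²cv a measure of how well the model predicts a species it has never seen.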
All calculations other than sequence alignment and tree estimation were performed using the R language for statistical computing (R Development Core Team 2014) and R packages ape (Paradis et al. 2004), MASS (Venables and Ripley 2002), and MPSEM (Guénard et al. 2013).

Data mining
The data set we obtained comprises 976 experiments on 17 fish species reported in 26 studies (Table 1). These experiments were performed on specimens with masses ranging from 1.1 to 1432 g and swimming at speeds ranging from 0.314 to 143 cm s⁻¹; their swimming costs ranged from 2.7 µg O₂ h⁻¹ to 923.7 mg O₂ h⁻¹. Data were sparse for most species. The widest ranges of body mass and swimming speed were observed for the Sockeye salmon (Oncorhynchus nerka; body mass: 3.38-1432 g, swimming speed: 0.314-143 cm s⁻¹) while the second widest ranges were obtained for the Rainbow trout (Oncorhynchus mykiss; body mass: 27.0-471 g, swimming speed: 10.2-103 cm s⁻¹), both belonging to the same genus. The other species featured narrower ranges of swimming speeds, generally spanning factors of 10 or less, and much narrower ranges of body masses: three species were represented by only a single value of body mass whereas most other species were represented by 2 or 3 values spanning factors of 2-5. Among these data, we found 855 values, representing 16 species, from forced swimming experiments (the Brook trout, Salvelinus fontinalis, had no forced swimming cost value), which we used to build the phylogenetic forced swimming cost model. Comparisons between phylogenetically-explicit forced swimming predictions and routine swimming observations were performed using the remaining 121 values (representing five species) from routine swimming experiments.
We found three genes whose sequences were available for all 17 species: cytochrome oxidase subunit I (9 complete, 8 partial), cytochrome b (13 complete, 4 partial), and DNA coding for the 16S ribosomal RNA (11 complete, 6 partial). Complete mitochondrial genomes were available for nine species whereas the sequence coding for the small subunit (12S) ribosomal RNA was available for 15 species (9 complete, 6 partial). Sequences for nuclear ribosomal RNA and associated internal transcribed spacers were sparse: 18S rRNA was available for eight species while the sequence for internal transcribed spacer 2 (that between 5.8S and 28S rRNA) was only available for a single species and therefore could not be used to estimate a phylogeny. Species shared a minimum of 135 and a maximum of 4457 base pairs with one another (median value: 849 base pairs).

Phylogenetic analysis
We rooted the estimated phylogenetic tree on infra-class Teleostei (Fig. 1). The first bi-partition occurs between the ostariophysians, for which the Goldfish (Carassius auratus, family: Cyprinidae) is the sole representative, and super-orders Protacanthopterygii and Acanthopterygii. The separation of the latter super-orders immediately follows, with super-order Protacanthopterygii represented by the four salmonids. Among the

Phylogenetic model
We estimated 15 phylogenetic eigenvectors (named u_1-u_15) from the sub-tree of the 16 species for which forced swimming cost data were available; then we estimated the prediction scores for the Brook trout, S. fontinalis, using its phylogenetic location with respect to the other species (Fig. 2). The phylogenetic eigenvectors represent particular patterns of phylogenetic variation that were selected to intervene in the model. It is noteworthy that the eigenvectors describe not only the phylogenetic patterns associated with the topology but also the branch length structure of the tree.

Fig. 1. The phylogenetic tree estimated by maximum likelihood using a selection of mitochondrial and nuclear ribosomal DNA sequences obtained from the NCBI GenBank.

Fig. 2. The phylogenetic eigenvectors that were selected to model swimming cost: (A) Species loadings on the eigenfunctions showing the patterns of phylogenetic variation that they represent, (B) legend associating species loadings to the circle sizes (absolute values proportional to the surface area of bubbles) and colors (black and white markers: negative and positive species loadings, respectively). Prediction scores for the Brook trout (Salvelinus fontinalis), which is used exclusively to make predictions, are highlighted in gray.
The estimated ridge regression penalty factor was 38.66, with an R²cv of 0.81 obtained from the set of cross-validation trials that maximized predictive power. The R²adj (adjusted for the number of parameters in the model) of the phylogenetic forced swimming cost model was 0.90 whereas the R²adj for the partial model, with only the (log10) body mass and the (log10) swimming speed, was 0.85 (Figs. 3, 4A).
The most extreme model deviation was an overestimation (dev = 24.4) for an experiment with a 62 g Goldfish swimming at 25 cm s⁻¹, which was reported by Smit et al. (1971) to have had an oxygen consumption rate of 0.193 mg h⁻¹, with the model prediction being 4.90 mg h⁻¹. The second most extreme value was also an overestimation (dev = 21.8) of swimming cost for an experiment with a 50 g Sockeye salmon swimming at 0.314 cm s⁻¹, which was reported by Brett (1964) to have consumed oxygen at a rate of 2.7 µg h⁻¹ (the lowest value in the whole data set) while the prediction was 61.5 µg h⁻¹. All other predictions (853 out of 855 = 99.7%) were within a deviation factor of ±10, and about three quarters (73.5%) within a deviation factor of ±1. On a specific basis, the median absolute deviation

Comparison to routine swimming
Comparison between the specific estimates of forced swimming cost and routine swimming cost indicates that most values were found at a relatively constant distance above the 1:1 log-log line (Fig. 4B) and that forced swimming cost underestimated routine swimming cost by a deviation factor of −6.4 on average (K. sandvicensis: −9.5, C. auratus: −7.2, L. macrolepis: −7.0, S. fontinalis: −5.6, and O. mykiss: −1.9). That finding agrees with that of Boisclair and Tang (1993), whose results suggest that forced swimming costs underestimate routine swimming costs by factors ranging from 6.4 to 14.0.

DISCUSSION
In the present study, we showed that phylogenetic modeling can be used to enhance our ability to predict elements of metabolism. We illustrated that possibility using fish swimming cost, but adaptations of the same approach may be used to predict other relevant quantities for other types of organisms. For instance, parameters of models predicting standard or basal metabolic rate, urinary and fecal losses, maximum consumption rate, or swimming capacity may also vary among species and may feature phylogenetic structures that can be estimated from phylogeny. The number of species used here (17: 16 to build the forced swimming cost model and 1 used exclusively for swimming cost comparisons) is modest in comparison to the number of extant teleost species (≈27000), showing that phylogenetic modeling is achievable with a fairly small number of species.
It is worth mentioning that the quality of the available data set, in terms of the coverage of body mass values and swimming speeds for the different species, was somewhat remote from the ideal. The ideal data set to reliably discriminate the actual effects of body mass, swimming speed, and phylogeny would have consisted of a set of standard body masses repeated for a standard set of swimming speeds, with evenly replicated body mass-swimming speed combinations, repeated for all species. Such a stringent sampling scenario would have ensured the absence of collinearity among descriptors, especially between the experimental conditions (body mass and swimming speed) and phylogeny. In the present study, we found that body mass was strongly related to phylogeny (R²adj = 0.67, F15,839 = 115.5, p < 0.0001), as was, though to a lesser extent, swimming speed (R²adj = 0.17, F15,839 = 12.63, p < 0.0001); body mass and swimming speed were only slightly correlated (R²adj = 0.025, F1,839 = 23.00, p < 0.0001). Although body mass at a particular life stage is a trait that we expect to be related to phylogeny, in experimental settings this factor could be artificially controlled (i.e., by selecting fish of particular sizes) to reduce the collinearity between the phylogenetic eigenvectors and body mass. In the present study, we used ridge regression to mitigate that issue as much as numerically possible, yet it was not possible to do so entirely given the sparse coverage of different body masses (Fig. 3). Because of the irregular manner in which body mass and swimming speed are spread among the species, it is likely that their joint apparent contribution to the model (R²adj = 0.85) is exaggerated. A greater fraction of that variation could well have been associated with phylogenetic eigenvectors, had the sampling been more regularly distributed. We regard that situation as a likely limitation of exploiting meta-analytical data sets, stemming from the fact that the original purpose of the data was different from that of the meta-analysis: the original studies focused on a particular range of sizes and swimming speeds of a single species (or two).
Phylogenetic bioenergetic models could be implemented with relatively little information insofar as that information is correctly targeted at representing the broadest possible range of conditions (e.g., body mass, swimming speed, temperature) and spanning broad phylogenies, including, for instance, other Osteichthyes such as species of sturgeons in the case of fish models. For example, one could define four different masses, e.g., 0.01, 0.1, 0.5, and 1 times the mean adult size; and four different speeds, e.g., 0.1, 0.5, 1, and 2 body lengths s⁻¹. Repeating each treatment on different specimens, say three, and performing experiments with, say, 20 species would amount to a sample size of 960 (4 masses × 4 speeds × 3 specimens × 20 species), which is similar to the one we had in this study. Such a data set would, however, form a better design for model construction: the orthogonality of the controlled variables, body mass and swimming speed, in the model would ensure that their contributions to the model are not artificially distorted by sampling-related issues but, instead, result from the species' own features (Guénard et al. 2014). Similar rationales would apply to phylogenetic modeling of standard metabolic rate, swimming capacity, or maximum consumption rate, with the standard set of swimming speeds being replaced by a standard set of temperatures.
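The balanced design proposed above can be enumerated directly, which also confirms the sample size. The following Python sketch uses the masses, speeds, replicate number, and species number given in the text; the species names are placeholders and the variable names are illustrative.

```python
from itertools import product

mass_fractions = [0.01, 0.1, 0.5, 1.0]  # multiples of the mean adult size
speeds = [0.1, 0.5, 1.0, 2.0]           # body lengths per second
replicates = range(1, 4)                # three specimens per treatment
species = [f"species_{i:02d}" for i in range(1, 21)]  # 20 placeholder names

# Full factorial design: one row per planned experiment
design = list(product(species, mass_fractions, speeds, replicates))
n_experiments = len(design)

# Balance check: count how often each mass-speed combination occurs
counts = {}
for _, mass, speed, _ in design:
    counts[(mass, speed)] = counts.get((mass, speed), 0) + 1
```

Every mass-speed combination occurs the same number of times across species, which is the orthogonality property that keeps the contributions of body mass and swimming speed from being confounded with each other or with phylogeny.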
The comparison of forced and routine swimming cost, using this time an analysis taking phylogeny into account, further confirmed the conclusion of Boisclair and Tang (1993) that forced swimming cost models underestimate fish routine swimming cost to a large extent. Routine swimming involves spontaneous accelerations and turns that are standard features of the locomotory behavior of most fish species (e.g., Webb 1991). Most of the information available on fish swimming costs comes from forced swimming experiments. Nevertheless, it is important for users of swimming cost models to experimentally confirm that the context of their study enables their target fishes to swim steadily (i.e., at a constant speed, without accelerations and turns). A failure to meet that assumption would lead users astray, by a large margin, about the actual size of swimming costs.
Provided that sufficient effort is deployed to collect the necessary data, we regard phylogenetic modeling as a keystone method that can allow biologists to apply bioenergetic modeling in practice to species other than the well-studied ones and in contexts where multiple species are often involved. With phylogenetic models, the great efforts that have to be deployed to acquire information on metabolic rates, which traditionally involves performing numerous experiments, can be optimized, since variation in metabolic rates among the species of a group can be explicitly represented by phylogenetic eigenvectors used in an explanatory manner. For instance, a study improving our knowledge of fish swimming costs by adding two more species to the existing 16-species forced swimming cost model may be regarded as a more relevant addition to knowledge than a study providing two single-species models because the former would give us the ability to predict swimming costs for more than two species. We hope phylogenetic modeling will help scientists unfold the full potential of bioenergetic modeling in practice by increasing the "species scope" of the models predicting the different compartments of the energy budget.
salmonids, the tree shows the expected separation between genera Salvelinus (Lake trout, S. namaycush, and Brook trout, S. fontinalis) and Oncorhynchus (Rainbow trout, O. mykiss, and Sockeye salmon, O. nerka). The first bi-partition in the acanthopterygian subtree separates order Gadiformes (Haddock, Melanogrammus aeglefinus) from the remaining species (orders Perciformes and Pleuronectiformes). Then follows a second separation between a first group composed of families Mugilidae (represented by the Mullet, Liza macrolepis) and Cichlidae (represented by genus Oreochromis: the Nile tilapia, O. niloticus, and Mozambique tilapia, O. mossambicus), two families that belong to order Perciformes, and a second group composed of the remaining species. As expected from taxonomy, the first group splits according to the families (the Mullet splitting from the two Tilapias). The remaining species split into a first group composed of the representatives of families Centrarchidae (Largemouth bass, Micropterus salmoides), Percidae (Walleye, Sander vitreus), and Kuhliidae (Aholehole, Kuhlia sandvicensis) and a second group composed of the Snook (Centropomus undecimalis, family Centropomidae) and members of order Pleuronectiformes. In the first group, the Aholehole splits from a group composed of the Largemouth bass and the Walleye. In the second group, the Snook splits from the Pleuronectiformes. The Pleuronectiformes subtree separates the Lemon sole (Microstomus kitt) from the remaining species and then the Common dab (Limanda limanda) from the European flounder (Platichthys flesus) and the European plaice (Pleuronectes platessa). While the estimated tree agrees with taxonomy most of the time, it suggests that order Perciformes (and, possibly, suborder Percoidei) is paraphyletic since order Pleuronectiformes is embedded into it.

Fig. 3. Swimming cost as a function of swimming speed and body mass: the observed values correspond to the background color of the markers whereas the values fitted by the phylogenetic model for the range of observed swimming speeds and body masses correspond to the background color of the plotting area.

Fig. 4. Observed and predicted swimming costs. (A) Forced swimming cost predicted by the phylogenetic forced swimming cost model following leave-one-species-out cross-validation trials, (B) Routine swimming costs predicted using the phylogenetic forced swimming cost model. The diagonal line is the 1:1 line on which correct predictions would fall.

Table 1. Studies from which swimming cost data were obtained.