On the use of the coefficient of variation to quantify and compare trait variation

Abstract Meaningful comparison of variation in quantitative traits requires controlling for both the dimension of the varying entity and the dimension of the factor generating variation. Although the coefficient of variation (CV; standard deviation divided by the mean) is often used to measure and compare variation of quantitative traits, it only accounts for the dimension of the former, and its use for comparing variation may sometimes be inappropriate. Here, we discuss the use of the CV to compare measures of evolvability and phenotypic plasticity, two variational properties of quantitative traits. Using a dimensional analysis, we show that, unlike evolvability, phenotypic plasticity cannot be meaningfully compared across traits and environments by mean-scaling trait variation. We further emphasize the need to remain cognizant of the dimensions of the traits and of the relationship between the mean and the standard deviation when comparing CVs, even when the scales on which traits are expressed allow meaningful calculation of the CV.


Impact Summary
Statistical analyses in ecology and evolution often involve the calculation of summary statistics to facilitate interpretation. However, the transformations of the data involved in these calculations are often performed with little attention given to the meaning of the numbers. In some cases, this compromises the meaning of the analyses and undermines the conclusions of the studies. We illustrate this problem by showing how the calculation of the coefficient of variation (CV), a mean-standardized measure of variation regularly used to quantify and compare variation of phenotypic traits, can become meaningless if one does not pay attention to the dimension of the entities measured, the scale on which these entities are measured, and the relationship between the mean and the measure of variation. To minimize these common mistakes, we advocate a stronger emphasis on the meaning of the numbers when teaching quantitative methods.
Advanced statistical models to handle increasingly large and complex datasets are often employed at the expense of attention given to the meaning of the numbers (Tarka et al. 2015). This issue affects several aspects of the scientific process, from the measurement procedures to the interpretation of the statistical analyses, where biological significance is often confounded with statistical significance (Yoccoz 1991; Tarka et al. 2015; Wasserstein and Lazar 2016). Here, we show that even the use of simple statistics such as the coefficient of variation (CV; standard deviation divided by the mean) can become uninformative or worse if attention is not paid to the meaning of the numbers when the CV is used to compare variation among quantitative traits.
Phenotypic plasticity and evolvability are two aspects of the variation of quantitative traits. Phenotypic plasticity corresponds to the variation expressed by a genotype when exposed to different environments (Bradshaw 1965; Schlichting 1986; DeWitt and Scheiner 2004), and evolvability (sensu Houle 1992) is the ability of a trait to respond to selection. Various measurements have been developed to quantify phenotypic variation produced by a given change in the environment or a given strength of selection. These have shown that quantitative traits differ in their sensitivity to environmental variation and in their ability to respond to selection, suggesting that both phenotypic plasticity and evolvability vary across traits, populations, and species (Mousseau and Roff 1987; Falconer 1989; Houle 1992; DeWitt and Scheiner 2004; Valladares et al. 2006, 2014). To unravel the causes of such variation and predict the ability of organisms to adapt, many studies have compared phenotypic plasticity and evolvability across traits, organisms, and populations using different methods for standardizing variation (e.g., Daehler 2003; Davidson et al. 2011; Palacio-López and Gianoli 2011; Matesanz and Ramírez-Valiente 2019, for phenotypic plasticity, and Mousseau and Roff 1987; Houle 1992; Merilä and Sheldon 2000; Hansen et al. 2011, for evolvability). Recently, the CV or related statistics expressing variation in relation to the mean (e.g., $CV^2$) have been used to measure and compare both types of variation across traits (Fajardo and Piper 2011; Roscher et al. 2018; Acasuso-Rivero et al. 2019, for phenotypic plasticity, and Houle 1992; Hansen et al. 2003, 2011, for evolvability).
Here, we show that despite apparent similarities, evolvability and phenotypic plasticity have different properties that prevent the use of CVs for comparing phenotypic plasticity across traits and environments. We then reiterate the cautions already expressed by several authors about the constraints imposed by the calculation of CVs on the scale of the measurement and on the mean-standard deviation relationship, and we show how ignoring these caveats when comparing trait variation may jeopardize the interpretation and the conclusions of such comparisons.

Measuring Evolvability and Phenotypic Plasticity
Following Houle (1992), evolvability can be estimated as the phenotypic change resulting from a given strength of selection, that is, the ratio between the change in the trait mean and the selection gradient: $e = \Delta\bar{z}/\beta$. Phenotypic plasticity, on the other hand, is described by the reaction norm of a trait, that is, the relationship between the phenotype and the environment. Measures of phenotypic plasticity are generally derived from the reaction norm, and in the simplest case (i.e., a linear relationship between the environment and the phenotype) phenotypic plasticity can be measured as the average change in the phenotype per change in the environment, $\delta = (\bar{z}_2 - \bar{z}_1)/(m_2 - m_1)$, where $\bar{z}_1$ and $\bar{z}_2$ are the phenotypic mean values of the trait measured in the environments $m_1$ and $m_2$ (Morrissey and Liefting 2016; Fig. 1). Thus, both evolvability and phenotypic plasticity measure phenotypic changes in relation to their respective triggering factors, namely, selection and environmental variation.
Despite apparent similarities, these two measures have different properties that constrain their use for further comparison. A dimensional analysis of these two quantities illustrates this point (see Schneider 2009, Chapter 6, for an introduction to dimensional analysis). The dimension of evolvability corresponds to the dimension of the trait z divided by the dimension of the selection gradient β, $[e] = [z]/[\beta]$, where the symbols between brackets indicate the dimensions of the parameters. Because the selection gradient is the slope of the regression of the relative fitness w on the trait z, that is, $\beta = \mathrm{cov}(w, z)/\sigma(z)^2$, the dimension of e is $[e] = [z] \times [\sigma(z)^2]/[\mathrm{cov}(w, z)] = [z] \times [z]^2/([w][z]) = [z]^2/[w]$. Thus, evolvability has the dimension of the trait z squared divided by the dimension of relative fitness, w. Because relative fitness w is defined as the fitness divided by the mean fitness, it is a dimensionless number, and evolvability simply has the dimension of the trait squared, $[z]^2$. This agrees with the Lande equation (Lande 1979), $\Delta\bar{z} = V_a\beta$, where evolvability, defined as the ratio between the response to selection $\Delta\bar{z}$ and the selection gradient β, equals $V_a$, the additive genetic variance, which has the dimension of the trait squared.
For phenotypic plasticity, a similar dimensional analysis shows that δ has the dimension of the trait z divided by the dimension of the environmental variable m: $[\delta] = [z]/[m]$. Thus, the dimension of phenotypic plasticity is more complex than the dimension of evolvability because it depends on both the dimension of the trait and the dimension of the environmental gradient across which phenotypic plasticity is measured (Forsman 2015).
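The dimensional bookkeeping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dim` class is a hypothetical helper introduced here (not part of the original analysis), and dimensions are stored simply as integer exponents of the base symbols `z` (trait) and `m` (environmental variable), with relative fitness treated as dimensionless.

```python
class Dim:
    """A dimension stored as exponents of base symbols (hypothetical helper)."""

    def __init__(self, **exponents):
        # Drop zero exponents so that equal dimensions compare equal.
        self.exp = {k: v for k, v in exponents.items() if v != 0}

    def __mul__(self, other):
        keys = set(self.exp) | set(other.exp)
        return Dim(**{k: self.exp.get(k, 0) + other.exp.get(k, 0) for k in keys})

    def __truediv__(self, other):
        keys = set(self.exp) | set(other.exp)
        return Dim(**{k: self.exp.get(k, 0) - other.exp.get(k, 0) for k in keys})

    def __eq__(self, other):
        return self.exp == other.exp

    def __repr__(self):
        text = " ".join(f"[{k}]^{v}" for k, v in sorted(self.exp.items()))
        return text or "dimensionless"


Z = Dim(z=1)  # trait
W = Dim()     # relative fitness (fitness / mean fitness), dimensionless
M = Dim(m=1)  # environmental variable

beta = (W * Z) / (Z * Z)  # selection gradient: cov(w, z) / var(z)
evolvability = Z / beta   # e = change in trait mean / beta
plasticity = Z / M        # delta = change in trait mean / change in environment

print(evolvability)  # [z]^2
print(plasticity)    # [m]^-1 [z]^1
```

The sketch makes the asymmetry explicit: the dimension of evolvability involves the trait only, whereas the dimension of plasticity retains the environmental variable.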

Using Mean-Standardization to Compare Evolvability or Phenotypic Plasticity
To compare variation among traits with different means and dimensions, one can express variation proportionally to the traits' mean by dividing the measure of variation by the trait mean. This is the case when calculating CVs or squared coefficients of variation ($CV^2 = \sigma(z)^2/\bar{z}^2$; see Pélabon et al. 2011 for a discussion of the advantage of $CV^2$). Dividing the standard deviation, which has the same dimension as the trait, by the trait mean provides a dimensionless number that expresses variation as a proportion of the mean, or as a percentage of the mean when multiplied by 100. Houle (1992) suggested that evolvability can be expressed proportionally to the trait mean if measured as the coefficient of additive genetic variation $CV_a = \sigma_a/\bar{z}$, where $\sigma_a$ is the square root of $V_a$, the additive genetic variance. Hansen et al. (2003, 2011) further showed that measuring evolvability as the squared coefficient of additive genetic variation ($I_A = V_a/\bar{z}^2$) facilitates interpretation by making evolvability a proportional change in trait mean when the trait experiences a selection gradient of 1, that is, a selection as strong as selection on fitness itself. Using $CV_a$ or $I_A$ to compare evolvability of different traits is valid because it provides a dimensionless number comparable across traits. Considering $I_A$, its dimension is $[I_A] = [V_a]/[\bar{z}]^2 = [z]^2/[z]^2$, which is dimensionless. Notice that $I_A$ represents an elasticity, that is, a proportional change in the trait per proportional change in fitness (van Tienderen 2000; Caswell 2001; Hansen et al. 2003, 2011).
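As a small numerical illustration of why mean-scaling yields comparable numbers for evolvability, the following Python sketch computes the CV and the mean-scaled evolvability for the same trait expressed in two units. The height values and the additive genetic variance are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def cv(xs):
    """Coefficient of variation: standard deviation divided by the mean."""
    return math.sqrt(variance(xs)) / mean(xs)


def mean_scaled_evolvability(additive_variance, trait_mean):
    """I_A = V_a / mean^2."""
    return additive_variance / trait_mean ** 2


heights_cm = [10.2, 11.5, 9.8, 12.1, 10.9]  # made-up values
heights_mm = [10 * h for h in heights_cm]   # same plants, other unit

v_a_cm2 = 0.5            # hypothetical additive genetic variance in cm^2
v_a_mm2 = v_a_cm2 * 100  # the same variance expressed in mm^2

# Both pairs agree up to floating-point error: the unit cancels out.
print(cv(heights_cm), cv(heights_mm))
print(mean_scaled_evolvability(v_a_cm2, mean(heights_cm)),
      mean_scaled_evolvability(v_a_mm2, mean(heights_mm)))
```

Because the unit of the trait appears in both the numerator and the denominator, changing it leaves the CV and $I_A$ unchanged.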
In contrast, dividing plasticity δ by the trait mean does not provide a dimensionless measure of variation equivalent to a CV: $[\delta/\bar{z}] = [z]/([m][z]) = 1/[m]$. Thus, dividing a measure of plasticity by the trait mean provides a measure of trait variation proportional to the trait mean per unit change of the environmental factor. Because this quantity is not dimensionless, it cannot be compared meaningfully when plasticity is measured across different environmental gradients (Fig. 1).
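A short numerical sketch may help to see the consequence. In the Python example below, all numbers are invented, the function name is hypothetical, and the trait mean is assumed to be the average of the two environment-specific means. The same reaction norm is expressed with the environmental variable in millilitres and in litres, and with the trait in centimetres and in millimetres.

```python
def mean_scaled_plasticity(z1, z2, m1, m2):
    """Slope of a two-point reaction norm divided by the trait mean.

    Assumes the trait mean is the average of the two environment means.
    """
    delta = (z2 - z1) / (m2 - m1)
    trait_mean = (z1 + z2) / 2
    return delta / trait_mean


# Plant height (cm) under two watering levels (mL per day); made-up values.
per_ml = mean_scaled_plasticity(20.0, 30.0, 100.0, 300.0)

# Same experiment, watering level expressed in litres per day.
per_litre = mean_scaled_plasticity(20.0, 30.0, 0.1, 0.3)

# Same experiment, plant height expressed in millimetres.
per_ml_mm = mean_scaled_plasticity(200.0, 300.0, 100.0, 300.0)

print(per_ml, per_litre, per_ml_mm)
```

Changing the unit of the trait leaves the value unchanged, but changing the unit of the environmental variable rescales it by a factor of 1000, which is what a remaining dimension of $1/[m]$ implies.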
Several studies comparing phenotypic plasticity have acknowledged this issue. For example, studies comparing phenotypic plasticity between native and invasive species have used pairwise comparisons of the CV only when plasticity of the native and invasive species was measured on the same traits across identical environmental gradients, thus avoiding comparing variation generated by different environmental factors (within experiment comparison in Fig. 1; Daehler 2003;Davidson et al. 2011;Palacio-López and Gianoli 2011). In contrast, comparing mean standardized phenotypic plasticity of traits measured along different environmental gradients (among experiment comparison in Fig. 1; e.g., Murren et al. 2014;Acasuso-Rivero et al. 2019) is meaningless.
In theory, mean-standardization of both trait variation and environmental variation would allow expressing phenotypic plasticity as an elasticity (i.e., a proportional change in the trait for a proportional change of the environmental factor), thus offering the possibility of comparing phenotypic plasticity across traits and environments. Such an approach was used by Wellstein et al. (2013) to test the relationship between intraspecific variation in plant traits and the variation of environmental parameters such as light, soil moisture, temperature, pH, and soil nutrients. Unfortunately, environmental gradients along which phenotypic plasticity is often estimated (e.g., temperature, latitude, presence-absence of predators, and food availability) are often expressed on ordinal, nominal, or interval scales that do not allow meaningful calculation of the CV (Box 1). Because CVs are meaningful only for variables expressed on ratio or log-interval scales (Lewontin 1966; Yablokov 1974; Hansen et al. 2011; Houle et al. 2011), the use of elasticity to compare phenotypic plasticity among traits and environments is most likely restricted to very specific cases.
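The restriction to ratio scales can be illustrated with temperature, an interval-scale variable when expressed in degrees Celsius. The Python sketch below uses made-up values and a hypothetical `cv` helper; it contrasts the CV of the same temperatures in Celsius and in kelvin with the CV of masses in grams and kilograms.

```python
import statistics


def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(values) / statistics.mean(values)


temps_celsius = [10.0, 15.0, 20.0, 25.0, 30.0]       # made-up gradient
temps_kelvin = [t + 273.15 for t in temps_celsius]   # same temperatures

masses_g = [5.0, 10.0, 20.0, 22.0]                   # ratio-scale variable
masses_kg = [m / 1000 for m in masses_g]

# Shifting the arbitrary zero point changes the CV by more than tenfold.
print(cv(temps_celsius), cv(temps_kelvin))

# Rescaling a ratio-scale variable leaves the CV unchanged.
print(cv(masses_g), cv(masses_kg))
```

The standard deviation of the temperatures is the same in both units, but the mean depends on where the zero is placed, so the "proportional" variation has no fixed meaning.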
Alternatively, one could divide the change in the environmental variable by its standard deviation. Combined with the mean-standardization of the change in the trait, this provides a measure of phenotypic plasticity where a proportional change in the trait is generated by a change in environmental factor of one standard deviation. Assuming that the variation of the different environmental factors has been measured in the natural environment, and that this variation is symmetrically distributed around the mean, such a measure of plasticity would allow meaningful comparison of the phenotypic variation among traits and among environments, based on the relative variation of the environmental factors. However, comparing such measures would be meaningless for phenotypic variation estimated in experiments where the magnitude of the variation of the environmental factor is fixed by the experimental design and generally chosen to generate detectable changes in the phenotypic traits.

Further Caveats While Using CVs to Compare Trait Variation
The CV expresses variation of an entity on a proportional scale that is easily interpretable when comparing variation among entities. If this remains the only goal for computing CVs, the only restriction for this computation concerns the scale on which entities are measured (Table 1, Box 1). However, interpreting differences among CVs may be seriously compromised unless the mathematical properties of the CV and the constraints imposed by the calculation of the CV on the properties of the trait distribution are considered (e.g., Lewontin 1966;Yablokov 1974;Lande 1977;Houle 1992;Gingerich 1993;Garcia-Gonzalez et al. 2012).
For many traits (e.g., mass, metabolic rate, and length measurements), the standard deviation increases with the mean, and it is often assumed that the CV provides a measure of variation independent of the mean. This is true, however, only when the increase in the standard deviation is proportional to the increase in the mean (i.e., a power of 2 in the Taylor power law between the variance and the mean; $\sigma^2 = a\mu^2$, Taylor 1961). Unfortunately, as noticed by Van Valen (2005) about this proportionality: "This is so often true that we may tend to forget that there are cases where it is not." For example, nonproportionality between the standard deviation and the mean is revealed by the negative relationship often observed between CV and trait mean of linear measurements of morphological traits (Bader and Hall 1960; Yablokov 1974; Soulé 1982; Pengilly 1984) or by the positive relationship observed between the CV and the mean body mass in mammals and birds (Hallgrímsson and Maiorana 2000). Although a negative relationship between trait mean and CV can result from the effect of size-independent measurement error (and can therefore be accounted for; Lande 1977; Rohlf et al. 1983), other factors may generate such a nonproportionality (see below). Yet, in many studies, differences in CV have been interpreted as resulting from biological/ecological differences (e.g., differences in the fitness-trait relationship or differences in the intensity of competition) without testing the proportionality between the standard deviation and the mean, that is, without testing whether the CV truly provides a mean-independent measure of variation. As noticed by Einum et al. (2012), the problem is even deeper because we generally do not have a null hypothesis concerning the relationship between the mean and the standard deviation, that is, we do not know what such a relationship would be in the absence of external (i.e., ecological) factors affecting variation.
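One simple way to examine the proportionality discussed above is to estimate the exponent of the Taylor power law from a set of trait means and variances. The Python sketch below is a minimal illustration under stated assumptions: it uses an ordinary least-squares regression on log-transformed values (one possible choice, which ignores sampling error in the means and variances), the function name is hypothetical, and the input values are constructed rather than observed.

```python
import math


def taylor_exponent(means, variances):
    """OLS slope b of log(variance) on log(mean), for variance = a * mean**b."""
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


means = [1.0, 2.0, 4.0, 8.0]

# Standard deviation proportional to the mean: exponent 2, constant CV.
proportional = [0.04 * m ** 2 for m in means]

# Variance equal to the mean: exponent 1, CV decreasing with the mean.
variance_equals_mean = [m for m in means]

print(taylor_exponent(means, proportional))
print(taylor_exponent(means, variance_equals_mean))
```

An estimated exponent close to 2 is consistent with a CV that is independent of the mean; an exponent below or above 2 indicates that the CV is expected to decrease or increase with the mean, respectively.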
The nonproportionality between the mean and the standard deviation is not problematic if one's goal is to quantify or predict variation. For example, if two traits have different evolvabilities ($I_A$), it means that the trait with the higher evolvability will evolve proportionally more than the trait with the lower evolvability when exposed to selection of similar strength, whether or not the mean and the standard deviation change proportionally among traits. However, further interpretation of such a difference in evolvability should consider the possibility that this difference results from a nonproportional relationship between the mean and the standard deviation. Understanding the causes for such nonproportionality may become critical for interpreting differences in variation among quantitative traits. Below, we present some of the most common causes for nonproportionality between the mean and the standard deviation and we discuss the consequences of these when comparing variation.
Lande (1977) showed that CVs of objects measured as length, area, or volume are expected to differ according to the number of dimensions of the measurement (length = 1, area = 2, and volume = 3) and the correlations between these dimensions. Thus, for objects of constant shape, that is, with a correlation of one between the different linear measurements (length, height, and width), the CV of a volume (i.e., length$^3$) will be three times the CV of the length, whereas the CV of an area (length$^2$) will be twice the CV of the length. Consequently, we expect mass measurements to have larger CVs than area or length measurements. If the objects vary in shape and size, these factors (2 and 3) are expected to be upper limits of the multiplicative difference in CVs between objects.
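The factors of 2 and 3 for objects of constant shape can be checked with a small simulation. The Python sketch below is only an illustration: it assumes normally distributed lengths with a small CV (0.05), a fixed random seed, and areas and volumes computed as the square and the cube of the length; none of these choices come from the studies cited above.

```python
import math
import random


def cv(xs):
    """Coefficient of variation: sample standard deviation over the mean."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return math.sqrt(var) / m


rng = random.Random(1)  # fixed seed so the run is repeatable

# Constant shape: area and volume are fully determined by length.
lengths = [rng.gauss(10.0, 0.5) for _ in range(100_000)]
areas = [x ** 2 for x in lengths]
volumes = [x ** 3 for x in lengths]

ratio_area = cv(areas) / cv(lengths)
ratio_volume = cv(volumes) / cv(lengths)

# Both ratios should be close to 2 and 3, respectively.
print(round(ratio_area, 2), round(ratio_volume, 2))
```

The approximation holds best when the CV of the length is small; with larger variation, or when shape varies among objects, the ratios depart from these values.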
Additionally, for complex traits composed of multiple parts that covary, the increase of the standard deviation with the mean depends on the sign of the covariance: a positive covariance (i.e., coordinated variation) leads to a steeper increase of the variation (Taylor power > 2), whereas a negative covariance (compensatory variation) leads to a shallower increase (Taylor power < 2; Mitteroecker et al. 2020). When computing the phenotypic CV for length and mass measurements of the data gathered by Hansen et al. (2011), we found that the average CV for mass measurements was more than an order of magnitude larger than for length measurements (average ± SE of $CV_{mass}$ = 3.15 ± 1.11, n = 38; $CV_{length}$ = 0.16 ± 0.03, n = 203; SE obtained by nonparametric bootstrapping), thus suggesting that several factors such as dimensions, correlations among dimensions, and complexity of the traits can simultaneously affect the value of the CVs.
Nonproportionality between the mean and the standard deviation may also result from traits described by statistical distributions that differ from normal or log-normal distributions. Indeed, for some distributions, we explicitly expect nonproportionality between the mean and the standard deviation. In Figure 2, we present such an example with the variation in clutch size among 32 bird species. Because clutch sizes in birds and litter sizes in mammals do not follow a normal or log-normal distribution, the mean and the standard deviation are not expected to vary proportionally. Accordingly, the CV in clutch size decreases with an increasing mean clutch size (Fig. 2; the same problem applies to meristic traits). Therefore, if clutch sizes have on average lower CVs in species with larger clutches compared to species with smaller clutches, one should be cautious when interpreting this difference.
Of course, the nonproportionality between the mean and the standard deviation and the resulting difference in CVs between small and large clutches may reflect true biological differences in the variability of clutch size when expressed on a proportional scale (it may be easier to double a clutch of one egg than a clutch of six eggs), but further interpretation of the differences in the CVs should account for this effect before considering other factors such as the trait-fitness relationship or the effect of environmental variation. For traits following binomial distributions, such as the probability to survive or reproduce, the specific relationship between the mean and the variance (maximum variance for P = 0.5 and zero for P = 0 or 1) generates CVs approaching infinity or zero for small and large values of P, respectively. Although a standardization of such CVs has been suggested (i.e., division by the maximum possible CV; Morris and Doak 2004), these relativized CVs are not comparable to CVs estimated for traits with normal or log-normal distribution (Hilde et al. 2020).
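The behaviour of the CV for a binary outcome follows directly from the mean P and the variance P(1 - P) of a single 0/1 trial, which give a CV of $\sqrt{(1 - P)/P}$. The Python sketch below evaluates this expression for a few probabilities; the function name is hypothetical and the probabilities are arbitrary illustrative values.

```python
import math


def bernoulli_cv(p):
    """CV of a 0/1 outcome with success probability p (0 < p <= 1).

    The mean is p and the variance is p * (1 - p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.sqrt(p * (1 - p)) / p


# The CV grows without bound as p approaches 0 and shrinks to 0 as p approaches 1.
for p in (0.01, 0.5, 0.99):
    print(p, round(bernoulli_cv(p), 3))
```

The CV therefore changes with the mean by construction, independently of any biological cause, which is why differences in CVs among such traits call for caution.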

Box 1: Scale types, permissible transformations, and meaningful CV
When performing measurements, we assign numbers to entities so that the relationship among numbers reflects an empirical relationship of interest among entities. Scales are imposed by these empirical relationships, and the different types of scale are defined by the possible transformations of the numbers that preserve the empirical relationship, the so-called permissible transformations (Stevens 1968; Hand 2004). For example, in an interval scale, permissible transformations should preserve the order of the numbers and the relative size of the intervals between numbers; thus, only positive linear transformations (multiplication by a positive constant and addition of a constant) are permissible. In the ratio scale, permissible transformations should preserve the order of the numbers as well as the order of the differences and ratios between entities. On this scale, only multiplication by a positive constant is permissible. For example, if four individuals have masses of a = 5, b = 10, c = 20, and d = 22 g, respectively, multiplying each number by 2 preserves the order of the differences ($b - a > d - c$ and $2b - 2a > 2d - 2c$), as well as the order of the ratios ($b/a > d/c$ and $2b/2a > 2d/2c$). However, raising the numbers to a power of 2 does not preserve the order of the differences ($b - a > d - c$ but $b^2 - a^2 < d^2 - c^2$). In the log-interval scale, only the order between the ratios should be preserved by transformation, and the power transformation becomes permissible ($b/a > d/c$ and $b^2/a^2 > d^2/c^2$). It is sometimes possible and meaningful to convert interval scale measurements into ratio scale measurements. For example, converting birth year to years since birth (i.e., age) allows meaningful comparison of the new values by taking their ratio (I am now three times older than my daughter), whereas the ratio of our birth years is meaningless.
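The numerical example above can be verified directly. The short Python sketch below uses the four masses given in the text; the helper names are hypothetical and only serve to check which orderings survive each transformation.

```python
a, b, c, d = 5, 10, 20, 22  # masses in g, as in the example above


def diff_order(f):
    """True if f(b) - f(a) > f(d) - f(c), as for the untransformed values."""
    return f(b) - f(a) > f(d) - f(c)


def ratio_order(f):
    """True if f(b) / f(a) > f(d) / f(c), as for the untransformed values."""
    return f(b) / f(a) > f(d) / f(c)


transformations = [
    ("identity", lambda x: x),
    ("times 2", lambda x: 2 * x),
    ("squared", lambda x: x ** 2),
]

for name, f in transformations:
    print(name, diff_order(f), ratio_order(f))
# identity True True
# times 2 True True
# squared False True
```

Multiplication by a constant preserves both orderings, whereas squaring preserves only the order of the ratios, which is why the power transformation is permissible on the log-interval scale but not on the ratio scale.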
Meaningful CVs can only be calculated for a restricted number of scales. For any scale where the zero point is not defined (nominal scale and ordinal scale) or is arbitrarily chosen (interval scale), it is not meaningful to calculate a CV and talk about proportional changes. Similarly, the calculation of the CV may be compromised for any scale where the mean can be equal to 0 (signed-ratio scale or difference scale; in the difference scale, the zero point corresponds to ln(1)). Notice that a clearly defined zero point does not necessarily mean that 0 has a clear biological meaning. For example, if we use grams or centimeters to measure the size of some individual organisms, these two measurements have a clearly defined 0, but we do not expect to observe individuals of 0 g or 0 cm. Finally, for absolute scales such as probability, the calculation and the interpretation of the CV may be strongly affected by the distribution of the data and the mean-standard deviation relationship (see main text). Table 1 summarizes the different scales, their permissible transformations, and whether the calculation of CVs is meaningful.
Still, in most studies that have used CVs to compare variation among traits, authors have explained the observed differences without testing the proportionality between the mean and the standard deviation and without accounting for the possible effect of among-trait differences in dimensionality on the value of the CV (e.g., Blanck and Lamouroux 2007;Greenway and Harder 2007;Fajardo and Piper 2011;Roscher et al. 2018;Acasuso-Rivero et al. 2019). At best, such practice introduces variation in the CVs that decreases the statistical power of any comparisons between groups of traits (e.g., Acasuso-Rivero et al. 2019). In other cases, this may lead to potentially erroneous conclusions. For example, in the study by Roscher et al. (2018), differences in CVs between traits related to gas exchange and growth on the one hand and traits related with leaf morphology, anatomy, and photochemistry on the other hand were interpreted as due to differences in the "level of organization," whereas the two groups of traits markedly differed in their dimensions and statistical distribution (see Ramírez-Valiente et al. 2018 for a similar issue). Although the conclusions of these studies may turn out to be correct, they remain questionable as long as the possible effects of a nonproportional increase of the standard deviation with the mean on the CVs have not been considered.

Conclusions
The problems exposed here are common in the literature in ecology and evolution, where using the CV as a dimensionless measure of variation is widespread. Many studies have calculated CVs from variables on a signed-ratio scale (i.e., variables taking both positive and negative values; e.g., stigmatic exertion, Larrinaga et al. 2009; style deflexion, Dai et al. 2016) or an interval scale (temperature, Sammarco et al. 2006; reflectance spectrum, chroma, or hue, Mennill et al. 2003; Ibáñez et al. 2013; Jacobs et al. 2015; Charmantier et al. 2017), and many studies have compared CVs between traits with different dimensions or statistical distributions. Notice that variance-standardization (e.g., Z-transformation, heritability, and selection intensity) is often subject to similar shortcomings when it comes to comparing variation (Hereford et al. 2004; Hansen et al. 2011; Houle et al. 2011; Matsumura et al. 2012). More generally, standardization and transformation of data are routinely performed before data analyses without paying attention to the consequences of these manipulations on the meaning of the numbers. In many cases, this renders the conclusions of the studies questionable. We believe that dimensional analyses such as the one performed here (see also Schneider 2009) and a better awareness of the different types of measurement scale should become standard tools (i.e., taught along with statistical analyses in quantitative biology classes) to