There is more than one way to skin a G matrix


Derek A. Roff, Department of Biology, University of California, Riverside, CA 92521, USA.
Tel.: +915 827 2437; fax: +951 827 4286; e-mail:


Because of its importance in directing evolutionary trajectories, there has been considerable interest in comparing variation among genetic variance–covariance (G) matrices. Numerous statistical approaches have been suggested but no general analysis of the relationship among these methods has previously been published. In this study, we used data from a half-sib experiment and simulations to explore the results of applying eight tests (T method, modified Mantel test, Bartlett’s test, Flury hierarchy, jackknife-manova, jackknife-eigenvalue test, random skewers, selection skewers). Whereas a randomization approach produced acceptable estimates, those from a bootstrap were typically unacceptable and we recommend randomization as the preferred method. All methods except the jackknife-eigenvalue test gave similar results although a fine-scale analysis suggested that the former group can be subdivided into two or possibly three groups, hierarchical tests, skewers and the rest (jackknife-manova, modified Mantel, T method, probably Bartlett’s). An advantage of the jackknife methods is that they permit tests of association with other factors, such as in this case, temperature and sex. We recommend applying all the tests described in this article, with the exception of the T method, and provide R functions for this purpose.


Evolutionary change is governed by two factors: the intensity of selection and the amount of genetic variation within and among traits. Information on the latter is contained within the genetic covariance matrix, generally designated simply as the G matrix (Lande, 1979; Arnold et al., 2008). The multivariate response to selection is then specified by the multivariate equivalent of the breeder’s equation, $\Delta\bar{z} = G\beta$, where $\Delta\bar{z}$ is the vector of mean trait responses and β is the vector of selection gradients. As the mean trait values change under selection, so too will the G matrix, its orientation tending to shift in the direction of selection (Jones et al., 2003, 2004; Guillaume & Whitlock, 2007; Revell, 2007). Genetic drift may also play a role in changing the G matrix, but in this case the change will be random, though on average producing a proportional change in the constituent variances and covariances (Lande, 1979).
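As a small numerical illustration of the multivariate breeder’s equation, the sketch below multiplies a G matrix by a vector of selection gradients. The two-trait matrix and the gradient values are hypothetical and chosen only to show that selection on one trait produces a correlated response in the other.

```python
def response_to_selection(G, beta):
    """Predicted change in trait means: each row of G times beta."""
    return [sum(g_ij * b_j for g_ij, b_j in zip(row, beta)) for row in G]

# Hypothetical two-trait G matrix: variances on the diagonal,
# genetic covariance off the diagonal.
G = [[1.0, 0.5],
     [0.5, 2.0]]
beta = [1.0, 0.0]  # direct selection on trait 1 only

delta_z = response_to_selection(G, beta)
print(delta_z)  # [1.0, 0.5]
```

Trait 2 is not under direct selection here, yet its mean is predicted to change because of the nonzero genetic covariance.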

The important role that the G matrix plays in evolutionary divergence has led to considerable interest in comparing G matrices among different populations of the same species and among different conditions such as different environments. However, matrices can differ in many different ways, and this has led to the development of multiple methods of comparing matrices (Steppan et al., 2002). No single test can capture all the different ways in which two or more matrices may vary (Houle et al., 2002). However, it is not clear to what extent different tests capture different elements of matrix variation. As a consequence, application of a single test may provide insufficient information on modes of matrix disparities. In this article, we use both simulation and an extensive data set from a half-sib experiment to compare various suggested methods of comparing matrices. Additionally, we introduce a new test, the jackknife-eigenvalue test that specifically tests for variation among the equivalent eigenvalues of two or more matrices.

The empirical data set we use in this study consists of a half-sib experiment using the mealworm beetle, Tenebrio molitor, in which 20 sires were mated to three dams and their offspring raised under three temperatures. Sexes were considered separately providing six sets of G matrices. The primary focus of this experiment was to investigate the evolution of cuticular melanization, a trait that has significant effects on fitness. To examine the potential effects on fitness, we measured the degree of melanization, encapsulation response, development time and body size. This set of traits is of particular interest as it includes physiological, morphological and life history traits and a complex set of interactions among the components. An essential requirement of a statistical test is that it correctly predicts the type 1 error rate. As this has not been performed for most of the methods of matrix comparison, we ran simulations in which two genetic covariance matrices generated from the same population were compared. The simulated genetic and phenotypic covariance matrices were generated using the averaged values from the observed six combinations.

Half-sib experiment

The beetles used in this study were collected from a laboratory stock population established originally from approximately 500 individuals gathered from the countryside in Latvia. Altogether 80 mealworm beetles (60 females and 20 males) were collected as pupae from the stock population and grown separately in plastic canisters into adults. All males were mated with three different females (A, B and C), and the eggs were allowed to develop at 28 °C.

Larvae of each family were raised individually and divided randomly into three groups of the same size. Groups were grown in temperatures of 18, 23 and 28 °C under a 14 : 10 light/dark period. At 10–14 days of adult age, the beetles were weighed to the nearest 0.1 mg. Encapsulation ability and degree of melanization were then assessed.

The encapsulation ability of an adult beetle was measured with a nylon monofilament. The beetle’s immune system was allowed to react for three hours while it was kept in an individual canister at its rearing temperature. The implant was then removed and frozen at −20 °C until photographed. The average light absorbance value for an implant was measured with the ImageJ program. Average encapsulation response was calculated as the average of the values from the two different pictures. The standardized encapsulation value was obtained by subtracting the average encapsulation value from the light absorbance value of a clear implant. The larger the standardized encapsulation value, the more efficient the beetle’s encapsulation response. The repeatability of this method is very high (Rantala et al., 2002).

The melanization of the cuticle was measured as the light reflectance value of the beetle’s thorax with the ImageJ program. The acquired values were transformed by reversing the sign so that the values would represent the biological scale: the higher the cuticle value, the darker the beetle’s cuticle.

On average, 10 (SE 0.2) individuals were measured from each family (n = 59) and temperature. A total of 1655 offspring were measured giving an average of 276 (SE 8.2) animals per sex/temperature combination.

Statistical methods

The simulated genetic and phenotypic covariance matrices were generated using the averaged values from the observed six combinations (Table 1). For the simulation, we used 20 sires each mated to three dams, each of which produced five offspring. The simulation used an individual-based variance-components approach (Roff, 2010). We generated two data sets, designated as populations A and B, using this model and then used the given test to compare these two sets. In most cases, probability was estimated using randomization or bootstrapping. The randomization protocol was as follows: after calculation of the observed value, Xobs, families were randomly assigned to population A or B and a new value of X, say Xr, was calculated. The randomization procedure was repeated R times and the null hypothesis tested by computing $P = (n + 1)/(R + 1)$, where n is the number of cases in which Xr ≥ Xobs or Xr ≤ Xobs, depending on the test (the extra 1 is required because the original observed value must be included). A similar procedure was used in the case of bootstrapping except that the two bootstrapped populations were generated by drawing families with replacement from each population (i.e. the bootstrapped population A consisted of families drawn with replacement from population A). An important consideration in a randomization or bootstrap test is the level at which the randomization or bootstrapping is performed. In general, the appropriate level is the set that includes all related individuals: thus, in the present case we randomized sire families between populations (for convenience we shall refer to the sex/temperature combinations as populations).
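The randomization protocol can be sketched as follows. This is a minimal illustration rather than the authors’ R implementation: the function names, the toy statistic (absolute difference between grand means) and the toy data are all assumptions. The essential points are that whole sire families are reassigned between populations and that the observed value is counted in both numerator and denominator.

```python
import random

def randomization_p(families_a, families_b, statistic, n_rand=500, seed=1):
    """Upper-tail randomization probability, P = (n + 1) / (R + 1).

    families_a, families_b: lists of sire families; each family stays intact.
    statistic: function of the two lists that returns a number X.
    """
    rng = random.Random(seed)
    x_obs = statistic(families_a, families_b)
    pooled = list(families_a) + list(families_b)
    n_a = len(families_a)
    n_extreme = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)  # reassign whole families to A or B
        x_r = statistic(pooled[:n_a], pooled[n_a:])
        if x_r >= x_obs:
            n_extreme += 1
    return (n_extreme + 1) / (n_rand + 1)

def mean_difference(fam_a, fam_b):
    """Toy statistic: absolute difference between the two grand means."""
    def grand_mean(fams):
        values = [v for fam in fams for v in fam]
        return sum(values) / len(values)
    return abs(grand_mean(fam_a) - grand_mean(fam_b))

# Hypothetical data: 10 families of two offspring in each population.
pop_a = [[10.0 + i, 11.0 + i] for i in range(10)]
pop_b = [[20.0 + i, 21.0 + i] for i in range(10)]
p_value = randomization_p(pop_a, pop_b, mean_difference)
print(p_value)
```

For a lower-tail test (Xr ≤ Xobs), only the comparison inside the loop changes.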

Table 1.   Matrix of heritabilities (diagonal), genetic (above diagonal) and phenotypic (below diagonal) correlations used to test the utility of matrix comparison methods for the experimental data sets analysed.

For the simulations, we used 500 randomizations or bootstraps, which was sufficient to generate the distribution of probabilities. In tests of the experimental data, to ensure an accurate estimate, this number was increased to 5000 if the estimated probability based on 500 replicates was < 0.1.

We ran 100 replicates and for each replicate stored the estimated probability, obtained as described above, from the particular matrix comparison test. Under the null hypothesis, the cumulative distribution of probabilities should be in a 1 : 1 relationship to the estimated probabilities. For visual comparison, the set of estimated probabilities was ranked lowest to highest and compared with the cumulative distribution. The hypothesis was tested using a single-sample Kolmogorov–Smirnov test assuming a uniform distribution with a minimum of 0 and a maximum of 1. The purpose of this metric is to distinguish tests that are clearly unacceptable from those that are acceptable. Given enough replicate runs, the test could detect a significant difference; however, such a difference is likely to be of minor importance in the estimation of levels of significance.
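The check of the type 1 error rate can be illustrated with a one-sample Kolmogorov–Smirnov test against a uniform distribution on (0, 1). The sketch below is an assumption-laden stand-in for whatever routine was actually used: it computes the statistic D directly and approximates the probability with the asymptotic Kolmogorov series, including a commonly used small-sample adjustment to the scaling of D.

```python
import math

def ks_uniform(p_values):
    """One-sample Kolmogorov-Smirnov test against Uniform(0, 1).

    Returns (D, p), where p is an asymptotic approximation.
    """
    x = sorted(p_values)
    n = len(x)
    # Largest gap between the empirical CDF and the uniform CDF F(v) = v.
    d = max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(x))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0  # the series is numerically indistinguishable from 1 here
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(max(2.0 * series, 0.0), 1.0)

# Evenly spread probabilities, as expected under a well-behaved test.
even = [(i + 0.5) / 100 for i in range(100)]
d_even, p_even = ks_uniform(even)

# Probabilities piled up near zero, as for a test that rejects far too often.
skewed = [i / 1000 for i in range(1, 101)]
d_skewed, p_skewed = ks_uniform(skewed)
print(d_even, p_even, d_skewed, p_skewed)
```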

Below we describe the different tests examined (summaries in Fig. 1 and Table 2). The G matrices were estimated by manova (Bulmer, 1985). All tests were written as R functions and are available in Dryad, doi:10.5061/dryad.kb27f3t1. Copies may also be obtained directly from DAR. In addition to a half-sib design, programs are also given for the following designs: full-sib, offspring on parent, clonal and phenotypic. Matrix comparisons are largely restricted to these ‘standard’ methods because in cases with complex pedigrees, there is difficulty in deciding the basic sampling unit for randomization or jackknifing. There are also problems if not all traits have been measured on all individuals: our approach has been to eliminate any individuals for which full data were not available. With data sets in which there are large gaps, it would be worthwhile, after testing the restricted data set, to do tests involving a reduced number of traits that maximize the data used.

Figure 1.   Pictorial representation of the tests examined. Except where stated, the graphs display the null hypothesis. For simplicity, two populations and two traits are represented. T method: This analysis tests the null hypothesis that the variances and covariances (two variances and one covariance for a 2 × 2 matrix) are equal. Modified Mantel test: This tests the hypothesis that the correlation between the (co)variances is +1. Bartlett’s test: This tests the hypothesis that the determinants of the matrices are the same, that is, the areas enclosed by, say, the 95% ellipses are the same. Rank test: This tests the hypothesis that both matrices have the same rank, which could be 2 (left panel) or 1 (right panel) in the present case. For reasons given in the text, we use the jackknife-eigenvalue test for this purpose. Hierarchical analysis: This moves through four possibilities: equal matrices, proportional matrices, common principal components or unrelated structure. Jackknife-manova: This tests the hypothesis that the matrices are equal. Explanatory variables other than or including population can be tested. Jackknife-eigenvalue test: This tests the hypotheses that the ith eigenvalue (= variance along the ith axis) or summed value (= total variance) is equal to zero or that it is equal to that of another population. In the case shown, the populations have the same variance in the direction of the major axis but differ along the minor axis. As with the previous test, other explanatory variables may be tested. Random or Selection Skewers: This tests whether the response is the same in the two populations.

Table 2.   Null hypotheses of tests examined.

| Method | Null hypotheses |
| --- | --- |
| T method | Equality of matrix elements |
| Modified Mantel test | Correlation of matrix elements = 1 (i.e. matrix elements from one population proportional to those of the other population) |
| Bartlett’s test | Equality of size (‘volume’) |
| Hierarchical analyses | Equality of matrix elements; proportionality of matrix elements; equality of principal components; unrelated structure |
| Jackknife-manova test | Equality of matrix elements; variation not associated with predictor variables |
| Jackknife-eigenvalue test | Eigenvalue ≤ 0; equality of eigenvalues; equality of total variance; variation not associated with predictor variables |
| Random skewers test | Equality of response to the same vector of selection gradients |
| Selection skewers test | Equality of response to the same vector of selection differentials |

Some tests are restricted to pairwise comparisons whereas others can compare multiple matrices. For the present analysis, we consider only pairwise comparisons, except in the case of the tests based on the jackknife as these permit tests of association with predictor variables, in the present data set these being sex and temperature. There are two possible null hypotheses: the two matrices are identical vs. the two matrices are unrelated. We have adopted the position of Shaw (1991) that the most appropriate null hypothesis is that of identical matrices.

T method

This method, similar to one discussed by Willis et al. (1991), uses matrix disparity as an index of difference between two or more matrices (Roff et al., 1999). The statistic examined is the sum of the absolute difference between matrix elements and thus for two matrices it is


where EijA and EijB are the elements of matrices A and B, and p is the number of traits. Under the null hypothesis of equality, the expected value of TAB is zero and is tested using randomization. If a significant difference is found between matrices, then the source of this difference can be explored by examining the probabilities associated with the individual elements. This post hoc examination should be regarded as a hypothesis-generating method as the probabilities required for statistical significance when adjusted for multiple comparisons are likely to be extremely small. An alternative, more robust, approach to locating the source of variation is the jackknife-manova method described later.
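The statistic itself is simple to compute. The sketch below sums absolute differences over the distinct elements of two symmetric matrices (the diagonal and the upper triangle); restricting the sum to i ≤ j is an assumption made here, since summing over all elements would simply count each covariance twice. The example matrices are hypothetical.

```python
def t_statistic(A, B):
    """Sum of absolute element differences over the distinct elements (i <= j)."""
    p = len(A)
    return sum(abs(A[i][j] - B[i][j]) for i in range(p) for j in range(i, p))

# Hypothetical G matrices for two populations and two traits.
G_a = [[1.0, 0.5],
       [0.5, 2.0]]
G_b = [[1.5, 0.25],
       [0.25, 2.0]]

t_ab = t_statistic(G_a, G_b)
print(t_ab)  # 0.75
```

In a randomization test, this value would be recomputed after each reassignment of sire families and compared with the observed value in the upper tail.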

Modified Mantel test

Lofsvold (1986) suggested comparing the covariances in pairs of matrices using the Mantel test. The null hypothesis under the Mantel test is that the two matrices are uncorrelated, whereas a more appropriate null hypothesis is that the matrices are identical (Shaw, 1991). To include the entire matrix (i.e. the variances in addition to the covariances) and address the appropriate null hypothesis, Goodnight & Schwartz (1997) developed the following modified Mantel test. First, they suggest reducing the effect of variation in scaling by standardizing the variables: in the present data set, we did this because the traits themselves were measured on disparate scales (e.g. colour vs. development time). If all the traits were measured on the same scale, as would be true if all traits were, say, linear morphological components, then one might be concerned that standardization changes the relationships and thus the conclusions drawn. This problem has not been resolved, and in these cases it would be advisable to run the analysis with both unstandardized and standardized data to see whether standardization changes the conclusions. It must be emphasized that the method of standardizing variables should not change the relative difference between the two matrices: this is true in the present case because variables were standardized by the standard deviation of the entire data set and not by individual populations. The modified Mantel formula is


where = 1/(p(p − 1)/2 + p). To address the null hypothesis of equality, Goodnight & Schwartz (1997) used a bootstrap approach in which entire sire families in their half-sib experiment were selected with replacement. We adopted the same procedure. No method of bootstrapping a half-sib design, or any other pedigree design, will preserve the original sampling design precisely because, for example, in this case the number of distinct sires in the bootstrap sample will almost always be fewer than the number in the original sample. For this reason, caution should be exercised in interpreting the results.
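The loss of distinct sires under bootstrapping is easy to quantify. The sketch below draws whole sire families with replacement and also evaluates the standard expectation n(1 − (1 − 1/n)^n) for the number of distinct families in a sample of size n. The function names and the use of integer sire labels are illustrative assumptions.

```python
import random

def bootstrap_families(families, rng):
    """Draw len(families) whole sire families with replacement."""
    return [rng.choice(families) for _ in families]

def expected_distinct(n):
    """Expected number of distinct families in a bootstrap sample of size n."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n)

rng = random.Random(42)
sires = list(range(20))  # 20 sire labels, matching the half-sib design
sample = bootstrap_families(sires, rng)
n_distinct = len(set(sample))
print(n_distinct, round(expected_distinct(20), 2))
```

With 20 sires, the expectation is roughly 12.8 distinct sires per bootstrap sample, so a bootstrapped data set typically rests on about two-thirds of the original sire families.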

The modified Mantel test removes differences in size between matrices and compares differences in shape (i.e. their correlation structures). The null hypothesis for a Mantel test is that the two matrices are uncorrelated but in this case the null hypothesis is the opposite, namely that the two matrices are identical. Therefore, the cases that deviate from the expected are those that are smaller than the observed, and Goodnight & Schwartz (1997) estimated the probability as n/B, where n is the number of MB ≤ Mobs and B is the number of bootstrap replicates. We also used a randomization approach and, as with the bootstrap, the appropriate probability was calculated as n/R, where R is the number of randomizations, n is the number in which Mr ≤ Mobs and Mr is the correlation in the randomized matrices.

Bartlett’s test

In contrast to the above test, Bartlett’s test compares differences in the size of matrices independent of shape (Goodnight & Schwartz, 1997). The test is based on a comparison of the determinants of the matrices, which are measures of the ‘volumes’ occupied by the matrices: thus, for example, under the null hypothesis the 95% boundaries of two matrices are equal (Fig. 1). Note, as shown in Fig. 1, that matrices may be of equal size but be otherwise quite different. In itself, Bartlett’s test is not very informative, but it becomes so in conjunction with the modified Mantel test when the latter test is not significant (i.e. a correlation of 1 cannot be rejected, Goodnight, pers. comm.) or when the hierarchical test cannot reject the hypothesis of proportionality of matrix elements. The test statistic, B, is defined as


where df refers to degrees of freedom, which in the present analysis we set as the number of sires less one, Gcomb is the combined G matrix with elements


and $|G|$ denotes the determinant of the G matrix. Extension to more than two matrices is straightforward. Two matrices can be compared with a χ² statistic




Bartlett’s test is very sensitive to the assumption of normality and assumes that the matrices are estimated directly from independent vectors, which in the case of genetic matrices they are not. Because both of these assumptions are questionable in estimating genetic covariance matrices, Goodnight & Schwartz (1997) adopted the bootstrap as an alternative. Note that in this case Bartlett’s statistic B can be used directly as C will be common to all estimates. Probability was estimated from the set of bootstraps as n/N where n is the number of MB ≥ Mobs and N is the number of bootstrap replicates. An important limitation of Bartlett’s test is that the matrices must be positive definite (i.e. all eigenvalues positive). This may not be true for a G matrix, either because there actually is no variation in a particular direction as specified by the eigenvalue or because the amount is sufficiently small that a singular matrix is estimated either in the original or bootstrapped matrix. In addition to using the bootstrap approach, we also used a randomization test using Bartlett’s statistic, details of which followed those outlined for the T method.
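To make the role of the determinants concrete, the sketch below computes a Bartlett-type statistic for two matrices. The form used, B = (dfA + dfB) ln|Gcomb| − dfA ln|GA| − dfB ln|GB| with Gcomb the df-weighted average of the two matrices, is the standard textbook form and is an assumption here, since it may differ in detail from the expression used by the authors. The determinant routine and all names are illustrative.

```python
import math

def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in M]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det  # a row swap flips the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det

def bartlett_b(G_a, G_b, df_a, df_b):
    """Bartlett-type statistic for two matrices (assumed standard form)."""
    p = len(G_a)
    G_comb = [[(df_a * G_a[i][j] + df_b * G_b[i][j]) / (df_a + df_b)
               for j in range(p)] for i in range(p)]
    dets = [determinant(G) for G in (G_comb, G_a, G_b)]
    if min(dets) <= 0.0:
        # The logarithm is undefined, as for a singular G matrix.
        raise ValueError("all determinants must be positive")
    return ((df_a + df_b) * math.log(dets[0])
            - df_a * math.log(dets[1]) - df_b * math.log(dets[2]))

# Hypothetical matrices of the same shape but different size; df = sires - 1.
G_small = [[1.0, 0.0], [0.0, 1.0]]
G_large = [[4.0, 0.0], [0.0, 4.0]]
b_value = bartlett_b(G_small, G_large, 19, 19)
print(b_value)
```

The explicit error for a non-positive determinant mirrors the limitation noted above: bootstrap or randomization replicates that yield a singular matrix cannot contribute a value of B.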

Rank test

Any genetic variance or covariance that is zero obviously denies evolution of that trait or trait combination. Similarly, a genetic correlation of ±1 restricts evolution to a fixed line. Less obvious are restrictions determined by the eigenvalues of the matrix. These values measure the genetic variance along orthogonal axes defined by the eigenvectors. A matrix of full rank is one in which all eigenvalues are positive. Eigenvalues that are zero or negative define directions in which no genetic variance exists and hence in which evolution cannot proceed. Thus, an analysis of the eigenvalues addresses the question of whether there are particular directions in which evolution is restricted. A comparison of the eigenvalues among matrices addresses the question of whether the orientation of the matrices is the same with respect to the magnitude of genetic variance. Thus, there are two questions that can be asked: first, what is the probability that the observed matrix is less than full rank, and second, what is the probability that two matrices differ in rank?
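The link between a genetic correlation of ±1 and a zero eigenvalue can be seen directly in the two-trait case, where the eigenvalues of a symmetric matrix have a closed form. The sketch below uses hypothetical variances and correlations; the function names are illustrative.

```python
import math

def eigenvalues_2x2(G):
    """Eigenvalues (largest first) of a symmetric 2 x 2 matrix [[a, b], [b, c]]."""
    a, b, c = G[0][0], G[0][1], G[1][1]
    mid = (a + c) / 2.0
    half_gap = math.hypot((a - c) / 2.0, b)
    return mid + half_gap, mid - half_gap

def genetic_matrix(v1, v2, r):
    """Two-trait G matrix from genetic variances v1, v2 and correlation r."""
    cov = r * math.sqrt(v1 * v2)
    return [[v1, cov], [cov, v2]]

full_rank = eigenvalues_2x2(genetic_matrix(1.0, 1.0, 0.5))
reduced_rank = eigenvalues_2x2(genetic_matrix(4.0, 1.0, 1.0))
print(full_rank, reduced_rank)
```

With a correlation of 0.5 both eigenvalues are positive, whereas with a correlation of 1 the second eigenvalue is zero: all genetic variance lies along a single axis and the matrix has rank 1.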

Various methods have been suggested for estimating the effective rank of a matrix. Mezey & Houle (2005) assessed the rank of a given matrix using a bootstrap approach. Hine & Blows (2006) found by simulation that bootstrapping tended to overestimate the matrix rank and favoured a method suggested by Amemiya et al. (1990). However, the means of programming this approach and assigning the appropriate degrees of freedom are unclear, and it is therefore not used in the present analysis.

Goodnight & Schwartz (1997) compared the difference in rank of two matrices with a bootstrap approach. The pairing of the two bootstrapped distributions was performed at random. An alternative approach would be to compare the two frequency distributions of the bootstrap data. A third method is to compare the difference between the observed matrices with that obtained by randomization. It is not clear in these tests what the appropriate null distribution would be: if the eigenvalues and sample sizes are all large, then two matrices drawn from the same population will consistently be of full rank and hence will not differ in rank 5% of the time. Similarly, if an eigenvalue is very small, then the ranks of the two matrices will differ more than 5% of the time. We have verified this with simulations and thus have not considered this approach further in this article. However, as described below, we do consider an alternative method (the jackknife-eigenvalue test) based on the jackknife.

Hierarchical analyses

Whereas the eigenvalues determine the amount of genetic variation in the eigenvectors, the eigenvectors themselves determine the overall bias in evolutionary change. Thus, a comparison of the eigenvectors is a central component of the analysis of variation among genetic and phenotypic matrices. The Flury hierarchy, which is itself part of a larger hierarchical structure (Boik, 2002), extends the dimensionality approach of rank analysis by analysing the differences between the eigenvectors. Each eigenvector, or principal component, describes a linear combination of the traits that defines an axis orthogonal to all other principal components: there are as many principal components as there are traits and the variance accounted for declines in sequence with the principal components. Despite its popularity and obvious importance, there are potential problems in interpretation, and although it gives insight into some aspects of matrix structure, it cannot be considered a sufficient metric of matrix variation (Houle et al., 2002).

The Flury hierarchy recognizes a sequence that can be arranged in increasing similarity (Phillips & Arnold, 1999): (i) Unrelated structure, matrices share no principal components in common, (ii) Partial Common Principal Components, matrices share some of the principal components, (iii) Common Principal Components, matrices share all of the principal components but the eigenvalues differ, (iv) Proportionality, matrices have the same set of principal components but the eigenvalues of one matrix differ from another by a constant proportion, (v) Equality, matrices have identical principal components and eigenvalues. A set of programs for the analysis of the Flury hierarchy was developed by Phillips & Arnold (1999). Unfortunately, programs for the analysis of nested designs such as the half-sib pedigree design are not included in this package. Because of difficulties in programming the solution for the partial common principal components, we have restricted our analysis to the hierarchy with this component omitted. Testing for partial common principal components could be done by systematically reducing the number of traits, although this will lead to multiple tests and problems of assigning significance. The necessary equations for this analysis, following Manly & Rayner (1987) and Trendafilov (2010) and presented in reverse order of the hierarchy, are as follows:

Equal matrices: The maximum likelihood estimator inline image given s matrices is


where Sj is the sample covariance matrix, nj is the ‘sample’ size of the jth matrix (here taken to be the number of sire families) and inline image.

Proportional matrices: The maximum likelihood estimates of the proportionality constants inline image is obtained by iteration of the two equations


To find the solution, we start with inline image for all j and substitute in eqn (8) to obtain inline image, which is then substituted into eqn (9) to obtain a new estimate of inline image. This process is continued until the change in the log likelihood, shown below, is less than a given tolerance value


Preliminary tests on the data sets examined in the present analysis indicated that 10 iterations were quite satisfactory, and we used this number throughout the analyses.

Common principal components: The estimation of the matrices under the assumption of common principal components was performed using the stepwise algorithm of Trendafilov (2010). Convergence is typically very rapid, and preliminary runs of the data sets analysed here indicated that 10 iterations were satisfactory.

Unrelated structure: The maximum likelihood estimate under the assumption of different matrices is the sample covariance matrix Sj.

We adopted the jump-up approach to model testing advocated by Phillips & Arnold (1999). Each model was tested against that of unrelated structure using the statistic


where inline image is the estimated genetic covariance matrix for model type M (equal, proportional, CPC) in the jth population. Probability was estimated by randomization using 499 randomizations. Those runs in which the determinant could not be calculated were dropped from the estimation of the probability.

Jackknife-manova test

The tests described above examine the variation in structure between matrices. Such variation could result from genetic drift, selection or some environmental factor that influenced phenotypic expression. We might expect the signal of these effects, acting singly or in concert, to be contained in the matrix structure. For example, in the analysis of the G matrices of the amphipod Gammarus minus, Roff (2002) examined the joint influence of two environmental variables (habitat and drainage basin) on variation in four populations, showing that both factors contributed to variation among matrices. In the present data set, there are two variables, temperature (18, 23, 28 °C) and sex. Differences among matrices may exist overall, irrespective of the sex/temperature combination, which is what the other tests focus on, or they may be a function of one or both variables. Both types of variation can be analysed with the jackknife-manova method (Roff, 2002). This approach utilizes the jackknife statistical procedure to produce a set of pseudovalues and then uses manova to relate these variables to the predictor variables. As noted above, a useful property of this method is that individual anovas can be conducted to test element-by-element variation (Roff, 2002). In the present half-sib breeding design, pseudovalues were produced by eliminating entire sire families, giving a final data set of 20 pseudovalues for each trait by sex–temperature combination. A previous analysis showed that the type 1 error rate of the jackknife-manova method was as required (Begin & Roff, 2004), but it is also given here for the G matrix used in the previous simulations.
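The delete-one-family jackknife underlying this method can be sketched as follows, using the standard pseudovalue formula φi = nθ − (n − 1)θ(−i), where θ is the estimate from all n sire families and θ(−i) is the estimate with family i removed. The estimator used in the example (the mean of family means) is only a stand-in chosen because its pseudovalues are easy to verify; in practice the estimator would return an element of the G matrix. All names and data are hypothetical.

```python
import math

def jackknife_pseudovalues(families, estimator):
    """Delete-one-family jackknife: phi_i = n*theta - (n - 1)*theta_(-i)."""
    n = len(families)
    theta_all = estimator(families)
    pseudo = []
    for i in range(n):
        theta_minus_i = estimator(families[:i] + families[i + 1:])
        pseudo.append(n * theta_all - (n - 1) * theta_minus_i)
    return pseudo

def mean_and_se(values):
    """Jackknife estimate (mean of pseudovalues) and its standard error."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

def mean_of_family_means(families):
    """Stand-in estimator: the mean of the sire-family means."""
    return sum(sum(f) / len(f) for f in families) / len(families)

families = [[1.0, 3.0], [2.0, 4.0], [6.0, 8.0]]  # three toy sire families
pseudo = jackknife_pseudovalues(families, mean_of_family_means)
estimate, se = mean_and_se(pseudo)
print(pseudo, estimate, se)
```

For this linear estimator the pseudovalues reduce to the family means themselves. The pseudovalues for each matrix element can then be treated as ordinary observations in a manova or anova with predictors such as sex and temperature.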

Jackknife-eigenvalue test

In addition to estimating the genetic covariances, the jackknife method described above can be used to estimate the eigenvalues and their associated standard errors, from which the probability of the estimate being greater than zero can be determined with a one-tailed t test. We also estimated the summed value of all eigenvalues, Esum, which is equivalent to the trace of the matrix and measures the total genetic variance (Kirkpatrick, 2009). The eigenvalues and Esum from different populations can be compared with a manova or a t test on the eigenvalue and Esum pseudovalues. The proposed test compares the ith eigenvalue (or Esum) from the set of matrices: it is thus an element-by-element test of the eigenvalues. The leading eigenvalues tend to be overestimated and the trailing eigenvalues underestimated (Lawley, 1956; Walsh & Lynch, 2011): although this may not be a serious problem when comparing the ith eigenvalue from several matrices (given that they will all be over- or under-estimated), it could indicate incorrectly that smaller eigenvalues are not significantly different from zero.

To address these concerns, we ran simulations asking the following questions. First, does the method produce an unbiased estimate of the statistic? Second, is the standard error estimated from the jackknife procedure a valid estimate of the standard deviation of the estimator? Third, what is the power of the test? The last question is particularly important because drawing the conclusion that a matrix is of reduced rank may be erroneous if the standard error associated with the estimate includes zero but is also large enough to include biologically important positive values. The variance accounted for by each eigenvalue forms a decreasing series, and thus the later eigenvalues may be relatively poorly estimated and, as noted above, may be under-estimated.

Random skewers test

A major reason for comparing G matrices is to ascertain whether different evolutionary trajectories will be followed if selection is applied. The random skewers method developed by Cheverud (1996) is a useful approach to this problem. The essential idea behind this method is to compare the colinearity of response of two matrices to a set of randomly applied selection vectors:

$$R_A = G_A\beta, \qquad R_B = G_B\beta$$


where RA and RB are the response vectors of populations A and B, respectively, GA and GB are the two genetic covariance matrices and β is the selection gradient, which is replaced here with the random skewers. Elements in the selection vector are drawn at random from a uniform distribution from −1 to +1 and the total length of the vector standardized such that the sum of squared vector elements equals one (Cheverud & Marroig, 2007). This last procedure is strictly not required (Manly, 1997) and can thus be omitted to marginally increase processing speed. The vector correlation is calculated as (Calsbeek & Goodnight, 2009)

$$r_i = \frac{R_A^{T} R_B}{\sqrt{\left(R_A^{T} R_A\right)\left(R_B^{T} R_B\right)}}$$


where ri is the ith vector correlation and T signifies matrix transpose. The response vectors are calculated as GjSi, where j = A or B and Si is the ith random skewer. If the matrices are identical, the vector correlation between responses will be one, whereas if the two matrices are completely unrelated, then the correlation will be zero. Because the random skewers method uses the estimated matrices as if they were the true matrices, the estimated vector correlation will generally be biased and tests of significance unreliable. However, statistical differences between the matrices can be determined using a bootstrap approach (Cheverud, 1989, 1996).

Cheverud & Marroig (2007) took as the null hypothesis that matrices were unrelated, whereas Calsbeek & Goodnight (2009) assumed a null hypothesis of identical matrices. As noted above, we took the latter as our null hypothesis and tested using the same approach as for the modified Mantel test. First, we estimated the mean vector correlation based on 500 random skewers. We then constructed two new matrices by bootstrapping sire families from the two data sets and computed the resulting vector correlation. We repeated this process 500 times to generate a bootstrap distribution of mean vector correlations. The null hypothesis is that the two matrices are identical and thus the vector correlation is one. Therefore, the cases that deviate from the expected are those that are smaller than the observed (Goodnight & Schwartz, 1997) and the probability is estimated as n/B, where n is the number of rb ≤ robs and B is the number of bootstrap replicates. We also used a randomization approach and, as with the bootstrap, the appropriate probability was calculated as n/R, where R is the number of randomizations, n is the number in which rr ≤ robs and rr is the correlation in the randomized matrices.
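The mean vector correlation over a set of random skewers can be sketched as below. The skewers are drawn from a uniform distribution on (−1, +1) and scaled to unit length, as described in the text; the function names, the seed and the example matrices are assumptions made for illustration, and the bootstrap or randomization layer that supplies the probability is omitted.

```python
import math
import random

def mat_vec(G, v):
    """Response vector: the product of matrix G and skewer v."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def vector_correlation(a, b):
    """Cosine of the angle between two response vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def random_skewer(p, rng):
    """Random selection vector with elements from U(-1, 1), scaled to unit length."""
    v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def mean_skewer_correlation(G_a, G_b, n_skewers=500, seed=1):
    """Mean vector correlation of the responses of G_a and G_b to shared skewers."""
    rng = random.Random(seed)
    p = len(G_a)
    total = 0.0
    for _ in range(n_skewers):
        s = random_skewer(p, rng)
        total += vector_correlation(mat_vec(G_a, s), mat_vec(G_b, s))
    return total / n_skewers

# Hypothetical G matrices with opposite signs of genetic covariance.
G_a = [[1.0, 0.5], [0.5, 2.0]]
G_b = [[2.0, -0.5], [-0.5, 1.0]]
r_same = mean_skewer_correlation(G_a, G_a)
r_diff = mean_skewer_correlation(G_a, G_b)
print(r_same, r_diff)
```

Because the correlation measures only direction, proportional matrices also give a mean correlation of one; differences in size must be detected by other tests such as Bartlett’s.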

Selection skewers test

The random skewers test is applied to either the genetic or the phenotypic matrices separately. As pointed out by Calsbeek & Goodnight (2009), an important question is whether the two populations are expected to respond to selection in the same manner, which is given by the multivariate breeder’s equation

$$\Delta\bar{z} = G P^{-1} S$$

where S is the selection vector. The random skewers approach can be applied in this case by using the product GP−1 in place of either G or P. The selection skewers can be constructed to apply selection in a particular direction if such is a focus of interest (Calsbeek & Goodnight, 2009) or, as in this case, by randomly selecting skewers as previously. Tests of significance can be constructed in the same manner.
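
As a minimal sketch of the substitution described above, the following Python forms the product GP−1 and applies it to a selection vector. The helper names (`mat_inverse`, `selection_response`) are hypothetical, and the Gauss-Jordan inverse is a convenience for small matrices rather than a recommendation for numerical work.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    # Augment the matrix with the identity: [M | I].
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right-hand block now holds the inverse.
    return [row[n:] for row in aug]


def selection_response(g, p, s):
    """Response to the selection vector s: (G P^-1) s."""
    gp_inv = mat_mul(g, mat_inverse(p))
    return [sum(x * y for x, y in zip(row, s)) for row in gp_inv]
```

The matrix returned by `mat_mul(g, mat_inverse(p))` can then be passed to a random skewers routine in place of G or P.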

What do the tests tell us about evolution?

Each test measures a different aspect of the G (or P) matrix and hence gives different information on how the G matrix affects evolutionary change and to some extent how drift and selection have influenced the evolution of the G matrix itself. The T method is the simplest in that if the null hypothesis is not rejected, we would provisionally predict that the same vector of selection gradients applied to both populations would produce the same set of direct and correlated responses. In this respect, it provides the same answer as the random skewers method, although the latter is a direct test of the response to selection by the same vector of selection gradients. It is also feasible, but perhaps unlikely, that different matrices might nevertheless produce the same response and hence not be differentiated by the random (or selection) skewers tests. Equality of the matrices suggests that the populations have been subject to the same selection regime and that drift has not been sufficient to cause differentiation or has caused the same degree of change.

The modified Mantel test measures differences in the structure of the matrices: under the null hypothesis, the correlation between the elements of the two matrices is 1. This means that the elements of one matrix are proportional to the elements of the other matrix, equality being one possibility. Proportionality could come about due to genetic drift (Lande, 1979) or selection on traits that are highly correlated (Roff, 2004). The effect on response to selection will depend on changes in the phenotypic variance–covariance matrix but the general prediction is that the responses to selection will differ. If the selection gradients are the same, then the population with the larger G matrix will respond faster.
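
To illustrate the statement that proportional matrices give an element correlation of one, the sketch below computes a Pearson correlation between the unique elements of two symmetric matrices. It is an illustration only: the function names are hypothetical, and including the diagonal (variances) alongside the covariances is an assumption made here.

```python
import math


def unique_elements(matrix):
    """Lower triangle, including the diagonal, of a symmetric matrix."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def element_correlation(g_a, g_b):
    """Correlation between corresponding elements of two matrices."""
    return pearson(unique_elements(g_a), unique_elements(g_b))


g_a = [[1.0, 0.3, 0.1], [0.3, 0.5, 0.2], [0.1, 0.2, 0.8]]
g_b = [[2.5 * x for x in row] for row in g_a]  # proportional to g_a
r_proportional = element_correlation(g_a, g_b)
```

Here `r_proportional` equals one up to rounding error, even though every variance and covariance in `g_b` is 2.5 times larger, so the population with `g_b` would respond faster to the same selection gradients.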

Bartlett’s test focuses on size (‘volume’) of the matrices. In the example shown in Fig. 1, the two matrices do not differ in size but quite clearly differ in the shape of the displayed ellipse, and the responses to selection on trait 2 will be much stronger in the long ‘skinny’ one than the ‘fat’ one. Because the matrices may differ in other ways, this test is most useful (and interpretable) only if the modified Mantel test or hierarchical test of proportionality are not significant.

The rank test compares the number of significant eigenvalues, the null hypothesis being equal rank. If both matrices are of full rank, then evolution is not constrained in any direction, whereas reduced rank means that there are directions (combinations of trait values) in which selection cannot move the population. The jackknife-eigenvalue analysis addresses both the question of the rank of a matrix and the question of whether the equivalent eigenvalues are the same. Significant differences in rank suggest that selection has been strong enough to erode genetic variance (eigenvalue) in one principal component (axis) in one population more than the other. In the scenario displayed in Fig. 1, the eigenvalues of the major axes are the same but those of the minor axis differ. Thus, the population with reduced variance in the minor axis will respond less in this direction to a given selection intensity than the other population.

The hierarchical analyses jointly examine the orientation of the axes (the principal components or eigenvectors) and the amount of variance (eigenvalue) along these axes (Fig. 1). Matrices that are equal have identical elements and thus will respond in the same manner to selection. This component of the hierarchical analysis corresponds to the T and jackknife-manova methods. The first level of matrix differentiation is that of proportionality in which the elements of one matrix are a multiple of the other matrix: this is the same hypothesis tested by the modified Mantel test. The second level tested here is that the matrices are aligned along the same axes (principal components) but may differ in the relative amount of variance along these axes. The response to selection will depend on the relative differences in genetic variance (eigenvalues) in the component axes (principal components). Finally, the matrices may differ in at least one principal component: note that with two traits, a difference in one principal component will necessarily entail a difference in the second (Fig. 1). Unrelated structures suggest that the G matrices have been subject to very different selection pressures and will obviously show differences in response to the same selection regime.

The jackknife-manova tests the hypothesis of matrix equality but also how matrix variation may be related to other variables. For example, the jackknife-manova test of four populations of the isopod G. minus suggested that both selection and drift were responsible for differences in the G matrices (Roff, 2002). This method can also be used to locate the specific elements of the matrices responsible for the overall difference.

As discussed above, the jackknife-eigenvalue tests both the hypothesis that an eigenvalue is not significantly different from zero (in which case there is no genetic variation along that particular axis) and differences between the equivalent eigenvalues in the two matrices. In the example shown in Fig. 1, the two matrices do not differ in genetic variance along the major axis but do along the minor axis. Selection in the general direction of the second axis will produce greater response in the population represented by the upper ellipse than in that represented by the lower ellipse.

The skewers tests explicitly test the null hypothesis of no difference in response to the same vector of selection gradients (random skewers test) or selection differentials (selection skewers test). The former implies the same G matrices and is thus equivalent to the T and jackknife-manova tests in this regard.


Simulation results

Results of the simulations for all except the jackknife-eigenvalue test are summarized in Table 3 and Fig. 2. With the exception of the T method, the randomization and jackknife-manova approaches provide acceptable distributions. The bootstraps never produced distributions acceptable over the full range and thus cannot be considered preferable to the randomization approach for the present data sets. We therefore did not use the bootstrap methods in the analysis of the experimental data. The T method gave a distribution that was conservative and so was used with this caveat.
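
The goodness-of-fit criterion used here compares the simulated probabilities with a uniform distribution on 0–1. A minimal sketch of the one-sample Kolmogorov–Smirnov statistic for that comparison is given below; the function name `ks_statistic_uniform` is hypothetical and the sketch returns only the statistic, not its P value.

```python
def ks_statistic_uniform(p_values):
    """Largest distance between the empirical CDF of p_values and U(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d_plus = (i + 1) / n - x   # empirical CDF just after x, minus uniform CDF
        d_minus = x - i / n        # uniform CDF minus empirical CDF just before x
        d = max(d, d_plus, d_minus)
    return d
```

Probabilities that are evenly spread over 0–1 give a small statistic, whereas a test that is too conservative (probabilities piled up near one) or too liberal (piled up near zero) gives a large one.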

Table 3.   Kolmogorov–Smirnov goodness-of-fit test of simulated results on the expected uniform distribution with range 0–1. See Table 4 and text for discussion of jackknife-eigenvalue test.

| Test | K–S statistic | P | Comments |
| --- | --- | --- | --- |
| T test | 0.215 | 0.0002 | Conservative |
| Modified Mantel test – bootstrap | 0.420 | < 0.0001 | Unacceptable |
| Modified Mantel test – randomization | 0.075 | 0.6272 | Acceptable |
| Bartlett’s test – theoretical | 0.241 | < 0.0001 | Unacceptable |
| Bartlett’s test – bootstrap | 0.555 | < 0.0001 | Unacceptable |
| Bartlett’s test – randomization | 0.069 | 0.748 | Acceptable |
| Hierarchical – equality | 0.095 | 0.3238 | Acceptable |
| Hierarchical – proportional | 0.102 | 0.2492 | Acceptable |
| Hierarchical – CPC | 0.096 | 0.3142 | Acceptable |
| Random Skewers – bootstrap | 0.292 | < 0.0001 | Unacceptable |
| Random Skewers – randomization | 0.064 | 0.8073 | Acceptable |
| Selection Skewers – bootstrap | 0.268 | < 0.0001 | Unacceptable |
| Selection Skewers – randomization | 0.062 | 0.8367 | Acceptable |

Figure 2. Cumulative frequency plots of probabilities from 500 replicate simulation runs. • Randomization, ○ Bootstrap. Results for the hierarchical analyses are too similar to permit visual separation.

For the jackknife-eigenvalue test, we ran 5000 simulations and for each run tested for eigenvalues greater than zero using a one-tailed t test. Results are shown in Table 4. Bias was negligible and the estimated standard errors were very close to the standard deviations of the 5000 simulated sets. Thus, at least for this data set, the problem of overestimation of the leading eigenvalues and underestimation of the smaller eigenvalues appears to be negligible. Power (i.e. the probability of correctly rejecting the null hypothesis that the mean was less than or equal to zero) was extremely high for Esum and the first three eigenvalues but lower than the required 80% (Cohen, 1988) for the fourth. Given the low power of the test on the fourth eigenvalue, any conclusion that the matrix was of reduced rank has to be viewed with caution.
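
The delete-one jackknife underlying this test can be sketched generically as follows. This is an illustration under stated assumptions rather than the procedure as implemented for the article: `jackknife` is a hypothetical name, `units` stands for the resampling units (sire families in the present design), and `statistic` is any function returning a single number (here it would be an eigenvalue of the estimated G matrix, which is not computed in this sketch).

```python
import math
import statistics


def jackknife(units, statistic):
    """Delete-one jackknife: pseudovalues, estimate, SE and t statistic.

    Pseudovalue i is n * theta_all - (n - 1) * theta_without_i.
    """
    n = len(units)
    theta_all = statistic(units)
    pseudovalues = []
    for i in range(n):
        reduced = units[:i] + units[i + 1:]  # drop the i-th unit
        pseudovalues.append(n * theta_all - (n - 1) * statistic(reduced))
    estimate = statistics.mean(pseudovalues)
    se = statistics.stdev(pseudovalues) / math.sqrt(n)
    # t statistic for H0: parameter <= 0, to be compared with a one-tailed
    # critical value on n - 1 degrees of freedom.
    t = estimate / se if se > 0 else float("inf")
    return pseudovalues, estimate, se, t
```

Because the pseudovalues are treated as approximately independent observations, they can also be carried forward into a manova with predictor variables such as temperature and sex.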

Table 4.   Results of an individual-based simulation model to test the efficacy of the jackknife method of estimating the eigenvalues of a G matrix. See text for details.

| Eigenvalue | Actual value | Mean jackknife estimate | Percentage bias (SE) | SD of estimates | Estimated SE from jackknife | Power, P (Ei > 0) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.975 | 1.012 | −3.8 (0.4) | 0.30 | 0.28 | 1.00 |
| 2 | 0.783 | 0.739 | 5.6 (0.4) | | | |
| 3 | 0.332 | 0.333 | −0.3 (0.2) | | | |
| 4 | 0.151 | 0.154 | −2.1 (0.1) | | | |
| E.Sum* | 2.240 | 2.237 | 0.1 (0.6) | 0.39 | 0.39 | 1.00 |

*Jackknifed sum of the four eigenvalues.

Experimental results

The jackknife-manova found no overall differences due to sex (F10,105 = 0.89, P = 0.5439), temperature (F20,210 = 1.1819, P = 0.2722) or interaction (F20,210 = 0.8882, P = 0.6023). A manova on the eigenvalue pseudovalues showed no significant interaction or effect of sex (P > 0.5 in both cases) but a significant temperature effect (F8,222 = 2.30, P = 0.00988). The t tests on the individual eigenvalues showed the first three to be consistently and significantly different from zero but several of the fourth eigenvalues had confidence limits that spanned zero (Table 5).

Table 5.   Jackknife estimates of eigenvalues for the six different temperature–sex combinations.

| Temperature (°C) | Sex | E1 (SE) | E2 (SE) | E3 (SE) | E4 (SE) |
| --- | --- | --- | --- | --- | --- |
| 18 | Female | 0.543 (0.131)* | 0.877 (0.142)* | 0.205 (0.105)& | 0.081 (0.060) |
| 18 | Male | 0.989 (0.249)* | 0.834 (0.183)* | 0.1827 (0.068)$ | 0.342 (0.099)* |
| 23 | Female | 0.946 (0.228)* | 0.411 (0.117)* | 0.358 (0.131)$ | 0.048 (0.056) |
| 23 | Male | 0.927 (0.279)* | 0.522 (0.157)* | 0.451 (0.158)$ | −0.035 (0.081) |
| 28 | Female | 0.524 (0.153)* | 0.567 (0.111)* | 0.288 (0.118)& | 0.145 (0.069)& |
| 28 | Male | 0.838 (0.254)* | 0.418 (0.129)* | 0.290 (0.118)& | 0.121 (0.115) |

Symbols give the probability that an eigenvalue is less than zero: *P < 0.004, $0.004 < P < 0.01, &0.01 < P < 0.05. For the individual combinations, the significance level after Bonferroni correction for multiple (24) tests is 0.004 (0.002 if a two-tailed test is used. This does not change the results).

Because of the problem of multiple tests, pairwise comparisons have to be viewed cautiously. The probabilities of getting 1, 2 and 3 significant results at the 5% level are 0.38, 0.19 and 0.06, respectively: at the 1% level, they are 0.17, 0.016 and 0.00096, respectively. The primary purpose of the pairwise comparisons is to examine the pattern of results among the different tests. The jackknife-eigenvalue tests form one group whereas the remainder of the tests, with the possible exception of the jackknife-manova test, forms a second (Table 6). Results of pairwise comparisons indicate that eigenvalues differed in up to seven of the combinations, and in all cases, eigenvalues 2 and/or 4 were responsible (Table 6). The two combinations identified by the jackknife-manova as possibly different were the same as two identified by the eigenvalue tests. In contrast, the other tests consistently showed a difference between females at 23 °C and males at 28 °C. Additionally, Bartlett’s and the Hierarchical tests found a difference between males at 18 °C and females at 23 °C, a combination that the jackknife-eigenvalue tests also found to be highly significant.
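
The probabilities quoted above are consistent with a binomial calculation of obtaining exactly k significant results by chance. The number of tests is not stated in this passage; the value of 20 used below is an assumption, chosen because it reproduces the quoted figures to the precision given. The function name is hypothetical.

```python
from math import comb


def prob_exactly_k_significant(k, n_tests, alpha):
    """Binomial probability of exactly k of n_tests being significant at level alpha
    when every null hypothesis is true and the tests are independent."""
    return comb(n_tests, k) * alpha ** k * (1 - alpha) ** (n_tests - k)


N_TESTS = 20  # assumed, not stated in the text

at_5_percent = [prob_exactly_k_significant(k, N_TESTS, 0.05) for k in (1, 2, 3)]
at_1_percent = [prob_exactly_k_significant(k, N_TESTS, 0.01) for k in (1, 2, 3)]
```

Under this assumption `at_5_percent` rounds to 0.38, 0.19 and 0.06, and `at_1_percent` rounds to 0.17, 0.016 and 0.00096. The calculation also assumes the tests are independent, which pairwise comparisons sharing a matrix are not.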

Table 6.   Summary of pairwise matrix comparisons in which P < 0.1.
CombinationTMBHierarchical testsSkewersJ-manovaEigenvalue tests
  1. *Test not possible because fourth eigenvalue of matrix was negative (Table 5).

F18M18        0.0730.056   0.0300.093
F18F23        0.0320.098 0.016   
F18M23  na*            
F18F28           0.094   
F18M28           0.022   
M18F23  0.0460.0440.0490.065   0.008 0.059 0.013 
M18M23  na      0.015   0.005 
M18F28         0.051    0.040
M18M28           0.071   
F23M23  na            
M23F28  na          0.099 
M23M280.088 na            

To further identify similarities between tests, we computed the correlation between the probabilities estimated by each test (Table 7). Probabilities from all the tests except the eigenvalue tests tend to be highly correlated, whereas the jackknife-eigenvalue tests are only correlated with each other. A principal component analysis of the probabilities, with Bartlett’s test deleted because of missing values, shows that PC1 is heavily weighted to those tests other than the jackknife-eigenvalue tests whereas PC2 has high negative weights to the hierarchical tests (Table 8). A plot of PC2 on PC1 suggests that there are four groupings: the jackknife-eigenvalue tests, the Hierarchical tests, the Skewer analyses and the remainder (modified Mantel, T method, jackknife-manova), with the latter two possibly forming a single group (Fig. 3).

Table 7.   Correlations (below diagonal) between probabilities estimated from each testing method.
 TMBHierarchical testsSkewersJ–MEigenvalue tests
  1. #0.1 > P > 0.05; ##0.05 > P > 0.01; ###0.01 > P > 0.001; *P < 0.001.

T method *##### *####      
Mantel0.89 ##### *######      
Bartlett0.740.83 **###*#####      
Equal0.530.510.99 **  ##      
Proportional0.510.490.980.99 *  ##      
Random skewer0.840.880.960.400.380.10 *###      
Selection skewer0.560.760.860.160.16−0.020.82 ##      
J-Eigen0.  ########
E2−0.22−0.15−0.51−0.01−0.010.03−0.35−0.34−0.220.56−0.01 #  
E30.060.140.18−0.33−0.37−0.460.260.340.070.56−0.220.46 # 
E40.100.110.45−0.06−0.13−0.320.350.230.220.53−0.11−0.120.50 *
Table 8.   Weightings for first five PC scores. Weightings sorted according to coefficient for first PC.

| Test | Comp. 1 | Comp. 2 | Comp. 3 | Comp. 4 | Comp. 5 |
| --- | --- | --- | --- | --- | --- |
| Random skewer | 0.39 | 0.17 | −0.19 | 0.01 | −0.05 |
| T method | 0.39 | 0.02 | −0.14 | −0.07 | −0.14 |
| Selection skewer | 0.29 | 0.18 | −0.35 | −0.18 | 0.05 |
| Cumulative percent | 36.66 | 58.79 | 75.12 | 84.53 | 92.86 |

Figure 3. Bivariate plot of PC2 on PC1.

Genetic correlations are highly significantly correlated with the phenotypic correlations (r = 0.79, F1,34 = 56.6, P < 0.0001) but are approximately twice the absolute value of the phenotypic correlations (rG = −0.06 (SE = 0.04) + 1.94 (0.26) rP, Fig. 4). Thus, in this case, contrary to Cheverud’s conjecture (Cheverud, 1988), the phenotypic correlations could not be used as surrogates for the genetic correlations.

Figure 4. Plots of genetic correlation vs phenotypic correlations. Solid line shows regression line. Dashed line shows 1:1 relationship.


The simulation results indicate that the bootstrap approach is generally inappropriate for the present data set. This is not surprising given the relatively small sample (20 sires) upon which the bootstraps are drawn. Unfortunately, this will be a common problem with genetic data in which, although the total sample size may be relatively large (∼270 per treatment and sex in the present case), the unit of resampling is an aggregate of the raw data. Bootstrap estimates drawn from small samples will typically be biased and have confidence intervals smaller than expected (Roff, 2006). Without simulation, it is not a priori clear what constitutes a sample large enough for the bootstrap to be a valid procedure. On the other hand, the present results show that the bootstrap can be replaced in all cases by randomization. Thus, for the purpose of comparing G matrices, we suggest using randomization rather than the bootstrap. If interest lay in the actual distribution of genetic parameters, then a bootstrap approach would be appropriate. However, this should not be used without first investigating its properties with respect to the characteristics of the data set under study.
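
For readers unfamiliar with the randomization alternative, a generic sketch is given below. It assumes that the resampling unit is the sire family, that families are pooled across the two groups and randomly reassigned without replacement, and that small values of the test statistic indicate dissimilar matrices (as with a matrix or vector correlation). The names `randomization_p_value` and `statistic` are hypothetical; in practice `statistic` would re-estimate the two G matrices from the reassigned families and return the comparison measure.

```python
import random


def randomization_p_value(families_a, families_b, statistic, n_rand=500, seed=1):
    """Proportion of randomizations whose statistic is <= the observed value."""
    rng = random.Random(seed)
    observed = statistic(families_a, families_b)
    pooled = list(families_a) + list(families_b)
    n_a = len(families_a)
    count = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)  # reassign families to groups at random
        if statistic(pooled[:n_a], pooled[n_a:]) <= observed:
            count += 1
    return count / n_rand


def toy_statistic(group_a, group_b):
    """Stand-in for a matrix similarity: 0 when group means agree, negative otherwise."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return -abs(mean_a - mean_b)
```

Unlike the bootstrap, each randomized data set contains every family exactly once, so the procedure does not rely on a small number of sires adequately representing the sampling distribution.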

The reason for the proliferation of tests for comparing G matrices is that in principle matrices can be different in several ways and each test is sensitive to different aspects of the variation (see Table 2 and section ‘What do the tests tell us about evolution?’). For the present data set, our general empirical finding is that the tests fall into two broad categories: tests based on variation in eigenvalues and all other tests. Finer-scale examination of the latter suggests that these fall into two or possibly three groups: hierarchical tests, skewers and the rest (jackknife-manova, modified Mantel, T method, probably Bartlett’s). The eigenvalue tests appeared to suggest that more combinations were likely different than did the other tests. The other tests all agreed on variation in a single combination, which was not indicated by the eigenvalue tests.

Most tests only test for differences among matrices, generally in a pairwise fashion. The jackknife procedure differs in that it permits not only this type of testing but also tests that incorporate predictor variables (Roff, 2002). In the present analysis, the jackknife-manova test identified no differences due to sex or temperature, whereas the jackknife-eigenvalue tests, which are based on the pseudovalues from a delete-one jackknife, showed significant differences due to temperature.

Demonstrating differences among G matrices is important but one should not be particularly surprised when such differences are found. The critical questions are: what processes generated these differences, and are the differences biologically relevant? In this regard, an examination of the jackknife-eigenvalue tests is illuminating. These tests indicated differences typically in the second and/or fourth eigenvalue. The fourth eigenvalue accounts on average for only 6% of the total genetic variance, which means that there is very little variance in the direction of the fourth eigenvector upon which selection can act. This reduced variance may be the result of erosion by past selection or simply a statistical artefact resulting from the way variance is apportioned among the eigenvectors. Differences in the fourth eigenvalue among the matrices may represent past selection acting directly on the combination of traits represented by this eigenvector or selection acting on other trait combinations. The conclusion one can draw from the jackknife-eigenvalue tests is that significant differences among the fourth eigenvalues are not of much contemporary biological significance as there is so little genetic variance in this component. However, the second eigenvalue accounts for an average of 33% of the total genetic variance and thus there is plenty of scope for selection to cause divergence in the trait combination represented by the second eigenvector.

The fact that a suite of tests gives more or less the same result in the present analysis should not be taken to indicate that the comparison of G matrices can typically be achieved using only one of these tests. In the present case, the tests do appear to be sensitive to the same difference among the matrices but it is possible that in other data sets the matrices may differ in such a manner that not all tests will pick up the difference. Given the relative ease with which all these tests can be administered, we suggest that they all be applied, although the T method could be omitted as it is perhaps too conservative.


We are most grateful for the advice and assistance of Drs. Bruce Walsh and Charles Goodnight.

Data deposited at dryad: doi: 10.5061/dryad.kb27f3t1