nadiv: an R package to create relatedness matrices for estimating non-additive genetic variances in animal models

Authors

  • Matthew E. Wolak

    1. Graduate Program in Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
    2. Department of Biology, University of California, Riverside, CA 92521, USA
      Correspondence author. E-mail: matthew.wolak@email.ucr.edu


Summary

1. The Non-Additive InVerses (nadiv) R software package contains functions to create and use non-additive genetic relationship matrices in the animal model of quantitative genetics.

2. This study discusses the concepts relevant to non-additive genetic effects and introduces the package.

3. nadiv includes functions to create the inverse of the dominance and epistatic relatedness matrices from a pedigree, which are required for estimating these genetic variances in an animal model. The study focuses on three widely used software programs in ecology and in evolutionary biology (ASReml, MCMCglmm and WOMBAT) and how nadiv can be used in conjunction with each. Simple tutorials are provided in the Supporting Information.

Introduction

A major advance for the study of quantitative trait evolution in wild populations was precipitated by the adoption of the ‘animal model’, a mixed effects model with a long and proven history in the animal breeding sciences (Henderson 1984; Lynch & Walsh 1998; Kruuk 2004). The method uses the similarity among relatives to elucidate the genetic basis of phenotypic variation at the population level. It (1) enables researchers to control for (or study in their own right) confounding factors arising from environmental or other non-heritable sources of similarity between relatives, (2) simultaneously uses relationships beyond parent-offspring or half- and full siblings in the estimation of genetic parameters, thereby increasing the types of populations and organisms that can be studied and (3) is not biased by selection within a population (Lynch & Walsh 1998; Kruuk 2004). Response variables in animal models may be univariate or multivariate, Gaussian or non-Gaussian. Further, solutions to the animal model may be obtained using likelihood or Bayesian approaches (see the ‘Relatedness matrices in the animal model’ section of the Supporting Information; detailed descriptions of the animal model can be found in Lynch & Walsh 1998; Sorensen & Gianola 2002; Kruuk 2004; Mrode 2005).

The phenotypic variance of a quantitative trait can be broken down into additive genetic, non-additive genetic and environmental sources of variation. The non-additive genetic variance can be further subdivided into dominance and epistatic variances. The additive, dominance and epistatic genetic variances are proportional to the probability that individuals share alleles identical by descent at the same locus, for both alleles at the same locus, or for alleles at different loci, respectively. If one knows all the relationships in a population (i.e. the pedigree) then the above genetic variances can be estimated in an animal model.
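As a sketch of how these probabilities enter the model, the decomposition can be written out for two non-inbred relatives i and j. The expressions below are the standard textbook form (e.g. Lynch & Walsh 1998) under random mating and linkage equilibrium, with epistasis truncated at pairs of loci; they are not specific to nadiv. The symbols z (phenotype), A_ij (additive relatedness, an element of A) and Δ_ij (dominance relatedness, an element of D) are notation assumed here for illustration:

```latex
\sigma^2_P = \sigma^2_A + \sigma^2_D + \sigma^2_{AA} + \sigma^2_{AD} + \sigma^2_{DD} + \sigma^2_E

\mathrm{Cov}(z_i, z_j) = A_{ij}\,\sigma^2_A + \Delta_{ij}\,\sigma^2_D
  + A_{ij}^2\,\sigma^2_{AA} + A_{ij}\Delta_{ij}\,\sigma^2_{AD} + \Delta_{ij}^2\,\sigma^2_{DD}
```

Each genetic variance is weighted by a relatedness coefficient that the pedigree determines, which is why a known pedigree suffices, in principle, to separate the components.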

Non-additive genetic variances are seldom, if ever, estimated in ecological and evolutionary analyses (but see Crnokrak & Roff 1995; Waldmann 2001), although the fields of animal and plant breeding have been estimating these genetic variances for over two decades (e.g. Hoeschele 1991; Tempelman & Burnside 1991). This could be, in part, because non-additive genetic effects are assumed to be of little importance in predicting the evolutionary trajectory of moderately sized wild populations (Fisher 1958). Also, studies of wild organisms typically include few individuals per population, especially compared to the millions often handled in animal breeding. This is problematic because small datasets limit the number of random effects that can be included in an animal model (Kruuk 2004) and have been shown to hamper the estimation of dominance variance (Misztal 1997). However, if dominance genetic effects are present, but not included in an animal model, they can potentially bias the prediction of the additive genetic effects as well as the estimate of additive genetic variance (Lynch & Walsh 1998; Ovaskainen, Cano & Merilä 2008; Waldmann et al. 2008; but see Misztal, Lawlor & Fernando 1997).
Additionally, non-additive effects are of central interest to a number of evolutionary hypotheses, for example: dominance and epistasis are expected to contribute substantially to variation in fitness (Wright 1929; Haldane 1932; Fisher 1958; Crnokrak & Roff 1995; Merilä & Sheldon 1999); non-additive variance may determine the extent to which additive genetic variance increases after bottlenecks (Cockerham & Tachida 1988; Goodnight 1988; Willis & Orr 1993; Barton & Turelli 2004); epistasis can shape additive genetic effects and variances during processes such as mutation and selection (Gavrilets 1993; Hermisson, Hansen & Wagner 2003; Carter, Hermisson & Hansen 2005) which has consequences for the evolution of sex and recombination (Charlesworth 1990); epistasis plays an integral part in speciation through the evolution of Dobzhansky-Muller incompatibilities (Crow & Kimura 1970; Orr 1995; Welch 2004); the sign of genetic correlations between fitness-related traits may depend on the amount of dominance variance (Curtsinger, Service & Prout 1994; Roff 1997; Merilä & Sheldon 1999); dominance potentially causes inbreeding depression or heterosis (Roff 1997) especially in small populations of conservation concern (Waldmann et al. 2008); and sex-linked dominance effects may play a role in the evolution of sexually dimorphic traits (Fairbairn & Roff 2006).

Aside from being unable to obtain meaningful estimates of non-additive variances because of the overall size of a population (see “Sampling covariances and confidence intervals” below), the next challenge to including dominance and epistasis in animal models is constructing the non-additive genetic relationship matrices: the dominance matrix D and the three digenic epistatic matrices, additive by additive (AA), additive by dominance (AD) and dominance by dominance (DD). Here the additive genetic relationship matrix is represented by A, and boldfaced, upper-case letters indicate a matrix. A further challenge is to obtain the inverses of these matrices, which are required to solve the system of equations in the animal model. Although the process of constructing the necessary matrix inverses has been worked out (e.g. Hoeschele & VanRaden 1991), only the creation of the additive inverse matrix has been incorporated into the software used by most ecologists and evolutionary biologists: ASReml (Gilmour et al. 2009), MCMCglmm (Hadfield 2010) and WOMBAT (Meyer 2007). This study gives an overview of the software package nadiv (Non-Additive InVerses), implemented in the widely used statistical program R (R Development Core Team 2011), which can be used to construct dominance and epistatic genetic relatedness matrices and their inverses. The inverses can subsequently be used in a variety of animal model software programs for univariate or multivariate analyses of quantitative traits. Below, examples briefly demonstrate the main functions using nadiv’s simulated data set warcolak.

Dominance relatedness matrix construction: makeD()

The relatedness in dominance genetic effects between individuals i and j, or coefficient of fraternity (Δij), can be approximated by:

$$\Delta_{ij} = \frac{\theta_{km}\theta_{ln} + \theta_{kn}\theta_{lm}}{4} \qquad \text{(eqn 1)}$$

(pp. 140–141 in Lynch & Walsh 1998) where k and l represent the dam and sire of i, m and n the dam and sire of j, and θ is the additive genetic relatedness between the individuals noted in the subscripts (elements in A). For a list of coefficients of fraternity between common types of relatives, I refer the reader to Lynch & Walsh (1998, table 24·1 on p. 721) or tables 4 and 5 from Fairbairn & Roff (2006). For the sake of computational tractability, equation 1 assumes no inbreeding and ignores dominance connections through grandparents (Ovaskainen, Cano & Merilä 2008). All pairwise Δij in a population can be approximated under these assumptions using the makeD() function of nadiv. Accounting for inbreeding in the relatedness matrix adds a great deal of complexity to the estimation of dominance in an animal model (Smith & Mäki-Tanila 1990). Although inbreeding can alter the estimates of Δij, de Boer & van Arendonk (1992) showed that estimates of random effects in an animal model are not biased when inbreeding is moderately low and is included as a fixed effect in the model.
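To make the approximation concrete, the following base R sketch builds A by the tabular method for a four-individual toy pedigree and then fills D element by element. It assumes eqn 1 takes the form Δij = (θkm θln + θkn θlm)/4 with θ read from A, that unknown parents are unrelated to everyone, and that individual IDs equal their row numbers. The object names (ped, theta, A, D) are illustrative only, and this is not the algorithm makeD() uses internally:

```r
# Toy pedigree: 1 and 2 are unrelated founders; 3 and 4 are full siblings.
ped <- data.frame(id   = 1:4,
                  dam  = c(NA, NA, 2, 2),
                  sire = c(NA, NA, 1, 1))
n <- nrow(ped)

# Additive relatedness matrix A by the tabular method
# (rows must be ordered so that parents precede their offspring).
A <- diag(n)
for (i in seq_len(n)) {
  d <- ped$dam[i]
  s <- ped$sire[i]
  if (!is.na(d) && !is.na(s)) A[i, i] <- 1 + 0.5 * A[d, s]
  for (j in seq_len(i - 1)) {
    a_jd <- if (is.na(d)) 0 else A[j, d]
    a_js <- if (is.na(s)) 0 else A[j, s]
    A[i, j] <- A[j, i] <- 0.5 * (a_jd + a_js)
  }
}

# Helper: relatedness between two parents, 0 if either is unknown.
theta <- function(p, q) if (is.na(p) || is.na(q)) 0 else A[p, q]

# Assumed form of eqn 1:
# D[i, j] = (theta_km * theta_ln + theta_kn * theta_lm) / 4,
# with k, l the dam and sire of i and m, nn the dam and sire of j.
D <- diag(n)  # diagonal of 1 under the no-inbreeding assumption
for (i in seq_len(n)) {
  for (j in seq_len(i - 1)) {
    k <- ped$dam[i]; l <- ped$sire[i]
    m <- ped$dam[j]; nn <- ped$sire[j]
    D[i, j] <- D[j, i] <-
      (theta(k, m) * theta(l, nn) + theta(k, nn) * theta(l, m)) / 4
  }
}
```

For the full siblings (rows 3 and 4) the loop gives an additive relatedness of 0.5 and a coefficient of fraternity of 0.25, while parent-offspring pairs receive a coefficient of fraternity of 0.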

Similar to algorithms that construct the additive genetic relatedness matrix (or its inverse), makeD() requires a pedigree as the main input. The pedigree must contain three columns, in the order ID, Dam, Sire, and its rows must be ordered such that all parents appear in the ID column before their offspring (if not, see fixPedigree() in pedantics; Morrissey & Wilson 2010). All unknown parents (e.g. those of the base population) should be indicated with ‘NA’, ‘0’ or ‘*’:

ID  Dam  Sire
1   NA   NA
2   NA   NA
3   2    1
4   NA   1
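The two requirements on the pedigree (a single code for unknown parents, and parents listed before offspring) can be checked before any matrix is built. The base R sketch below uses the pedigree shown above, deliberately entered with a mix of the three missing-parent codes; the object names are hypothetical and the check is independent of fixPedigree():

```r
# The example pedigree, entered as text so that the codes for unknown
# parents ("NA", "0", "*") can all be handled the same way.
ped <- data.frame(id   = c("1", "2", "3", "4"),
                  dam  = c("NA", "0", "2", "*"),
                  sire = c("NA", "NA", "1", "1"),
                  stringsAsFactors = FALSE)

# Recode every missing-parent symbol to a true NA.
missing_codes <- c("NA", "0", "*")
ped$dam[ped$dam %in% missing_codes]   <- NA
ped$sire[ped$sire %in% missing_codes] <- NA

# Row position of each parent in the id column (NA if unknown or absent).
dam_row  <- match(ped$dam, ped$id)
sire_row <- match(ped$sire, ped$id)
own_row  <- seq_len(nrow(ped))

# Parents named in the pedigree must have their own row ...
parents_present <- all(!is.na(dam_row[!is.na(ped$dam)])) &&
                   all(!is.na(sire_row[!is.na(ped$sire)]))
# ... and that row must come before the offspring's row.
parents_first <- all(dam_row < own_row, na.rm = TRUE) &&
                 all(sire_row < own_row, na.rm = TRUE)
```

If either flag is FALSE, the pedigree needs to be completed or re-sorted before it is passed on.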

The output of makeD() is a list of objects, from which the inverse of the dominance relatedness matrix can be extracted in two forms, depending upon the program in which it is intended to be used. First, the output Dinv is the inverse of the sparse matrix D and can be included in an animal model using MCMCglmm, as demonstrated below (see the MCMCglmm tutorial in the Supporting Information for more details):

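  > library(nadiv)     # makeD() and the warcolak data set
  > library(MCMCglmm)  # MCMCglmm() and inverseA()
  > warcolak.ped <- warcolak[, 1:3]      # ID, Dam, Sire
  > Ainv <- inverseA(warcolak.ped)$Ainv  # inverse of A
  > Dinv <- makeD(warcolak.ped)$Dinv     # inverse of D
  > warcolak$IDD <- warcolak$ID  # second ID column for the dominance term
  > model.MCMC <- MCMCglmm(phenotype ~ 1,
  + random = ~ID + IDD, data = warcolak,
  + ginverse = list(ID = Ainv, IDD = Dinv))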

The object listDinv is the second form by which the inverse of the dominance relatedness matrix is returned from makeD(). It is formatted so as to facilitate inclusion in either ASReml or the ASReml-R package. This object is in the form of ASReml’s general inverse list (also referred to as a g-inverse or giv; Gilmour et al. 2009), which contains the non-zero elements of the lower triangle of a sparse matrix, in row order. This can be used to include dominance as a random effect in the asreml() function in R (more details in the Supporting Information):

  > ginvA <- asreml.Ainverse(warcolak.ped)$ginv  # inverse of A (ASReml-R)
  > ginvD <- makeD(warcolak.ped)$listDinv        # inverse of D
  > model.asr <- asreml(phenotype ~ 1,
  + random = ~ped(ID) + giv(IDD), data = warcolak,
  + ginverse = list(ID = ginvA, IDD = ginvD))
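The list layout described above (non-zero elements of the lower triangle, in row order) can be illustrated in base R. The matrix M below is a made-up stand-in for an inverse relatedness matrix, and the column names are assumptions for this sketch rather than the names nadiv or ASReml use:

```r
# A small symmetric matrix standing in for an inverse relatedness matrix
# (values are illustrative only).
M <- matrix(c( 2.0,  0.0, -0.5,
               0.0,  1.5,  0.0,
              -0.5,  0.0,  1.0), nrow = 3, byrow = TRUE)

# Keep the non-zero elements on or below the diagonal.
keep <- which(lower.tri(M, diag = TRUE) & M != 0, arr.ind = TRUE)
giv <- data.frame(row = keep[, "row"], column = keep[, "col"],
                  value = M[keep])

# Sort into row order (by row, then by column within each row).
giv <- giv[order(giv$row, giv$column), ]
rownames(giv) <- NULL
```

Only four of the nine elements are stored, which is the saving that makes this format practical for large, sparse inverses.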

The listDinv object can also be written to a text file for inclusion in the analyses using the standalone ASReml program (Supporting Information). Further, this format is very similar to what WOMBAT requires; however, the first two columns must instead be ordered ‘column’ and then ‘row’ (the opposite order of listDinv) and the log determinant of D must also be provided. The first two columns of the list can easily be switched in R before saving the inverse to a file. The log determinant is returned as the object logDet in makeD() (Supporting Information).
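Switching the first two columns and saving the result takes only a few lines of base R. In this sketch the data frame is a hand-made stand-in for listDinv (its column names and values are hypothetical), and the output file name is arbitrary; the exact file layout WOMBAT expects, including where the log determinant is supplied, should be taken from the WOMBAT manual:

```r
# Hypothetical three-column list in the order row, column, value
# (stand-in for listDinv; values are illustrative only).
listDinv <- data.frame(row      = c(1, 2, 3, 3),
                       column   = c(1, 2, 1, 3),
                       Dinverse = c(2.0, 1.5, -0.5, 1.0))

# Swap the first two columns so the order is column, row, value.
wombat_list <- listDinv[, c(2, 1, 3)]

# Write plain whitespace-separated numbers, without headers or row names.
out_file <- file.path(tempdir(), "dominance_inverse.txt")
write.table(wombat_list, file = out_file,
            row.names = FALSE, col.names = FALSE)
```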

Dominance relatedness matrix construction: makeDsim()

Ovaskainen, Cano & Merilä (2008) explain how eqn 1 yields an approximation of Δij and demonstrate a more accurate, iterative method for estimating D, especially for complex pedigrees. Briefly, their method explicitly traces alleles through a pedigree, thereby incorporating effects of inbreeding and alternative routes by which alleles can be shared (two processes left out of eqn 1). By repeatedly implementing this method, an estimate of the coefficient of fraternity (i.e. the probability that two individuals share both alleles identical by descent) is produced, and standard errors for the estimates in the D matrix (which diminish as the number of iterations increases) can be calculated. The difference between the coefficients of fraternity derived from this method and from eqn 1 is explained in Ovaskainen, Cano & Merilä (2008). The function makeDsim() implements this method as described in the appendix to Ovaskainen, Cano & Merilä (2008). R code such as makeDsim(warcolak.ped, N = 10000, calcSE = TRUE) will construct the inverse of D in matrix and list formats for use in an animal model. The resulting output can then be supplied to MCMCglmm, asreml, ASReml or WOMBAT as described earlier and indicated in the Supporting Information. The argument N in makeDsim() supplies the number of iterations and thereby influences the standard error of each entry in D.
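The idea of tracing alleles through a pedigree can be sketched with a simple gene-dropping simulation in base R. This is a simplified illustration for one pair of full siblings, not the makeDsim() implementation: it assumes every individual has either both parents known or both unknown, and it uses a binomial standard error for the estimated proportion. All object names are hypothetical:

```r
set.seed(1)  # for a reproducible illustration

# Toy pedigree: founders 1 and 2; full siblings 3 and 4.
dam  <- c(NA, NA, 2, 2)
sire <- c(NA, NA, 1, 1)
n <- length(dam)
N <- 5000  # number of gene-dropping iterations

share_both <- logical(N)
for (it in seq_len(N)) {
  # Two allele copies per individual; founders get unique allele labels.
  alleles <- matrix(NA_integer_, nrow = n, ncol = 2)
  for (i in seq_len(n)) {
    if (is.na(dam[i]) && is.na(sire[i])) {
      alleles[i, ] <- c(2L * i - 1L, 2L * i)
    } else {
      # Mendelian transmission: one randomly chosen allele from each parent.
      alleles[i, 1] <- alleles[dam[i],  sample.int(2, 1)]
      alleles[i, 2] <- alleles[sire[i], sample.int(2, 1)]
    }
  }
  # Individuals 3 and 4 share both alleles identical by descent when their
  # unordered genotypes match (valid here because neither is inbred).
  share_both[it] <- identical(sort(alleles[3, ]), sort(alleles[4, ]))
}

delta_hat <- mean(share_both)                       # estimated fraternity
delta_se  <- sqrt(delta_hat * (1 - delta_hat) / N)  # shrinks as N grows
```

The estimate should fall close to the expected 0.25 for full siblings, and quadrupling N roughly halves the standard error.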

Epistatic relatedness matrix construction

In addition to the dominance matrix, three digenic epistatic relationship matrices (AA, AD and DD) can be constructed using the functions makeAA() and makeDomEpi() (for example coefficients of relatedness due to digenic epistasis, see p. 145 of Lynch & Walsh 1998). The latter function can construct and invert D, AD and DD all at once to save computing time. The results returned by both functions can be passed to MCMCglmm, asreml, ASReml and WOMBAT in exactly the same way as previously discussed for makeD().
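In the absence of inbreeding, the digenic epistatic relatedness matrices are the element-wise (Hadamard) products of A and D. The base R sketch below shows this for two founders and their two full-sib offspring; the matrices are typed in by hand and the dense solve() call is only sensible at this toy size, so it should not be read as how makeAA() or makeDomEpi() are implemented:

```r
# Relatedness among two founders (1, 2) and their full-sib offspring (3, 4),
# assuming no inbreeding.
A <- matrix(c(1.0, 0.0, 0.5, 0.5,
              0.0, 1.0, 0.5, 0.5,
              0.5, 0.5, 1.0, 0.5,
              0.5, 0.5, 0.5, 1.0), nrow = 4, byrow = TRUE)
D <- diag(4)
D[3, 4] <- D[4, 3] <- 0.25

# Digenic epistatic relatedness matrices as element-wise products.
AA <- A * A
AD <- A * D
DD <- D * D

# Inverse, as needed by the animal model (dense solve() is fine at this size).
AAinv <- solve(AA)
```

For the full siblings this gives coefficients of 1/4 (AA), 1/8 (AD) and 1/16 (DD), which shows how quickly the epistatic covariances between relatives decay.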

Sampling covariances and confidence intervals

One difficulty when estimating non-additive genetic variances is that the covariance between relatives due to non-additive genetic effects is highly confounded with other sources of similarity between relatives (e.g. full siblings also display phenotypic similarities because of shared additive, maternal and environmental effects). The sampling (co)variances for all random effects in an animal model can be informative for determining the extent to which random effects are confounded. These (co)variances of the variance estimates are derived from the ‘Average Information’ matrix in animal models that utilize the Average Information algorithm (Gilmour, Thompson & Cullis 1995) to obtain the Residual Maximum Likelihood (REML) parameter estimates. The function aiFun() extracts the sampling (co)variances from the Average Information matrix in asreml, allowing researchers to evaluate the precision of the variance components and the extent to which they are correlated with one another:

  > aiFun(model = model.asr, Dimnames = c("Va", "Vd", "Ve"))

Further, the Supporting Information demonstrates how a vector of these (co)variances can be obtained from the standalone ASReml or WOMBAT programs and used in R. The output is organized as a matrix with the sampling (co)variances of the variance components as the diagonal and below-diagonal elements and the corresponding sampling correlations as the above-diagonal elements. MCMCglmm uses a Bayesian approach to fitting models, not REML, but similar evaluations can be obtained by inspecting the posterior distributions and autocorrelation for variance components (Supporting Information).
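The matrix layout just described can be reproduced in base R from any sampling (co)variance matrix. The numbers below are invented for illustration and the object names are hypothetical; the sketch only shows the arrangement, not how aiFun() obtains the values:

```r
# Hypothetical sampling (co)variance matrix of three variance components
# (numbers are made up for illustration).
samp_cov <- matrix(c( 0.040, -0.030, -0.008,
                     -0.030,  0.090, -0.050,
                     -0.008, -0.050,  0.060), nrow = 3, byrow = TRUE,
                   dimnames = list(c("Va", "Vd", "Ve"), c("Va", "Vd", "Ve")))

# Sampling correlations between the component estimates.
samp_cor <- cov2cor(samp_cov)

# Combined layout: (co)variances on and below the diagonal,
# correlations above it.
combined <- samp_cov
combined[upper.tri(combined)] <- samp_cor[upper.tri(samp_cor)]

# Standard errors are the square roots of the sampling variances.
se <- sqrt(diag(samp_cov))
```

A strongly negative above-diagonal entry, such as the one between Vd and Ve in this made-up example, is the kind of signal that two components are hard to separate with the available pedigree.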

Determining the extent to which variance components are confounded with one another can also be achieved after an asreml analysis by examining the profile likelihood surface of each component using proLik():

  > profile.add <- proLik(model.asr, component = "ped(ID)!ped")

A profile likelihood is a representation of the model log likelihood when projected onto the parameter space for one particular parameter (or subset of parameters; Meyer 2008). The change in the model log likelihood (calculated as a likelihood ratio test statistic) can then be estimated along a range of values for that parameter, producing a profile likelihood surface. When graphically depicted, using plot.proLik(profile.add), the profile likelihood surface of each variance component in an animal model (Fig. 1) can be visually inspected to yield insights into the ability of the pedigree structure to produce precise and unconfounded variance component estimates (Meyer 2008). Profile likelihoods can also be used to determine confidence intervals for the variance components estimated in a mixed model, which is often more appropriate than using the standard errors (or sampling variances from the Average Information matrix; Meyer 2008). Approximate 1 − α upper and lower confidence limits are returned by the proLik() function, for example as profile.add$UCL and profile.add$LCL, respectively. The accuracy of the approximated confidence limits can be set with the threshold argument.
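The logic of profile likelihood confidence limits can be shown with a deliberately simple model in base R: the variance of a normal sample, with the mean profiled out. This is a toy stand-in for an animal model, not what proLik() does internally; the data are simulated and all object names are hypothetical:

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)  # toy data
n <- length(x)
ss <- sum((x - mean(x))^2)

# Log-likelihood as a function of the variance, with the mean fixed at its
# maximum-likelihood value (the sample mean maximises it for any variance).
loglik <- function(s2) -0.5 * n * log(2 * pi * s2) - ss / (2 * s2)

s2_hat <- ss / n  # maximum-likelihood estimate of the variance

# Likelihood ratio statistic relative to the maximum.
lrt <- function(s2) 2 * (loglik(s2_hat) - loglik(s2))

# 1 - alpha limits lie where the statistic equals the chi-square quantile
# with one degree of freedom.
alpha <- 0.05
crit <- qchisq(1 - alpha, df = 1)
LCL <- uniroot(function(s2) lrt(s2) - crit,
               c(s2_hat / 20, s2_hat), tol = 1e-8)$root
UCL <- uniroot(function(s2) lrt(s2) - crit,
               c(s2_hat, s2_hat * 20), tol = 1e-8)$root
```

Because the profile of a variance is asymmetric, the upper limit lies further from the estimate than the lower limit, something a symmetric interval built from a standard error cannot capture.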

Figure 1.

Log profile likelihoods for the additive (top) and dominance (bottom) genetic variance components estimated from the warcolak dataset. Plots were generated using the nadiv function proLik() to obtain each profile from an animal model fitted using the software ASReml-R and subsequently graphed using plot.proLik(). The 95% confidence interval limits for each variance estimate are indicated where the horizontal dashed line (corresponding to the critical value of the log Likelihood Ratio Test statistic) intersects the profile. X-axis labels correspond to the ASReml-R model terms.

Additional functions

A few other functions are included in nadiv and may be useful to others working with pedigrees and sparse matrices (matrices containing mostly zeroes) in R. Notably, makeA() constructs the additive genetic relatedness matrix. sm2list() converts a sparse matrix (see the Matrix package) to a list consisting of three columns (‘row’, ‘column’ and value – the last being labelled by the user) that contain all non-zero, lower triangle elements of a matrix in row order. Finally, double first cousins are an informative relationship for estimating many types of genetic variance (e.g. Fairbairn & Roff 2006). The function findDFC() determines the number of unique pairs of double first cousins present in a pedigree.
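What counts as a pair of double first cousins can be spelled out in a short base R sketch. It uses a strict definition (all four grandparents known, the two parents of each cousin drawn from two different full-sib families, and no shared parent) on a made-up pedigree where IDs equal row numbers. The helper names family_of and is_dfc are hypothetical, and findDFC() may apply different rules and a different algorithm:

```r
# Grandparents 1-4, parents 5-8 (5, 6 from family 2 x 1; 7, 8 from 4 x 3),
# and grand-offspring 9-11.
dam  <- c(NA, NA, NA, NA, 2, 2, 4, 4, 5, 7, 5)
sire <- c(NA, NA, NA, NA, 1, 1, 3, 3, 8, 6, 8)
n <- length(dam)

# Label for the full-sib family a parent was born into ("dam-sire" of the
# parent), or NA when the parent or either grandparent is unknown.
family_of <- function(p) {
  if (is.na(p) || is.na(dam[p]) || is.na(sire[p])) return(NA_character_)
  paste(dam[p], sire[p], sep = "-")
}

is_dfc <- function(i, j) {
  parents_i <- c(dam[i], sire[i])
  parents_j <- c(dam[j], sire[j])
  fam_i <- c(family_of(dam[i]), family_of(sire[i]))
  fam_j <- c(family_of(dam[j]), family_of(sire[j]))
  # All four grandparents of each individual must be known.
  if (anyNA(fam_i) || anyNA(fam_j)) return(FALSE)
  # The two parents must come from different full-sib families ...
  if (fam_i[1] == fam_i[2]) return(FALSE)
  # ... the cousins must not share a parent ...
  if (any(parents_i %in% parents_j)) return(FALSE)
  # ... and both must draw on the same pair of families.
  setequal(fam_i, fam_j)
}

pairs <- t(combn(n, 2))
dfc <- pairs[mapply(is_dfc, pairs[, 1], pairs[, 2]), , drop = FALSE]
n_dfc <- nrow(dfc)
```

In this pedigree individuals 9 and 11 are full siblings, so the double first cousin pairs are 9 with 10 and 10 with 11.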

Space, speed and saving

Constructing the inverse of D can require a large amount of computer memory and time for large, complex pedigrees. Although some modified methods to address these constraints exist (e.g. Hoeschele & VanRaden 1991; Schaeffer 2003), the functions contained in nadiv can be executed in a timely manner for the size and complexity of pedigrees usually studied in ecology and evolutionary biology (<10 000 individuals), even on personal computers. Additionally, automatic parallelization of the processing is available for many of the functions in nadiv (the default is always to use a single processor), which can often result in dramatic time savings (Supporting Information). Not all computer architectures will allow users to take advantage of this capability in R, so I refer those interested to the package documentation of nadiv for more consideration. Because creating D every session is time prohibitive for large populations, it is advisable to save non-additive inverse matrices to a hard drive. The R functions save() and load() are useful to store and retrieve, respectively, because they preserve the R attributes that are required by the animal model programs in R (i.e. MCMCglmm and asreml).
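A minimal base R sketch of the save-and-restore step follows. The small matrix is a stand-in for a real inverse, and the attribute name and file name are chosen for illustration only:

```r
# A small matrix standing in for an inverse relatedness matrix, carrying
# dimnames plus an extra attribute (name and values are illustrative).
Dinv <- matrix(c(1.0, 0.2, 0.2, 1.0), nrow = 2,
               dimnames = list(c("id1", "id2"), c("id1", "id2")))
attr(Dinv, "rowNames") <- c("id1", "id2")

# save() stores the object under its own name; load() restores that name.
rdata_file <- file.path(tempdir(), "Dinv.RData")
save(Dinv, file = rdata_file)

original <- Dinv  # kept only to compare against the restored copy
rm(Dinv)
load(rdata_file)
```

After load(), the object Dinv is back in the workspace with its dimnames and attributes intact, which would not be the case if it had been written out as a plain text table.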

More information about the functions in nadiv can be obtained from the package documentation (see the Comprehensive R Archive Network website: http://cran.r-project.org/web/packages/nadiv/index.html). For a more thorough treatment of how to use the functions in nadiv, please see the Supporting Information tutorials.

Acknowledgements

Special thanks to V. Careau, D.A. Roff, P.B. de Villemereuil and the participants of WAMBAM 2011 for insightful conversations about non-additive sources of covariance between relatives. Additionally, thanks to M.B. Morrissey, an anonymous reviewer, and the Associate Editor for comments that improved this manuscript. M.E.W. is supported by the National Science Foundation through a Graduate Research Fellowship. This work was supported through an NSF grant to D.J. Fairbairn, D.A. Roff and M.E.W. (DDIG award number 1110617).
