### Abstract

- Top of page
- Abstract
- Introduction
- Uses in animal conservation studies
- An overview of pedigree inference
- Relatedness
- Pairwise methods
- Full-likelihood methods
- Common assumptions of pedigree inference methods
- Complete sampling of population
- Large and randomly mating population
- Linkage equilibrium
- Error and mutation
- Recent developments
- Inclusion of field observation data in the genetic framework
- Conclusion
- Acknowledgements
- References
- Supporting Information

Pedigrees, depicting the genealogical relationships between individuals in a population, are of fundamental importance to several research areas including conservation biology. For example, they are useful for estimating inbreeding, heritability and selection, for studying kin selection and for measuring gene flow between populations. Pedigrees constructed from direct observations of reproduction are usually unavailable for wild populations. Therefore, pedigrees for these populations are usually estimated using molecular marker data. Despite their obvious importance, and the fact that pedigrees are conceptually well understood, the methods and limitations of marker-based pedigree inference are often less well understood. Here we introduce animal conservation biologists to molecular marker-based pedigrees. We briefly describe the history of pedigree inference research, before explaining the underlying theory and basic mechanics of pedigree construction using standard methods. We explain the assumptions and limitations that accompany many of these methods, before going on to explain methods that relax several of these assumptions. Finally, we look to the future and discuss some recent exciting advances such as the use of single-nucleotide polymorphisms, inference of multigenerational pedigrees and incorporation of non-genetic data such as field observations into the calculations. We also provide some guidelines on efficient marker selection in order to maximize accuracy and power. Throughout we use examples from the field of animal conservation and refer readers to appropriate software where possible. It is our hope that this review will help animal conservation biologists to understand, choose and use the methods and tools of this fast-moving field.

### Introduction

Pedigrees, depicting the genealogical relationships between individuals in a population, are one of the best-understood biological concepts. They are of fundamental importance to research areas including conservation biology. For example, they can be used in making estimates of inbreeding (e.g. Collevatti *et al.*, 2007; Blackmore & Heinsohn, 2008; Richard *et al.*, 2009), heritability (e.g. Garant & Kruuk, 2005; Charmantier *et al.*, 2006) and gene flow (Zeyl *et al.*, 2009). They are also important for examining captive populations (e.g. Nielsen, Pertoldi & Loeschcke, 2007) and inferring breeding behaviour.

Pedigrees from direct observations of reproduction are rare for wild populations. Instead, workers use molecular marker data, derived from samples taken from the animals themselves or from scat, fur or feathers, to infer relationships statistically. Despite the importance of these pedigrees and the fact that they are, in principle, well understood, the methods and limitations of their construction are often less well understood. In this review we shed light on these processes and explain the mechanics of pedigree construction, together with its assumptions and limitations, with particular reference to conservation biology.

### Uses in animal conservation studies

Molecular marker-based pedigrees are of great importance in evolutionary and conservation biology. One area where pedigrees are particularly useful is in the study of wild animal mating systems where direct observation of mating is impossible or misleading. Such studies are fundamental to our understanding of life-history strategies, and have important implications for conservation biology.

Before the advent of molecular marker-based pedigrees, accurate estimates of extra-pair paternities (EPPs) were impossible and studies of cooperative breeding were similarly handicapped. Now, numerous studies have used molecular marker-based pedigrees and these indicate that EPPs are common in nature (Akcay & Roughgarden, 2007; Cohas *et al.*, 2007). We now know that relationships in social pedigrees cannot be trusted and genetic measures such as selection or heritability derived from such pedigrees may be biased (Charmantier & Reale, 2005).

An interesting example is provided by Gottelli *et al.* (2007), who examined the mating behaviour of cheetahs *Acinonyx jubatus* in the Serengeti National Park, Tanzania. Generally, male reproductive success is expected to increase with multiple mating and it is, therefore, usually assumed that males are promiscuous while females, for whom the benefits of multiple mating are not obvious, are coy. However, their paternity analysis demonstrated that a high proportion of litters included offspring of more than one male indicating that, for cheetahs, female promiscuity is high. They note that a large proportion of the paternities are inferred to be from unsampled males from outside of the study area and conclude that promiscuity is a strategy to minimize inbreeding. Carpenter *et al.* (2005) also made use of a molecular marker-based pedigree in their recent study of the Eurasian badger *Meles meles* mating system. They also found that badgers exhibit high levels of extra-group matings, attributed to inbreeding avoidance, and suggest that the tactic could help social cohesion by reducing the cost of philopatry.

Estimating inbreeding is often a specific goal for conservation studies, but inbreeding measures that do not require pedigrees, such as heterozygosity, have been criticized (Balloux, Amos & Coulson, 2004). It is thus preferable to use estimation methods that rely on a well-resolved pedigree. For example, Charpentier *et al.* (2006) used a long-term dataset including a molecular marker-based pedigree to estimate inbreeding and examine its correlates with life-history traits in mandrills *Mandrillus sphinx*. They found that inbreeding in females was associated with small body size and an earlier age at first conception. Clearly, inbreeding could have important effects on the dynamics of the population.

Robust estimation of evolutionary parameters like heritability and selection (see Garant & Kruuk, 2005) also requires pedigree data. These parameters can help us understand how species cope with environmental perturbation, or longer-term environmental change, and are therefore fundamental to conservation biology.

Although some parameter estimation methods require only pairwise-relatedness estimates rather than pedigrees (Ritland, 2000), these approaches perform poorly compared with pedigree-based methods (Coltman, 2005). Recently, the ‘animal model’ approach, which requires a pedigree, has gained favour because it allows an entire multigenerational pedigree structure, rather than simply the pairwise relationships, to contribute to parameter estimation (Kruuk, 2004).

A recent study of heritability and selection in lemon sharks in Bimini, Bahamas, demonstrates the utility of molecular marker-based pedigrees (DiBattista *et al.*, 2009). Sharks are heavily harvested worldwide, both directly and as by-catch (Barker & Schluessel, 2005), and because this tends to remove larger individuals from the population, there may be significant selection for smaller body size (Fenberg & Roy, 2008) resulting in evolutionary change in the population. Similar changes in other fish species have been implicated in fishery collapse (Olsen *et al.*, 2004). DiBattista *et al.* (2009) found that both body mass and size, which are predictors of survival, are moderately heritable. Therefore, harvesting that targets large individuals can potentially lead to the population becoming smaller bodied, less fecund and less viable.

It is clear from these examples that molecular marker-based pedigrees can make a great contribution to our understanding of evolutionary and ecological processes in wild-animal populations, all of which have implications for conservation biology. We now offer a brief overview of pedigree inference methods.

### Relatedness

Perhaps the simplest approach to handling these kinds of genetic data is to construct a pairwise-relatedness matrix for the population, rather than assigning discrete relationships. Relatedness coefficients measure the probability of identity-by-descent (IBD) between individuals over the whole genome and several estimators exist (Lynch & Ritland, 1999; van de Casteele, Galbusera & Matthysen, 2001; Wang, 2002). In other words, they are estimates of the probability that two alleles at the same locus, one randomly selected from each individual in a pair (dyad), will have recently descended from a single ancestral allele. At any locus, dyads may share zero, one or two alleles that are IBD and the probabilities of these events depend on their relationship. In a large outbred population, monozygotic (identical) twins, for example, will share two alleles that are IBD at each locus and will have a relatedness of 1. Parent–offspring dyads will share one allele that is IBD per locus and have a relatedness of 0.5. Parent–offspring dyads and full-sib dyads have the same total relatedness (0.5) but distinguishing between them is not difficult because the pattern of relatedness is different. While a parent–offspring dyad always shares one allele IBD, a full-sib dyad shares zero, one or two alleles IBD, with probabilities 1/4, 1/2 and 1/4, respectively. Bink *et al.* (2008) recently compared several relatedness estimators and found that they all performed approximately equally well.
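The link between IBD-sharing patterns and total relatedness can be made concrete with a small sketch. It assumes an outbred population and uses the notation k0, k1 and k2 (not used in the text above) for the probabilities that a dyad shares zero, one or two alleles IBD at a locus, so that expected relatedness is r = k1/2 + k2. The half-sib and unrelated rows are added for comparison only.

```python
from fractions import Fraction

# IBD-sharing probabilities (k0, k1, k2): the chance that a dyad shares
# zero, one or two alleles identical-by-descent at a locus (outbred case).
IBD_PROBS = {
    "monozygotic twins": (Fraction(0), Fraction(0), Fraction(1)),
    "parent-offspring": (Fraction(0), Fraction(1), Fraction(0)),
    "full sibs": (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),
    "half sibs": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "unrelated": (Fraction(1), Fraction(0), Fraction(0)),
}


def relatedness(k0, k1, k2):
    """Expected relatedness r = k1/2 + k2 for an outbred dyad."""
    if k0 + k1 + k2 != 1:
        raise ValueError("IBD-sharing probabilities must sum to 1")
    return k1 / 2 + k2


for name, ks in IBD_PROBS.items():
    print(f"{name}: k = {tuple(str(k) for k in ks)}, r = {relatedness(*ks)}")
```

Parent–offspring and full-sib dyads give the same r but differ in (k0, k1, k2), which is why the two relationships can be told apart even though their total relatedness is identical.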

The utility of relatedness estimates depends on how they are used. Although relatedness estimates of particular dyads are unlikely to be accurate with the typical marker information in current practice, the error in these estimates would be averaged away across numerous dyads in a moderate-sized sample. It follows that using relatedness estimates to address questions like ‘are juveniles more related than adults?’ or ‘are females more related than males?’, in order to investigate age- or sex-biased migration, is likely to produce useful results given sufficient statistical power. However, other questions require assignment to discrete relationships so, although relatedness measures are useful for addressing a range of questions, it is often expedient to infer discrete relationships rather than relatedness *per se*. Several approaches exist for doing this and they can be split into two camps: pairwise and full-likelihood methods.

### Pairwise methods

Pairwise methods consider dyads for a number of candidate relationships. When considering a particular dyad, all other individuals, and their relationships, have no influence on the inference of the focal dyad's relationship. For parentage inference, in an ideal world, we might use an exclusionary approach. Exclusionary approaches for sibship inference are limited to the case of large full-sibship groups. This is because dyads are never excludable as full-siblings in diploid species no matter how many markers are used in the analysis.

With exclusionary parentage inference several genetic markers are considered and individuals that do not share any alleles at one or more loci are excluded. In paternity analysis, for example, this would ideally leave a single candidate as the father (the principle is the same with maternity analysis). Unfortunately, incomplete candidate sampling (i.e. missing individuals) and insufficient markers mean that it is rarely possible to exclude all but one individual with certainty. In addition, genotyping errors and mutations contribute to false exclusions (Wang, 2004). Therefore, we must turn to likelihood-based methods, which assess the likelihood of one hypothesis relative to another.
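The exclusion rule described above can be sketched in a few lines. This is a minimal illustration assuming error-free diploid genotypes stored as one pair of allele labels per locus; the function names, candidate names and allele sizes are hypothetical.

```python
def compatible(offspring, candidate):
    """True if the candidate shares at least one allele with the
    offspring at every locus (i.e. cannot be excluded as a parent)."""
    return all(set(o) & set(c) for o, c in zip(offspring, candidate))


def non_excluded(offspring, candidates):
    """Names of the candidates that survive strict exclusion."""
    return [name for name, geno in candidates.items()
            if compatible(offspring, geno)]


# Hypothetical three-locus genotypes (allele sizes are made up).
offspring = [(101, 105), (200, 204), (150, 150)]
candidates = {
    "M1": [(101, 103), (204, 208), (150, 152)],
    "M2": [(103, 107), (200, 204), (150, 150)],  # no shared allele at locus 1
    "M3": [(105, 105), (200, 202), (148, 150)],
}

print(non_excluded(offspring, candidates))
```

Here M2 is excluded but two candidates remain, which illustrates why too few markers rarely leave a single father. A single genotyping error in the true parent would also exclude it under this strict rule.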

Likelihood-based methods are categorized as categorical (discrete) or fractional. Categorical methods aim to assign dyads to particular relationships. In sibship inference, for example, they ask, ‘Given the genotypes of two individuals, what is the most likely relationship: full-sib, half-sib or unrelated?’, while in parentage inference they ask, ‘Given the genotypes, which of these candidate individuals is most likely to be the father (or mother)?’. Fractional methods, which are restricted to parentage analysis, split the relationship probabilistically among compatible individuals. With both approaches, likelihood is calculated using the rules of Mendelian segregation of alleles between parents and offspring. Equations for calculating these probabilities are available elsewhere (e.g. Thompson, 1975; Marshall *et al.*, 1998; Wang, 2004). Both the polymorphism of the molecular markers and the number of markers used will influence the efficacy of these approaches (Box 1).

Box 1. Marker selection

Every marker used in an analysis contributes information. Butler *et al.* (2004), based on the performance of a number of parentage assignment algorithms on simulated and empirical datasets, suggested that six to eight loci, with at least eight alleles each, should be used. However, the number of markers required will depend on such factors as the (unknown) family structure and marker polymorphism. Workers should select markers that maximize the amount of useful information they provide, while minimizing any increase in potential error.

Increasing both the number of loci and allelic diversity (i.e. degree of polymorphism at each locus) will tend to increase the confidence in relationship assignments [although the number of loci is more influential than their allelic diversity (Bernatchez & Duchesne, 2000)]. However, workers should note that, where the sample itself is used to estimate population allele frequencies, using highly polymorphic loci can have the side-effect of inflating the error in allele frequency estimates unless the sample size is large (Gomez-Uchida & Banks, 2005; Kalinowski, 2005). Another factor to consider is that, as polymorphism at a locus increases, so does the potential for scoring error (Buchan *et al.*, 2005; Hoffman & Amos, 2005). The addition of noise in this manner will only be a problem when using methods that do not account for error (see ‘Error and mutation’).

Markers are usually assumed to be independent. Therefore, the use of tightly linked, non-independently assorting loci introduces a pseudo-replication problem to the analysis. Failure to account for linkage results in an overestimate of precision, and therefore overconfidence in the inference made. A number of methods apply a correction to account for this problem (see main text), but for all but the most tightly linked loci the problem is likely to be minor.

Lastly, for conventional methods that only consider one or two generations, the inclusion of non-neutral loci is not problematic. However, for multigenerational approaches (which, like conventional methods, assume allele frequencies are fixed) these types of markers may pose problems because allele frequencies may change systematically via selection across generations (Estoup *et al.*, 2002), and should be avoided as they will reduce the accuracy of inference. Selkoe & Toonen (2006) provide an overview of marker selection issues.

With the categorical approach we examine the likelihoods of the putative relationships and accept the candidate relationship that is significantly more likely than any other. With the fractional parentage approach the parentage is split among every non-excluded candidate in proportion to their relative likelihood, such that the one with the highest likelihood receives the highest proportion and the others receive smaller proportions (summing to 1). Although the fractional approach has some advantages (e.g. it can be used even when the discriminatory power of loci is low, it uses all available data and it produces a probability distribution of the pedigree), it is not as commonly used as categorical methods. Henceforth, we concentrate on categorical assignment methods.
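The fractional allocation described above amounts to normalizing the candidates' likelihoods so that they sum to 1. A minimal sketch, in which the candidate names and likelihood values are hypothetical and an excluded candidate is represented by a likelihood of zero:

```python
def fractional_allocation(likelihoods):
    """Split one offspring's parentage among candidates in proportion
    to their likelihoods; the returned fractions sum to 1."""
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("no candidate has a positive likelihood")
    return {name: lik / total for name, lik in likelihoods.items()}


# Hypothetical parentage likelihoods for three candidate fathers.
likelihoods = {"M1": 0.006, "M2": 0.0, "M3": 0.002}
print(fractional_allocation(likelihoods))
```

A categorical method would instead award the whole offspring to M1, provided M1 is judged sufficiently more likely than M3.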

The precise algorithms used to make categorical assignments of parental or sibship relationships vary. For parentage analysis, the simplest method is to award parentage to the individual with the highest logarithm of the likelihood ratio, or LOD score (the likelihood ratio is the likelihood of parentage of a particular individual relative to the likelihood that the individual is unrelated to the offspring in question). However, we should only award parentage if the best candidate is significantly more likely than the second best candidate (Marshall *et al.*, 1998).

We will now describe the algorithm used by the popular parentage inference program, cervus (Marshall *et al.*, 1998). cervus begins by calculating the LOD scores for all possible pairings between a focal individual and the candidate parents. Rather than simply assigning paternity or maternity to the individual with the highest LOD score it compares the LOD score of the most likely candidate with that of the next most likely candidate (by taking the difference in the LOD score, Δ). The magnitude of Δ indicates our confidence in assigning parentage to this particular individual and its statistical significance is tested by comparing with a null distribution of Δ generated from simulated parentage. If the magnitude of Δ is deemed satisfactory then parentage is assigned to that individual, otherwise it remains unassigned. The simulations carried out by cervus can also be informative of the power of a study. For example, in a pilot study one could examine the change in assignment rate with increasing number of alleles or loci, and use the information to decide how many markers to use.
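The LOD and Δ calculations can be sketched as follows. This is a simplified illustration and not the cervus implementation: it assumes error-free genotypes, Hardy–Weinberg proportions, no information on the other parent and natural logarithms. All function names are hypothetical. In practice the critical value of Δ comes from simulation, as described above, and is not computed here.

```python
import math


def genotype_prob(geno, freqs):
    """Hardy-Weinberg probability of an unordered diploid genotype."""
    a, b = geno
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def transmission_prob(offspring, parent, freqs):
    """P(offspring genotype | one known parent), with the other allele
    drawn at random from the population allele frequencies."""
    a, b = offspring
    total = 0.0
    for x in parent:  # each parental allele is transmitted with prob 1/2
        if a == b:
            p = freqs[a] if x == a else 0.0
        else:
            p = (freqs[b] if x == a else 0.0) + (freqs[a] if x == b else 0.0)
        total += 0.5 * p
    return total


def lod(offspring, candidate, freqs_by_locus):
    """Sum over loci of ln[P(offspring | parent) / P(offspring | unrelated)]."""
    score = 0.0
    for o, c, f in zip(offspring, candidate, freqs_by_locus):
        t = transmission_prob(o, c, f)
        if t == 0.0:
            return -math.inf  # Mendelian mismatch: candidate is excluded
        score += math.log(t / genotype_prob(o, f))
    return score


def delta(lods):
    """Best candidate and the gap between the two highest LOD scores.
    Assumes at least two candidates, with a finite best score."""
    ranked = sorted(lods.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best, top - second


# Hypothetical single-locus example with two equally common alleles.
freqs = [{1: 0.5, 2: 0.5}]
scores = {
    "homozygous candidate": lod([(1, 1)], [(1, 1)], freqs),
    "heterozygous candidate": lod([(1, 1)], [(1, 2)], freqs),
}
print(scores, delta(scores))
```

Parentage would be awarded to the best candidate only if Δ exceeds a critical value taken from the simulated null distribution; otherwise the offspring stays unassigned.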

Sibship assignment algorithms tend to be slightly more straightforward, usually assigning the most likely relationship of a number of candidate relationships. Some algorithms assign both full- and half-sibships while others assign only full-sibships.

### Full-likelihood methods

Although pairwise methods have been fruitful, they discard potentially valuable information and are therefore inefficient. In addition, they can result in incompatibilities in the pedigrees produced. For example, when examining the relationship between individuals A, B and C, the dyads A–B and A–C may be inferred as full-sibs, while B–C may be inferred as half-, or non-sibs. In a parentage analysis the dyads A–C and B–C may be inferred as father–offspring and mother–offspring, respectively, but when considered jointly, the trio A–B–C may be revealed as incompatible with a parent-pair and offspring relationship.

A major advance in the field, to which we now turn, was the development of full-likelihood methods which are more accurate than pairwise methods (Thomas & Hill, 2002; Wang, 2004). Several algorithms exist (Emery *et al.*, 2001; Smith, Herbinger & Herbinger, 2001; Thomas & Hill, 2002; Wang, 2004; Wang & Santure, 2009) but their common feature is that, unlike pairwise methods, they retain the information lost when individuals other than the focal pair are ignored. For example, in parentage assignments, a single offspring only provides information for a single allele at a parental locus. However, with more offspring in the set considered (e.g. a group of siblings), the probability that both parental alleles are represented is increased, and consequently the power of parental assignment is improved. Another major benefit is that relationship incompatibilities of the kind described above are avoided.

The methods aim to reach a solution that maximizes the likelihood of the entire pedigree configuration given the marker data. There is an astronomical number of potential configurations to test, even for small samples of individuals, and the approach is computationally intensive. Therefore, techniques such as Markov chain Monte Carlo or simulated annealing are used to explore the space of possible pedigree configurations and find a solution that maximizes likelihood. This space is traversed according to certain rules, which aim to move towards areas with higher likelihood values while avoiding getting stuck in local maxima (Wang & Santure, 2009).
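The search strategy can be illustrated with a toy simulated-annealing routine that partitions individuals into sibship groups. This is a sketch of the general technique only, not the algorithm of any of the programs cited. In particular, `partition_score` is a hypothetical stand-in that sums made-up pairwise support values, whereas a real full-likelihood method would evaluate the Mendelian likelihood of the whole configuration. The cooling schedule and step count are arbitrary assumptions.

```python
import math
import random


def partition_score(labels, pair_score):
    """Toy stand-in for a log-likelihood: sum of pairwise support values
    over all pairs of individuals placed in the same group."""
    n = len(labels)
    return sum(pair_score[i][j]
               for i in range(n) for j in range(i + 1, n)
               if labels[i] == labels[j])


def anneal(pair_score, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Search over groupings, sometimes accepting worse moves so that the
    search can escape local maxima; returns the best grouping visited."""
    rng = random.Random(seed)
    n = len(pair_score)
    labels = list(range(n))  # start with every individual in its own group
    current = partition_score(labels, pair_score)
    best_labels, best = labels[:], current
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(n)
        old = labels[i]
        labels[i] = rng.randrange(n)  # propose moving i to another group
        proposed = partition_score(labels, pair_score)
        if proposed >= current or rng.random() < math.exp((proposed - current) / t):
            current = proposed
            if current > best:
                best_labels, best = labels[:], current
        else:
            labels[i] = old  # reject the move
    return best_labels, best


# Hypothetical support values: individuals 0-1 and 2-3 look like sib pairs.
pair_score = [
    [0, 2, -2, -2],
    [2, 0, -2, -2],
    [-2, -2, 0, 2],
    [-2, -2, 2, 0],
]
print(anneal(pair_score))
```

Early in the run the high temperature lets the search accept worse groupings fairly often; as the temperature falls the search becomes increasingly greedy.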

Full-likelihood methods are best-suited to populations that have large family groups; for example, populations with large full-sibships, or that are highly polygamous (Wang & Santure, 2009). For practical purposes, the family group size depends both on the actual genetic structure of the population and on the sampling regime used to collect the data. As family group size declines, the amount of extra information exploited by full-likelihood methods compared with pairwise-likelihood methods also declines and the computational challenges posed by analysing large datasets with little family structure (e.g. Almudevar, 2007) may eventually outweigh the increased accuracy of the method.

Most full-likelihood approaches consider either sibships, or parentage, not both. However, the algorithm presented by Wang & Santure (2009) allows the joint inference of both parentage and sibship and as such represents a major advance in pedigree inference.

### Linkage equilibrium

Usually, methods assume that genetic markers are unlinked and in linkage equilibrium. When two loci are close together on a chromosome, alleles at the loci may not assort independently, and will tend to be transmitted to the offspring as a pair. Although loci may not be linked functionally, they are clustered on the genome (Bachtrog *et al.*, 1999) and as researchers use more markers in their analyses [e.g. single nucleotide polymorphisms (SNPs), below] significant linkage becomes increasingly likely (Abecasis & Wigginton, 2005) and must be tested for and, perhaps, accounted for.

Linkage between loci, and the resulting non-independence of the information they provide, is essentially a statistical pseudo-replication problem for relationship inference. Although the estimate may not be biased, the true error around it would be larger than the analysis suggests. Therefore, to avoid Type 1 error (false rejection of the null hypothesis), one of the pair of linked loci should either be discarded or down-weighted in the analysis. Fortunately, methods to account for linkage exist, mainly relying on an estimated linkage map depicting the strength of association between pairs of loci (Epstein, Duren & Boehnke, 2000; McPeek & Sun, 2000). The quantitative accuracy (i.e. recombination rate between loci) of such a map is not as important as the qualitative accuracy (i.e. relative positions of markers on a chromosome).

Linkage is not all bad news though: linkage between loci provides additional information that can be used to distinguish between certain relationship types (Boehnke & Cox, 1997). In addition, for specific relationship–sex combinations, the inclusion of X-linked markers can improve our ability to correctly infer relationships, for example by allowing the improved differentiation of second degree relationships such as cousins (Epstein *et al.*, 2000).

### Error and mutation

The final assumptions are that genetic data are free of error and mutation. Although these are two separate assumptions, error and mutation are usually not distinguishable and the effects on the analysis are the same. We therefore deal with them together here. The assumptions are universally violated (Bonin *et al.*, 2004; Wang, 2004; Hoffman & Amos, 2005; Soulsbury *et al.*, 2007) because mutations are widespread and errors cannot be totally eliminated. For microsatellites, errors include allelic dropouts (where PCR fails to amplify one of an individual's two homologous genes, one from each parent, at a locus; Dakin & Avise, 2004), false alleles (polymerase errors rendering an allele other than the true one), miscalling (allele identification error), contaminant DNA and data entry error (Dakin & Avise, 2004; Wang, 2004). Such errors can produce apparent failures of Mendelian inheritance and lead to incorrect relationship assignments, departures from Hardy–Weinberg equilibrium, overestimated inbreeding, etc. Even minor errors may lead to the incorrect classification of a monozygotic twin relationship as a full-sib. Butler *et al.* (2004) showed that sibship algorithms that follow strict Mendelian inheritance rules are not robust to most kinds of errors. In fact, they showed that errors can cause >70% of individuals to be misclassified. This underlines the importance of selecting methods that are robust to error.

Several workers provide methods to identify and cope with errors in pairwise parentage inference (e.g. Marshall *et al.*, 1998; Kalinowski, Taper & Marshall, 2007) and full-likelihood sibship and parentage inference (e.g. Wang, 2004; Wang & Santure, 2009). Marshall *et al.*'s (1998) approach was to include an error rate parameter to account for imperfections in the data, while Wang's (2004) approach explicitly models error by distinguishing between observed and actual genotypes while estimating the likelihood of a particular pedigree configuration. The approach can identify and account for genotype errors (or mutations) at each locus of each sampled individual.
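The effect of an error-rate parameter can be illustrated with a deliberately simplified mixture model. This is an assumption made for illustration and does not reproduce the equations of Marshall *et al.* (1998), Kalinowski *et al.* (2007) or Wang (2004): with probability 1 − e the offspring genotype follows Mendelian transmission from the candidate, and with probability e it is treated as a random draw from the population. The function names and the value of e are hypothetical.

```python
import math


def genotype_prob(geno, freqs):
    """Hardy-Weinberg probability of an unordered diploid genotype."""
    a, b = geno
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def transmission_prob(offspring, parent, freqs):
    """P(offspring genotype | one known parent), error-free case."""
    a, b = offspring
    total = 0.0
    for x in parent:  # each parental allele is transmitted with prob 1/2
        if a == b:
            p = freqs[a] if x == a else 0.0
        else:
            p = (freqs[b] if x == a else 0.0) + (freqs[a] if x == b else 0.0)
        total += 0.5 * p
    return total


def locus_ratio_with_error(offspring, parent, freqs, error_rate=0.01):
    """Likelihood ratio (parent vs unrelated) at one locus under the
    simplified mixture: (1 - e) * Mendelian + e * random genotype."""
    p_random = genotype_prob(offspring, freqs)
    p_mendel = transmission_prob(offspring, parent, freqs)
    return ((1 - error_rate) * p_mendel + error_rate * p_random) / p_random


freqs = {1: 0.5, 2: 0.5}
# A Mendelian mismatch: strict rules give a ratio of 0 (exclusion),
# whereas the error model gives a small but non-zero ratio.
strict = locus_ratio_with_error((1, 1), (2, 2), freqs, error_rate=0.0)
tolerant = locus_ratio_with_error((1, 1), (2, 2), freqs, error_rate=0.01)
print(strict, tolerant, math.log(tolerant))
```

Under this model a single mismatching locus penalizes a candidate heavily but no longer excludes it outright, so strong support at the remaining loci can still outweigh one mistyped genotype.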

### Inclusion of field observation data in the genetic framework

The ultimate goal of most studies that infer pedigrees is to estimate a population-level parameter, rather than generating the pedigree itself. Categorical methods result in a pedigree that is assumed to be true, and uncertainty is usually ignored. Fractional methods result in a probability distribution of a pedigree and thus take uncertainty into account.

However, both fractional and categorical methods suffer because the processes of pedigree inference and parameter estimation are divorced from each other. Without modification, estimates of population-level parameters are biased towards estimates that would be expected under random mating (e.g. the naïve prior that all fathers are equally likely to be the true father). Adjustment of priors to correct for this (e.g. Neff, Repka & Gross, 2001) is one approach to cope with this, but a novel approach is to estimate population-level parameters jointly with the pedigree inference. Hadfield, Richardson & Burke (2006) illustrated this concept for paternity inference with the example of a fictional study with 20 candidate fathers and 20 offspring, where each father is the social father of one offspring (i.e. it behaves as the father, even though it may not be the true father). For 19 of the offspring, the genetic data support the social father as the true father. However, for the remaining individual, the genetic data give equal support for the social father and an unknown male (i.e. a potential EPP). Using traditional methods, support for the two potential fathers remains equal. However, using their novel method, support for the social male is greater because the data indicate that the social male is inherently more likely to be the true father.
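The intuition behind the fictional 20-offspring example can be sketched numerically. The code below is not the method of Hadfield *et al.* (2006), which estimates the pedigree and the population-level parameters jointly. It is a two-step simplification that assumes the 19 clear-cut assignments are taken as known, that the rate of within-pair paternity is estimated from them with a uniform Beta(1, 1) prior, and that this estimate is then used as the prior for the ambiguous offspring. All names are hypothetical.

```python
def posterior_social_father(lik_social, lik_other, n_within, n_resolved,
                            prior_a=1.0, prior_b=1.0):
    """Posterior probability that the social father is the true father,
    using a within-pair paternity rate estimated from resolved offspring
    (posterior mean under a Beta(prior_a, prior_b) prior)."""
    w = (n_within + prior_a) / (n_resolved + prior_a + prior_b)
    numerator = w * lik_social
    return numerator / (numerator + (1 - w) * lik_other)


def naive_posterior(lik_social, lik_other):
    """Equal prior weight for both candidate fathers."""
    return lik_social / (lik_social + lik_other)


# Genetic data support both candidates equally for the 20th offspring;
# the likelihood values themselves are made up.
lik = 0.004
print(naive_posterior(lik, lik))
print(posterior_social_father(lik, lik, n_within=19, n_resolved=19))
```

With equal genetic support, the naive calculation leaves the two candidates tied, whereas information from the other 19 offspring shifts support strongly towards the social father (20/21 under these assumptions).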

This method can be adapted so that any other population-level parameter can be estimated simultaneously with the pedigree, and can therefore contribute to paternity inference. It is easy to envisage a case where geographic distance could contribute to the pedigree estimation. Simulations show that their approach is more accurate than the simple categorical approach that would be implemented by, for example, cervus (Hadfield *et al.*, 2006). We envisage that this method, implemented in MasterBayes (Hadfield *et al.*, 2006), an R package (R Development Core Team, 2008), could be broadened in scope to include sibship inference within the same framework.