### Summary

- Top of page
- Summary
- 1 Introduction
- 2 Likelihood-Based Relationship Estimation
- 3 The Sample Space and the Prior
- 4 Examples
- 5 Discussion
- Acknowledgements
- References

The objective of this paper is to show how various sources of information can be modelled and integrated to address relationship identification problems. Applications come from areas as diverse as evolution and conservation research, genealogical research in human, plant and animal populations, and forensic problems including paternity cases, identification following disasters, family reunions and immigration issues. We propose assigning a prior probability distribution to the sample space of pedigrees, calculating the likelihood based on DNA data using available software and posterior probabilities using Bayes' Theorem. Our emphasis here is on the modelling of this prior information in a formal and consistent manner. We introduce the distinction between local and global prior information, whereby local information usually applies to particular components of the pedigree and global prior information refers to more general features. When it is difficult to decide on a prior distribution, robustness to various choices should be studied. When suitable prior information is not available, a flat prior can be used which will then correspond to a strict likelihood approach. In practice, prior information is often considered for these problems, but in a generally ad hoc manner. This paper offers a consistent alternative. We emphasise that many practical problems can be addressed using freely available software.

### 1 Introduction


There are many situations which require determination of the true pedigree connecting a given set of individuals from genetic marker data. A small pedigree is usually sufficient to describe the relationship between people involved in a paternity case (Essen-Möller, 1938). Similar or slightly more complicated structures are typically required to describe immigration or family reunion cases (Hansen & Morling, 1993), and identification problems following disasters could potentially require quite large pedigrees (Gill *et al.* 1994; Olaisen *et al.* 1997). Relationship identification is also important in animal and plant applications. Bowers *et al.* (1999) used parentage analysis to study the origins of wine grapes from Northeastern France, for example. Meagher & Thompson (1987) applied pedigree reconstruction methods to identify maternal and paternal parents of seedlings in a natural population to assess pollen dispersal patterns and evaluate reproductive success. Jones & Ardren (2003) reviewed methods for parentage assignment in natural populations and Blouin (2003) reviewed existing methods for relationship estimation from different areas of application and discussed their relevance to wild populations. Clearly, these problems can all be described in terms of determining the most likely pedigree from a set of possible alternatives although this is not usually made explicit.

Misspecification of relationships can severely distort the results of a genetic linkage analysis. Similarly, genetic association studies can be affected if there are close relationships within or between cases and controls that were not specified at recruitment. Much of the literature on inferring relationships for humans concentrates on pairwise relationships and uses genome-scan data, either with a view to correcting for errors prior to a linkage/association study or to identifying sets of individuals likely to share longer haplotypes around susceptibility alleles (Boehnke & Cox, 1997; Göring & Ott, 1997; McPeek & Sun, 2000; Stankovich *et al.* 2005). We wish to focus more generally on specifying the relationships amongst several individuals and will usually expect data for no more than 15-20 autosomal microsatellite marker loci. More importantly, we will typically be interested in determining the true relationship for these particular individuals, rather than obtaining an overall idea of the graph structure connecting them. Our focus is hence less on large pedigree reconstructions of genetic isolates where the latter emphasis would typically apply, and more on forensic problems, wildlife population management applications or, perhaps, construction of pedigrees for linkage analysis from a large population study.

Reconstruction of the pedigree connecting a set of individuals from genetic marker data using likelihood criteria dates back over three decades (Thompson, 1974, 1975, 1976a). Besides the marker data, there may sometimes be additional information which is frequently used in practice, although it is not always incorporated within a formal modelling framework. Ambiguities due to symmetries in likelihoods for pairwise relationships can often be resolved, for instance, by using age information to distinguish between the parent and offspring when such a relationship is indicated by the genotype data, and by using sex information to establish whether it is a maternal or paternal relationship (Thompson, 1975). Three pairwise relationships, half-sibs, grandparent-grandchild and aunt-niece, were discussed by Thompson (1986) as indistinguishable on the basis of any amount of data at independently segregating loci. In the absence of relevant non-genetic information, extra DNA data such as haploid data (i.e. mtDNA or Y-chromosome data), information on additional relatives (Sieberts, Wijsman & Thompson, 2002) or linked markers (Thompson & Meagher, 1998) may resolve the problem.

Most interesting identification cases cannot be resolved without some kind of extra information. At least 50 unlinked microsatellite markers are required to distinguish between a pair of half-siblings and unrelated individuals (Blouin, 2003), although this number can be approximately halved when maternal profiles are also available (Mayor & Balding, 2006). For our purposes, any information that is in addition to the DNA marker data is defined as “prior” information. In practice, such information is not always thought of as prior information, either because it is often incorporated in an informal way, or because of existing prejudices against a Bayesian inferential process. Our approach can be seen as an extension of Egeland *et al.* (2000), as implemented in the freeware package Familias (http://www.nr.no/familias), by allowing for prior knowledge about parts of the structure to be specified and introducing the idea of distinguishing clearly between different sorts of prior information in terms of practical implementation. It comprises four steps: define a sample space of pedigrees, assign a prior probability distribution on this space, calculate the likelihood of the DNA data for each pedigree, and compute posterior probabilities using Bayes' Theorem. The emphasis in this paper is on the first two steps. This four-step procedure allows for a practical implementation whereby each step can be performed separately using whatever tools are available. We show that many of the existing approaches to inclusion of extra information can be viewed as special cases of the simple structured prior function we propose. Of course, a single, albeit general, prior is unlikely to meet the needs of all users so it is important that it can be conveniently tailored to a specific application. The prior function we propose has this facility and can be easily modified.

The main advantage of modelling additional information on an identification problem in a formal way is that all the relevant information is stated “up front” as part of the prior function. Hence, it can be incorporated in the most efficient and appropriate way and not simply be brought in to resolve a particular dilemma when a standard likelihood procedure reaches an impasse. This paper illustrates how many apparently diverse contributions to the literature on estimating relationships can be brought together successfully. As noted by Blouin (2003), the animal genetics literature often seems unaware of relevant developments in human genetics but the reverse argument could also be made. The forensic science literature adds yet another dimension. Since all are working on closely related problems for which many of the key ideas were detailed thirty years ago, it is important to combine the expertise in these areas so that such issues do not continue to be addressed in parallel. The outline of this paper is as follows. We will begin with a brief review of likelihood-based pedigree reconstruction from DNA data and discuss some approaches to incorporating relevant additional information. We will then introduce our prior function and show how many existing approaches from hitherto quite separate sources can be viewed as special cases. We will finally discuss the advantages and limitations of this modelling approach together with illustrations of how it can be used on real and simulated data.

### 2 Likelihood-Based Relationship Estimation


In theory, estimating the pedigree for a given set of individuals from genetic marker data is simple: all one has to do is consider all possible joint relationships amongst them and compute the likelihood for each (Cannings & Thompson, 1981). However, in a captive breeding programme, for example, precise estimation of the relationships between the existing animals who will be the founders of the future population might require consideration of a vast number of possible alternatives. Brute-force enumeration is therefore not always feasible and a different approach is required. One such alternative is provided by the sequential algorithm of Thompson (1976a) which arrives at a single reconstruction by starting from a position where all individuals are assumed to be unrelated and gradually accepting sibships on the basis of the increase attained in log-likelihood, or *support*. This method is most successful in reconstructing pedigrees connecting a set of closely related individuals and tends to favour large sibships. In particular, it is assumed that the parents of each individual in the sample of interest are either included in the sample themselves, or else are unrelated to any other members of the sample. In general, highly polymorphic loci are better for excluding false parent-offspring links, but large numbers of loci are more informative in terms of specifying the relationship since it is the proportion of loci at which individuals have alleles in common that is relevant.

Comparing two pedigrees using likelihood criteria will depend on the allele frequencies assumed for the founders. Rare alleles also tend to give a high likelihood to gene identity by descent and more inbred structures could be heavily favoured in extreme cases. The joint relationship between several individuals determines the probabilities of genetically distinct classes of gene identity states (Thompson, 1974). Maximum likelihood estimates of these probabilities, and hence of the relationship, can be obtained using standard methods. Hence, amongst all possible alternative relationships, the true relationship is one of those that maximise the expected log-likelihood “regardless of the number of individuals, the complexities of their relationship, the number of loci for which data are available and the frequencies of alleles and dominance systems exhibited by these loci” (Thompson, 1976b). Reconstruction methods, such as the sequential procedure outlined above, build up an estimated structure that has high overall likelihood from subunits. The focus here is on those subsets of individuals giving maximum likelihood increase for the acceptance of a given relationship, relative to the alternative that they are unrelated. For example, which of all non-excluded parent pairs are the most likely parents of *A*? The true parents will not necessarily maximise the expected log-likelihood. In fact, a sibling will often have a higher expected log-likelihood for the parent hypothesis than a true parent (Thompson, 1986). This will remain the case, however many loci are considered. How important this is will depend on the application. If the main focus of a reconstruction is on the overall shape of the pedigree, with a view to gaining anthropological information on age and mating patterns, for example, finding a highly likely pedigree will generally suffice. 
If the focus, however, is on precise identification of specific relationships, such as in a forensic setting, the set of all possible relationships amongst the given individuals is the correct set of alternatives to consider.

Cannings & Thompson (1981) also suggested that a pedigree can be represented using cluster analysis or multidimensional scaling methods (Thompson, 1974) on genetic distance measures, but argued that such approaches rarely provide clarification of the relevant pedigree structure. Cowell & Mostad (2003) combined clustering and likelihood approaches by defining a likelihood-ratio-based distance measure of pairwise relatedness and clustering individuals using an estimate of this measure.

The sequential algorithm of Thompson (1976a) exploits age and sex information. Age data, or at least an age ordering, are required initially to sort individuals by descending maternal age. Age data are also used for generation gap restrictions in that mothers are constrained to be between 15 and 50 years older than any of their offspring, while fathers must be between 15 and 75 years older. Finally, age data are used to assess the plausibility of selected sibships. Almudevar (2003) considered maximum likelihood pedigree reconstruction via a simulated annealing algorithm that begins with an enumeration of parent-offspring triplets and assembles them into sets of admissible pedigrees on which the likelihood is easily maximised. Age and sex data, although easily incorporated, are not specifically required and the method appears successful on reasonably large pedigrees (one example had 69 individuals) provided there is sufficient kinship structure. As with the sequential approach described above, however, this algorithm assumes what the author called a “complete” sample in which parents of each individual are either included in the sample themselves or else are unrelated to any other individuals in the sample. Knowledge about mating patterns can also be incorporated in likelihood approaches to pedigree reconstruction. For human populations, Thompson (1976a) noted that social prohibitions on polygamy, for example, are usually restricted to concurrent, rather than consecutive matings: age distributions of the relevant offspring groups can be used to distinguish between the two. Prodohl *et al.* (1998) used natural history information, including lactating status of females and spatial positioning of parents and offspring to refine parentage inferences from genetic likelihoods on a population of armadillos.
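The generation-gap restrictions quoted above lend themselves to a simple check. The following is a minimal sketch, assuming that ages are given in years and that sex is coded as "F" or "M"; the function and constant names are illustrative only and are not taken from Thompson (1976a) or from any software.

```python
# Bounds (in years) by which a parent must be older than an offspring,
# as quoted in the text for the sequential algorithm.
MOTHER_GAP = (15, 50)
FATHER_GAP = (15, 75)


def plausible_parent(parent_age: float, child_age: float, parent_sex: str) -> bool:
    """Return True if the age difference permits a parent-offspring link.

    parent_sex is "F" for a candidate mother and "M" for a candidate father
    (a coding assumed here purely for illustration).
    """
    if parent_sex not in ("F", "M"):
        raise ValueError("parent_sex must be 'F' or 'M'")
    low, high = MOTHER_GAP if parent_sex == "F" else FATHER_GAP
    gap = parent_age - child_age
    return low <= gap <= high


# A 70-year-old is excluded as the mother of a 10-year-old (gap of 60 years)
# but remains admissible as the father.
print(plausible_parent(70, 10, "F"), plausible_parent(70, 10, "M"))
```

A check of this kind acts as hard prior information: it removes candidate links before any likelihood is computed.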

The term “prior” immediately introduces the notion of Bayesian reasoning which has had a very mixed reception in legal circles. Indeed, the UK Court of Appeal appears to have ruled Bayes' Theorem to be inadmissible as evidence. (See Balding (2005) for a recent overview.) The strength of the Bayesian argument lies in its facilitation of discussion via probabilistic statements rather than hypothesis-testing. Despite its obvious relevance, there has not been much discussion of formal use of prior information in forensic applications, although attention has been given to the consideration of mutation rates and the deficiencies associated with a single alternative hypothesis (Evett & Weir, 1998; Dawid *et al.* 2001; Vicard & Dawid, 2004; Balding 2005). However, despite the reluctance to use Bayesian inference in these settings, Essen-Möller's well-known *W* or “Wahrscheinlichkeit” statistic has a straightforward Bayesian interpretation. Consider the two standard hypotheses in paternity cases:

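$$H_1: \text{the alleged father is the true father}, \qquad H_2: \text{a random man is the true father},$$

where by “random” we mean that the individual's genes are randomly drawn from the population gene pool. The “W” statistic is defined as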

$$W = P(H_1 \mid \text{data}) = \frac{LR}{LR + 1} \tag{1}$$

where *LR* is the relevant likelihood ratio or *paternity index*

$$LR = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_2)} \tag{2}$$

Although a Bayesian argument was not explicitly used, Essen-Möller (1938) was aware that equality in (1) assumed equally likely competing hypotheses. Practice differs between forensic laboratories as to whether they report the paternity index, or W, or both (Egeland & Mostad, 2002). It has been argued that although both contain the same information, the interpretation of *W* as the (posterior) *probability of paternity* (i.e. posterior probability of *H*_{1}) may make it less abstract and less open to misinterpretation than a likelihood ratio (Hummel, 1984). Nonetheless, it is often avoided in practice, due to the indirect assumption of equally likely hypotheses (Evett & Weir, 1998).
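The link between the paternity index and *W* can be made concrete with a short calculation. The sketch below assumes the usual form *W* = *LR*/(*LR* + 1) for equal prior probabilities and generalises it to an arbitrary prior probability of *H*_{1}; the function name and the numerical values are hypothetical.

```python
def paternity_probability(lr: float, prior: float = 0.5) -> float:
    """Posterior probability of H1 given the paternity index lr.

    prior is the prior probability of H1; the default of 0.5 corresponds to
    equally likely hypotheses, in which case the result is LR / (LR + 1).
    """
    if lr < 0:
        raise ValueError("the likelihood ratio must be non-negative")
    if not 0 < prior < 1:
        raise ValueError("the prior probability must lie strictly between 0 and 1")
    posterior_odds = lr * prior / (1 - prior)
    return posterior_odds / (1 + posterior_odds)


# The same paternity index gives different probabilities of paternity
# under different (hypothetical) prior probabilities.
print(paternity_probability(99.0))             # equal priors
print(paternity_probability(99.0, prior=0.1))  # sceptical prior
```

The second call illustrates why reporting *W* alone carries an implicit assumption about the prior.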

The standard analysis in which only the likelihood ratio, or paternity index, is considered is restricted to the comparison of pairs of alternatives and the “random man” alternative hypothesis, in particular, suffers from well-known deficiencies (Goldgar & Thompson, 1988; Balding, 2005). Moreover, as has been noted by several authors, these likelihood ratios may differ dramatically for different choices of alternative hypotheses and it is not obvious how to summarise calculations for different such pairs (Egeland *et al.* 2000). Despite the fact that it is usually the likelihood ratio that is calculated, the questions that are asked in many forensic applications often expect (and mistakenly interpret) an answer in the form of a probability statement. (This is related to discussions in Evett & Weir (1998) on the “transposed conditional” problem.) If paternity probabilities are required, proper posterior probabilities must be calculated and prior distributions must hence be specified. Consider the completely general case with *n* competing hypotheses *H*_{1}, …, *H*_{n} having prior probabilities π_{1}, …, π_{n}, respectively. Note that the original definition in (1) is equivalent to taking *n* = 2 and assigning the values π_{1} = π_{2} = 1/2 corresponding to a flat prior. Let *L*_{i}≡*P*(*data* |*H*_{i}). By Bayes' Theorem, the posterior probability of *H*_{i} is

$$P(H_i \mid \text{data}) = \frac{\pi_i L_i}{\sum_{j=1}^{n} \pi_j L_j} \tag{3}$$

Although posterior probabilities are more meaningful, pairwise comparisons can still be made for standard paternity analyses using posterior probability ratios. Note that the interpretation of such ratios raises issues similar to those discussed for likelihood ratios: not only are they restricted to pairs of alternatives but they also depend on the choice of prior. For any pair of hypotheses, *H*_{i} and *H*_{j}, we hence have that

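$$\frac{P(H_i \mid \text{data})}{P(H_j \mid \text{data})} = \frac{L_i}{L_j} \times \frac{\pi_i}{\pi_j} \tag{4}$$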

expressing the posterior probability ratio on the left hand side as the product of the likelihood ratio, *L*_{i}/*L*_{j}, and the prior ratio, π_{i}/π_{j}. When there are only two competing alternatives, (4) is popularly known as the *odds form of Bayes' Theorem* (Evett & Weir, 1998). It is clear from the above representations that the likelihood calculations can be considered separately from the prior probability assignments. From a practical viewpoint, externally assessed prior information can hence be incorporated into an analysis where existing software is used to calculate the relevant likelihood ratios (Mortera, Dawid & Lauritzen, 2003).
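The separation of likelihood calculations from prior assignments can be sketched in a few lines. The example below assumes that the likelihoods *L*_{i} have already been obtained elsewhere (for instance from existing software) and simply applies Bayes' Theorem; the function name and all numerical values are hypothetical.

```python
def posterior_probabilities(priors, likelihoods):
    """Posterior probabilities of n competing hypotheses by Bayes' Theorem.

    priors[i] is the prior probability of hypothesis i and likelihoods[i]
    is P(data | hypothesis i), computed separately.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one hypothesis needs positive prior and likelihood")
    return [w / total for w in weights]


# Hypothetical priors and likelihoods for three competing pedigrees.
priors = [0.5, 0.3, 0.2]
likelihoods = [1e-6, 4e-6, 1e-7]
post = posterior_probabilities(priors, likelihoods)

# The posterior ratio equals the likelihood ratio times the prior ratio.
likelihood_ratio = likelihoods[0] / likelihoods[1]
prior_ratio = priors[0] / priors[1]
print(post[0] / post[1], likelihood_ratio * prior_ratio)
```

Because the normalising sum cancels in any ratio, the pairwise comparison never requires the full set of hypotheses to be evaluated.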

Many authors have taken a Bayesian approach to the problem of relationship estimation. Gill *et al.* (1994) addressed the identification of the remains of the Romanov family where aristocratic origins were implied by indications of gold, platinum and porcelain dental work in some of the bodies. They showed how this piece of evidence can be combined with mtDNA using the odds form of Bayes' Theorem, as described above. A likelihood-based approach to the problem of confirming pairwise relationships in sib pairs prior to conducting a linkage analysis was considered by Göring & Ott (1997). The aim was to increase the power to detect linkage by eliminating false sib-pairs. The focus is thus on distinguishing between sibs, half-sibs and unrelated individuals, as these are argued to be the most likely alternatives, given the reasons for which they were recruited. Prior probabilities are assigned to the three types of relationship based on knowledge of laboratory error rates and population rates of non-paternity and adoption. Posterior probabilities based on these priors and the likelihoods from the genetic markers are then calculated using Bayes' Theorem. Thompson & Meagher (1987) placed prior probabilities on specific relationships between pairs of individuals for inferring parentage. Neff, Repka & Gross (2001) considered a Bayesian approach to calculating “expected” rather than most likely parentage, by incorporating additional biological information, such as behavioural observations during mating, in a prior distribution on parentage vectors. They demonstrated that assuming this prior distribution to be uniform can lead to very misleading results. Goldgar & Thompson (1988) considered the standard paternity-testing problem using a Bayesian interval estimation approach. 
They reformulated the problem as one of estimating the genetic relationship of the putative father (i.e. the tested individual) to the true biological father of the child, and so avoided the usual interpretational problems associated with the standard paternity index. A beta prior probability distribution is placed on the coefficient of relationship (Wright, 1922) between these two individuals and a posterior interval estimate obtained using numerical integration.

### 3 The Sample Space and the Prior


Since our focus is on finding the most likely pedigree connecting a set of “observed” individuals for whom we will typically have DNA marker data, sequential hill-climbing approaches that could stick at local maxima are not ideal. Moreover, we do not want to be restricted to the assumption that parents of each individual in the sample are either included themselves or else are unrelated to any other individual in the sample. The desire to consider many alternatives is also a complicating feature for our applications.

We would thus wish to consider the correct set of alternative hypotheses by considering all possible pedigrees but, as has been noted above, this could be a formidable task. A less specific alternative to the “random man” hypothesis for the paternity example above would be: “Some other man is the father”. This, if taken literally, corresponds to an impractically large set of alternatives. In reality, of course, the set of alternative fathers for a specific individual is far from infinite and one important use of prior information is to reduce this set to a manageable size by excluding implausible alternatives. Thompson (1976a) recommended exclusions based on particular demographic features, provided the aim of the reconstruction is not to investigate any aspects of such features. We will distinguish between *hard* prior information that we will know with certainty, and *soft* prior information about which we are only willing to make probabilistic statements. While the latter can favour or downweight particular features, the former can be used to restrict the set of possible alternatives thus making the consideration of “all possible alternatives” a realistic option. Note that the success of the approach presented here depends on being able to generate this sample space of alternatives.

#### 3.1 The sample space

It is also helpful to distinguish between *global* prior information relating to general knowledge about the population or species in question, such as information on breeding patterns, mating behaviour, average numbers of offspring, cultural prejudices etc. and *local* prior information which is particular to the application at hand and relates to specific parts of the pedigree or specified individuals. Whatever algorithm is used to generate the sample space, many structures will be automatically created which are unlikely candidates for various reasons. For instance, many inbred pedigrees will arise which may not be appropriate for a specific human application and may have a high likelihood if not penalised in some way, especially if rare alleles have been observed. Social and breeding patterns vary widely from one species to another and from one subpopulation to another within a species. Individuals can be categorised as “adults” or “juveniles” either according to their ages or information on whether they have offspring or not. They are deemed to be adults (and hence parental candidates) in the absence of suitable information to the contrary. Sometimes, hard local and global information can combine to significantly reduce the number of possibilities. For example, there are 6720 possible pedigrees comprising two males and three females: declaring one female as a juvenile reduces the number to 2817 whereas if one male is known to be a juvenile, it reduces further to 2128 possibilities. Knowledge on generation gap, such as the bounds on maternal and paternal age differences suggested by Thompson (1976a) (Section 2) would give a further reduction.

For many applications, pedigrees involving “unobserved” individuals, i.e. individuals that are not in the original group and for whom we have no genetic data, will be required. For example, for a group of four individuals to comprise a female and her three offspring by the same male, the unknown male must be included in order to describe the relationship, even though there is no other information on him. Several additional individuals might be required to describe more distant relationships. In a sequential approach to a single reconstruction, such additional individuals as described above can be incorporated as unobserved founders and represented simply by the genes they contribute which are assumed to be randomly drawn from the population gene pool. Although we might sometimes wish to hard-wire their relationships with observed individuals, we might not necessarily wish to restrict them to founding positions. Besides, the possibility that these additional individuals could connect the observed set in another way may also be of interest. In this case, they are labelled and added to the observed set and all plausible pedigrees containing the originals plus the extra unobserved ones will be considered, so creating a (considerably) enlarged sample space. An inheritance claim would be an example where this might be relevant. Inheritance laws vary from country to country and will only accept relationships up to a particular degree in order to honour the claim. The number of extra individuals required to cover all possible acceptable relationships between the claimant and the deceased poses an interesting question. 
Note that there is an upper limit to the degree of relationship that can be detected by any approach based on identity-by-descent (as is implicit in the likelihood calculations here) since the probability that two related individuals do not inherit *any* autosomal DNA from their closest common ancestors increases rapidly with increasing degree of relationship (Donnelly, 1983).

All possible pedigrees comprising the final set of individuals can be generated, for example, by listing all possible parent-offspring links and incorporating various consistency checks to ensure that this is a valid set of structures. Hard local and global prior constraints should be introduced at this stage to reduce this set. Realistically, it will not always be sensible to list all reasonable pedigrees at the outset. Enumeration and evaluation of pedigrees could take place sequentially with an algorithm for efficient retrieval of structures with high likelihood and posterior probabilities at the end of the process. For large problems, Markov chain Monte Carlo methods could be considered to sample from this space. Some classification of subspaces of pedigrees might also be appropriate which would enable exploration of the sample space via the separate sub-classes.
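The enumeration strategy described above can be sketched for very small problems. The code below is an illustration under stated assumptions, not the algorithm used by any particular software: each individual has at most one father and one mother within the set, `None` denotes a parent outside the set, juveniles cannot be parents, and the only consistency check is that nobody is their own ancestor. Counts produced under these conventions need not agree with those quoted in the text, which may rest on different definitions.

```python
from itertools import product


def has_cycle(pedigree):
    """Return True if some individual is their own ancestor."""
    for start in pedigree:
        stack = [p for p in pedigree[start] if p is not None]
        visited = set()
        while stack:
            current = stack.pop()
            if current == start:
                return True
            if current in visited:
                continue
            visited.add(current)
            stack.extend(p for p in pedigree[current] if p is not None)
    return False


def enumerate_pedigrees(sexes, juveniles=frozenset()):
    """List all pedigrees as dicts mapping child -> (father, mother).

    sexes maps each label to "M" or "F"; juveniles is a set of labels that
    are excluded as parental candidates (hard local prior information).
    """
    people = sorted(sexes)
    adults = [p for p in people if p not in juveniles]
    options = []
    for child in people:
        fathers = [None] + [p for p in adults if sexes[p] == "M" and p != child]
        mothers = [None] + [p for p in adults if sexes[p] == "F" and p != child]
        options.append(list(product(fathers, mothers)))
    pedigrees = []
    for combo in product(*options):
        pedigree = dict(zip(people, combo))
        if not has_cycle(pedigree):
            pedigrees.append(pedigree)
    return pedigrees


# One male and one female: unrelated, m is the father of f, or f is the mother of m.
pair = {"m": "M", "f": "F"}
print(len(enumerate_pedigrees(pair)))                   # 3
print(len(enumerate_pedigrees(pair, juveniles={"f"})))  # 2
```

Exhaustive listing of this kind grows very quickly with the number of individuals, which is why hard constraints are best applied while the space is generated rather than afterwards.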

#### 3.2 The prior function

Once the sample space for *n* individuals (some of whom may not be observed) has been determined, a prior probability *Pr*(*g*) is assigned to each pedigree *g* of the following form:

$$\Pr(g) = c \prod_{i=1}^{s} M_i^{\,b_i(g)} \prod_{j \neq k} R_{jk}^{\,o_{jk}(g)} \tag{5}$$

where

$$c = \left[ \sum_{g'} \prod_{i=1}^{s} M_i^{\,b_i(g')} \prod_{j \neq k} R_{jk}^{\,o_{jk}(g')} \right]^{-1},$$

with the sum taken over all pedigrees *g′* in the sample space, is the normalisation constant. *M*_{1}, …, *M*_{s} are non-negative *global* parameters that allow pedigrees to be weighted according to *s* specified characteristics. The integer exponent *b*_{i}(*g*) corresponding to *M*_{i} provides a particular measure of that characteristic, is internal to pedigree *g* and thus provides the degree of the relative weightings of different pedigrees for the *i*^{th} characteristic. For example, if the *i*^{th} characteristic is inbreeding, *b*_{i}(*g*) would be a measure of the extent to which pedigree *g* is inbred according to how inbreeding has been defined. Thus, *b*_{i}(*g*) = 0 might mean that *g* has no inbreeding and increasing values of *b*_{i}(*g*) might correspond to increasing levels of inbreeding. The non-negative R-parameters are for local specifications and, as given in (5), allow incorporation of prior information on parent-offspring links. We define *o*_{jk}(*g*) = 1 if *j* is the parent of *k* in pedigree *g* and *o*_{jk}(*g*) = 0 if *j* is *not* the parent of *k*. For consistency, we must have *o*_{jk}(*g*) +*o*_{kj}(*g*) ≤ 1 for *j*≠*k* in any pedigree *g*.

If we set

$$M_i^{\,b_i(g)} = \begin{cases} 1 & \text{if } b_i(g) = 0, \\ 0 & \text{if } b_i(g) > 0, \end{cases}$$

only pedigrees, *g*, with *b*_{i}(*g*) = 0 will be allowed. This is equivalent to setting *M*_{i}= 0 and defining 0^{0}≡ 1 and, in the example above, would eliminate all inbred pedigrees. If *M*_{i}= 1, all pedigrees with feature *i* receive equal weighting from the prior regardless of their respective *b*-values. Setting all M-parameters to 1 amounts to assigning a flat prior, whereby all generated pedigrees have equal probability a priori and there is no penalty associated with any particular feature. A value between 0 and 1 decreases the prior probability of the relevant characteristic while a value exceeding 1 increases it.

The local parameters *R*_{jk} can be interpreted analogously. Thus *R*_{jk}= 0 rules out the pedigrees featuring *j* as a parent of *k*, while *R*_{jk} > 1 would give favourable weighting to such structures. For example, if it is believed that individual 1 is the parent of 2, we would set *R*_{12} to a value larger than 1 to favour those structures, *g*, with *o*_{12}(*g*) = 1. Additional information revealing 2 to be a juvenile would allow us to exclude all possibilities where 2 is the parent of 1 so *R*_{21}= 0 disallows all structures *g* where *o*_{21}(*g*) = 1. Information on three-way relationships might sometimes be available and appropriate parameters could, in principle, be defined. However, parent-offspring links are probably the most useful in practice as the consistency checks become more complicated with increasing orders since three-way specifications must concur with all the pairwise ones and so on.

Note that while both global and local parameters, as defined here, can exclude, downweight or favour certain characteristics, they can never assign certainty in the sense that *only* particular structures will be considered. Hard prior information in the form of certainty such as “definitely inbred” or “1 *is* the mother of 2” should always be incorporated at the outset when the sample space is generated. From a practical viewpoint, although information such as age and sex of individuals can naturally be modelled as local specifications, either implicitly in the calculation of *o*_{jk}(*g*) or explicitly as indicator variables, such information is often better employed to restrict the set of possible pedigrees down to a manageable size.
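The behaviour of the global and local parameters can be explored numerically. The sketch below assumes that the prior weight of a pedigree is the product of each *M*_{i} raised to *b*_{i}(*g*) and of *R*_{jk} for every parent-offspring link present in *g*, normalised over the sample space; the data layout, function names and the three toy pedigrees are hypothetical.

```python
def unnormalised_prior(b_values, links, M, R):
    """Prior weight of one pedigree before normalisation.

    b_values[i] is b_i(g); links is the set of (parent, child) pairs with
    o_jk(g) = 1; M is the list of global parameters; R maps (parent, child)
    to R_jk, with unspecified links taking the neutral value 1.
    """
    weight = 1.0
    for m_i, b_i in zip(M, b_values):
        weight *= m_i ** b_i  # Python evaluates 0 ** 0 as 1
    for link in links:
        weight *= R.get(link, 1.0)
    return weight


def prior_distribution(pedigrees, M, R):
    """Normalise the weights over the whole sample space."""
    weights = {name: unnormalised_prior(g["b"], g["links"], M, R)
               for name, g in pedigrees.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the parameters exclude every pedigree")
    return {name: w / total for name, w in weights.items()}


# Hypothetical sample space with one global characteristic (inbreeding).
space = {
    "g1": {"b": [0], "links": {(1, 2)}},  # 1 is a parent of 2, not inbred
    "g2": {"b": [1], "links": {(1, 2)}},  # same link, one inbred offspring
    "g3": {"b": [0], "links": {(2, 1)}},  # 2 is a parent of 1
}

flat = prior_distribution(space, M=[1.0], R={})
informed = prior_distribution(space, M=[0.1], R={(1, 2): 2.0, (2, 1): 0.0})
print(flat)
print(informed)
```

Setting every parameter to 1 recovers the flat prior, while *R*_{21} = 0 removes the pedigree in which 2 is a parent of 1 without any change to the sample space itself.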

#### 3.3 Interpreting the prior function

The general prior function in (5) potentially includes a large number of parameters. However, for many problems, most of these will be set to unity so it is not as cumbersome in practice as it might appear. Choosing non-trivial values for the *M* and *R* parameters is nevertheless not straightforward. For an animal population with very high levels of inbreeding, an M-parameter greater than 1 would be appropriate, although it is not always obvious what value would be most relevant to the application and some experimentation with different values would be required. A straightforward interpretation of the prior is provided by considering a standard situation where comparison between two specific pedigrees, *g*_{1} and *g*_{2}, is of interest via the posterior probability ratio. From Equation (4), the corresponding ratio of the prior probabilities for *g*_{1} and *g*_{2} is the amount by which the likelihood ratio obtained from the DNA data should be adjusted by the prior, or non-DNA, information (see Section 4).
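Written out, this adjustment takes the following form. The notation is an assumption made for illustration: *L*(*g*) denotes the likelihood of the DNA data under pedigree *g* and π(*g*) the prior probability of *g*. The second identity assumes the multiplicative form of (5) with all parameters strictly positive.

```latex
\frac{P(g_1 \mid \text{data})}{P(g_2 \mid \text{data})}
  = \frac{L(g_1)}{L(g_2)} \times \frac{\pi(g_1)}{\pi(g_2)},
\qquad
\frac{\pi(g_1)}{\pi(g_2)}
  = \prod_{i} M_i^{\,b_i(g_1) - b_i(g_2)}
    \prod_{j,k} R_{jk}^{\,o_{jk}(g_1) - o_{jk}(g_2)} .
```

For instance, if *g*_{1} and *g*_{2} differ only in that *g*_{2} contains one additional inbred offspring, and the inbreeding parameter is set to 0.1, the prior ratio is 0.1^{-1} = 10, so the likelihood ratio for *g*_{1} against *g*_{2} is multiplied by 10. The normalising constant of the prior cancels in the ratio and never needs to be computed for such pairwise comparisons.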

The particular characteristics that may be of interest can often be defined in several ways and this will affect the interpretation of the prior. For instance, the software Familias has a built-in prior with three pre-specified global parameters corresponding to inbreeding, promiscuity and generation number. Promiscuity is measured in terms of departure from monogamy and *b*_{P}(*g*) is defined as the number of pairs of offspring in pedigree *g* with one common parent (i.e. half-siblings) who is also in *g*. Figure 1 depicts two quite different situations for which this promiscuity b-value would be identical. One might wish to distinguish between the overall number of departures from monogamy and the degree of polygamy in some applications. The value of *b*_{I}(*g*) provides some measure of the degree of inbreeding in pedigree *g*, and is defined as the number of offspring in pedigree *g* for whom both parents are represented in the pedigree and who themselves have a common ancestor also present in the pedigree (Egeland *et al.* 2000). Alternatively, one could allow for some inbreeding, while excluding unacceptable levels, by calculating kinship coefficients for each marriage pair and excluding pedigrees with maximum kinship exceeding some pre-specified limit.
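The two definitions above can be made concrete with a short sketch. It is an illustration only and not the Familias implementation: the pedigree is represented, hypothetically, as a dictionary mapping each child to a (father, mother) pair, with `None` for a parent who is not in the pedigree. Two further assumptions are made: a pair of offspring counts as half-siblings when they share exactly one parent who is present in the pedigree, and a parent is included in its own lineage so that parent-offspring matings count as inbreeding.

```python
from itertools import combinations


def parents_in(ped, x):
    """Parents of x that are present in the pedigree (founders have none)."""
    return {p for p in ped.get(x, (None, None)) if p is not None}


def b_promiscuity(ped):
    """Number of pairs of offspring sharing exactly one parent present in ped."""
    return sum(
        1
        for x, y in combinations(ped, 2)
        if len(parents_in(ped, x) & parents_in(ped, y)) == 1
    )


def ancestors(ped, x):
    """All ancestors of x present in the pedigree, excluding x itself."""
    found, stack = set(), list(parents_in(ped, x))
    while stack:
        a = stack.pop()
        if a not in found:
            found.add(a)
            stack.extend(parents_in(ped, a))
    return found


def b_inbreeding(ped):
    """Number of offspring whose two parents share an ancestor in ped."""
    count = 0
    for child in ped:
        pars = parents_in(ped, child)
        if len(pars) == 2:
            f, m = pars
            # Assumption: each parent belongs to its own lineage.
            if (ancestors(ped, f) | {f}) & (ancestors(ped, m) | {m}):
                count += 1
    return count


# One father with three partners versus three fathers with two partners each.
star = {"c1": ("F", "M1"), "c2": ("F", "M2"), "c3": ("F", "M3")}
spread = {
    "a1": ("F1", "M1"), "a2": ("F1", "M2"),
    "b1": ("F2", "M3"), "b2": ("F2", "M4"),
    "d1": ("F3", "M5"), "d2": ("F3", "M6"),
}
# K is the child of full siblings P1 and P2.
sib_mating = {"P1": ("G1", "G2"), "P2": ("G1", "G2"), "K": ("P1", "P2")}
```

Here `star` and `spread` both give a promiscuity b-value of 3 although the mating patterns are quite different, which is the kind of ambiguity discussed above; these toy pedigrees are not claimed to be the ones shown in Figure 1. The pedigree `sib_mating` gives an inbreeding b-value of 1.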

While a global M-parameter or a local R-parameter value of greater than 1 will favour pedigrees with a particular characteristic, whether or not it will distinguish between pedigrees exhibiting differing degrees of that characteristic will depend on how the pedigree-specific exponents (b-values) are defined (as shown in Figure 1, for example). For instance, suppose we wish to impose an upper bound on litter or sibship size and we define the relevant index *b*(*g*) to be 1 or 0 according to whether the maximum number of offspring of mating pairs exceeds this limit or not. Pedigrees with all sibships of acceptable—but equal—size will receive the same prior weighting as more realistic structures. Extra penalties would have to be imposed to make such distinctions. However, some care has to be taken here due to the multiplicative form of the prior (5). Whether or not a pedigree is inbred, for instance, will typically not be independent of specific parent-child relationships. This would undoubtedly become more problematic if there were several parameters relating to the same feature (e.g. sibship size), as suggested above. However, the dependencies between desirable global and local features of a pedigree are not easy to model.
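The effect of an indicator-type b-value can be seen in a small sketch of the sibship-size example. The representation (child mapped to a (father, mother) pair), the function names, the limit of 6 and the parameter value 0.01 are all hypothetical choices made for illustration.

```python
from collections import Counter


def max_sibship_size(ped):
    """Largest number of offspring shared by any one mating pair."""
    # Only count children for whom both parents are in the pedigree.
    sizes = Counter(pair for pair in ped.values() if None not in pair)
    return max(sizes.values(), default=0)


def b_sibship(ped, limit):
    """Indicator b(g): 1 if some mating pair exceeds the limit, else 0."""
    return 1 if max_sibship_size(ped) > limit else 0


def make_family(n):
    """Toy pedigree: a single couple (F, M) with n children."""
    return {f"c{i}": ("F", "M") for i in range(n)}


M_sibship = 0.01   # strong penalty for exceeding the limit
limit = 6

weights = {
    n: M_sibship ** b_sibship(make_family(n), limit) for n in (2, 5, 8)
}
```

Families of size 2 and 5 both receive weight 1 because the indicator cannot see differences below the limit, while the family of size 8 is downweighted to 0.01. Distinguishing between the first two would require a count-type b-value or an additional parameter, with the caveats about the multiplicative form noted above.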

This prior function is mathematically convenient, straightforward to implement and is particularly tractable in the consideration of ratios for standard forensic analyses, as we noted in Section 3.2. It is also extremely flexible and, although far from perfect, provides a simple approach to the integration of essential non-DNA information. It also includes many current approaches as special cases. The multiplicative prior function for a parentage vector described in Neff *et al.* (2001) is one such candidate, being a product over potential parentage assignments and different biological traits. Likewise, the prior function of Göring & Ott (1997) could be incorporated in the form of three global parameters, one for each of the three relationships considered, whereas that of Thompson & Meagher (1987) takes the form of a local specification as it relates to specific pairwise relationships. Priors on parameters, such as the beta prior on the relationship parameter of Goldgar & Thompson (1988), do not fall into this framework if they cannot be interpreted as a prior distribution on pedigree structures. Parametric downweighting, however, such as the negative quadratic support on sibship size mentioned by Thompson (1976a), can be easily incorporated. Note that a similar effect can be achieved by assigning a value between 0 and 1 to the relevant global *M* parameter. Assessment of the number of offspring in the pedigree, as suggested by Thompson (1986), can be achieved by relating numbers either to an individual or to a couple. One possibility is to let *b*(*g*) be the number of offspring for a marriage couple. Alternatively, let *b*(*g*) = 1 if the maximum number of offspring for an individual in the pedigree exceeds a specified limit and 0 otherwise. The corresponding *M* parameter can be used to downweight or favour alternatives in the usual way. Similarly, the restrictions on polygamy and polygyny of Thompson (1976a) can easily be incorporated as global parameters with appropriate definitions of the relevant *b* values, together with a local indicator variable for sex.

### 5 Discussion

For most relationship identification applications, we are rarely in a situation in which we know absolutely nothing besides the DNA marker data, so it makes sense to consider a Bayesian approach to the problem. For forensic applications, the necessity for such an approach is implicit in the results that are often required in practice, such as paternity probabilities (Section 2). Prior information is often used in practice but is frequently incorporated informally at an interim stage of an analysis, such as when a likelihood approach produces what is clearly an unfavourable answer. This paper stresses the importance of stating all relevant information at the outset so that it can be integrated as efficiently as possible in a formal and transparent way.

As noted by Thompson (1975), finding the most likely relationship amongst a set of individuals is not the same statistical problem as identifying the most likely individuals for a specific relationship. No matter how much information is available, the latter does not necessarily assign individuals to the true relationship, as the true relationship may not be among the alternatives considered. However, consideration of all possible pedigrees connecting the individuals of interest is a formidable (and sometimes impossible) task in general. The approach of Egeland *et al.* (2000) attempts this by brute force enumeration of all possible alternatives but is restricted to very small problems and has been used only for forensic science applications. Besides extending the prior function to incorporate any number of global features and local parent-offspring relationships, we have shown that hard prior information to which we can attach certainty (e.g. there is *no* inbreeding or *A* *is* the mother of *B*) can play a vital role in reducing the set of alternatives to a manageable size, thus making such an approach tractable. Alternatively, efficient ways of generating and exploring the search space could be investigated.

The prior function (5) of Section 3 defines a prior distribution on pedigrees, rather than on model parameters, and has obvious limitations: the multiplicative form may not always appear reasonable and there are no general guidelines for selecting values for the *M* and *R* parameters, besides the simple options of 0 and 1. The effects of any prior will be diluted in the presence of a lot of data but priors can potentially have heavy influence in our applications. Sensitivity to the choice of prior parameters should always be investigated for any particular application and a flat prior used if there is no other information. However, all priors have limitations. This prior has an advantage in being simple to extend and interpret. Moreover, many existing methods for incorporating prior information into relationship identification problems can be shown to be special cases or straightforward adaptations of this prior.
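A sensitivity check of this kind can be sketched in a few lines. The example assumes a single global parameter *M* and the multiplicative prior form; the likelihood values and b-values are made-up numbers chosen purely for illustration, not results from any data set, and the function name is hypothetical.

```python
def posterior(likelihood, b, M):
    """Posterior over pedigrees for one global parameter M.

    likelihood: {g: L(g)}  likelihood of the DNA data under each pedigree
    b:          {g: b(g)}  feature count for each pedigree
    Assumes at least one pedigree has positive likelihood times prior weight.
    """
    weights = {g: likelihood[g] * M ** b[g] for g in likelihood}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Made-up inputs: the inbred pedigree g2 is four times more likely from DNA.
likelihood = {"g1": 1.0, "g2": 4.0}
b = {"g1": 0, "g2": 1}

# Scan a grid of M values and record how the posterior for g2 responds.
sensitivity = {M: posterior(likelihood, b, M)["g2"] for M in (0.01, 0.1, 0.25, 1.0)}

for M, p in sensitivity.items():
    print(f"M = {M:<5} posterior probability of g2 = {p:.3f}")
```

With `M = 1.0` the flat prior leaves the likelihood comparison unchanged, while `M = 0.25` exactly offsets the likelihood ratio of 4 so that the two pedigrees become equally probable. If the conclusion changes across the range of values considered plausible, the result depends on the prior and this should be reported.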

Our setting differs from other approaches in the human genetics literature in several respects: we will typically not have genome-scan data for the applications we wish to focus on; we may not wish to assume that all parents are either in the sample or else unrelated to other individuals in the sample; we are not purely interested in pairwise relationships; and our interest lies in the true relationship rather than the best from a limited set of alternatives or a reasonable approximation. This approach is hence relevant to wildlife applications, where researchers have traditionally been slow to adopt existing methods based on genome-scan data because wildlife biologists usually work with small numbers of loci (Blouin, 2003). For instance, it should be routine to calculate the relationships among founders in captive breeding programs but there are very few published examples where this has been done. Likewise, there have been few attempts to use reconstructed sibships as a means of estimating the effective number of founders that contributed to a particular population. Other potential applications of this approach include estimation of relatedness amongst individuals in a case-control study and estimation of subgroups of related individuals from a large population-based biobank genetic association study, either to identify those likely to share longer haplotype blocks around disease susceptibility genes of interest or to construct pedigrees for a subsequent linkage analysis.

The fact that pedigree applications can be expressed as *Bayesian networks* (Lauritzen & Sheehan, 2003) permits an interpretation of relationship estimation as a Bayesian network (BN) learning problem with a lot of structural constraints. So far, the existing BN learning algorithms are not appropriate for these problems but one has to suspect that they might be adaptable. A first step in this direction is being made by Angelopoulus & Cussens (2005) in extending their work on defining probability tree-based priors on model structures using stochastic logic programs to pedigrees and then sampling from the posterior distribution via Markov chain Monte Carlo.

Many human applications of relationship estimation are concerned with error detection, which tends to be viewed as a separate problem despite the important overlap. There are two main types of error that can occur: pedigree errors, which are systematic and affect all loci, and genotyping errors, which are sporadic and arise for various reasons including data entry errors, gel misreading or mutation. Distinguishing between them is difficult when limited data are available: Mendelian inconsistencies can be a symptom of either and Mendelian compatibilities do not necessarily imply that both are absent. Despite various claims to the contrary, Mendelian consistency checking of pedigree information can be shown to be an NP-complete problem, and thus it is highly unlikely that popular existing algorithms such as those of O'Connell & Weeks (1999) and Abecasis *et al.* (2001), for example, are of polynomial complexity at worst (Aceto *et al.* 2004). This has obvious implications for the analogous problem of inferring pedigrees. Although forensic markers are generally well chosen, typing errors can be very common in other areas of application. It is possible to model genotyping errors in relationship estimation problems (Boehnke & Cox, 1997; McPeek & Sun, 2000; Sieberts *et al.* 2002; Sobel *et al.* 2002) and, as noted in Section 4, a mutation model can be interpreted in this light. This of course highlights the importance of the prior information in reducing the set of alternatives, as the genetic data will never eliminate an implausible option when an error model is included. However, the computational issues still have to be addressed. Combining the expertise in all the diverse areas of application is surely a first step.