Inference about complex relationships using peak height data from DNA mixtures

In both criminal cases and civil cases there is an increasing demand for the analysis of DNA mixtures involving relationships. The goal might be, for example, to identify the contributors to a DNA mixture where the donors may be related, or to infer the relationship between individuals based on a mixture. This paper introduces an approach to modelling and computation for DNA mixtures involving contributors with arbitrarily complex relationships. It builds on an extension of Jacquard's condensed coefficients of identity, to specify and compute with joint relationships, not only pairwise ones, including the possibility of inbreeding. The methodology developed is applied in a casework example involving a missing person, and simulation studies of performance, in which the ability of the methodology to recover complex relationship information from synthetic data with known `true' family structure is examined. The methods used to analyse the examples are implemented in the new KinMix R package, that extends the DNAmixtures package to allow for modelling DNA mixtures with related contributors. KinMix inherits from DNAmixtures the capacity to deal with mixtures with many contributors, in a time- and space-efficient way.


Introduction
This article is concerned with probabilistic genotyping methods for DNA mixtures based on unlinked autosomal short tandem repeat (STR) markers, under hypotheses about biological relationships among contributors to the mixture. We concentrate on so-called continuous methods using peak height information, as opposed to binary or semi-continuous methods, and adopt a formal probabilistic modelling approach. Mortera (2020) reviews this field, and the reader should turn to this reference for further background.
Probability models for peak heights are hierarchical, with two main layers: the genotype profiles n of the contributors, and the peak heights z recorded in the electropherogram; the models typically then consist of two components: (a) p(n) -the joint distribution for n -parameterised by population allele frequencies, hypotheses about the contributors, etc., and (b) p(z|n) -the conditional distribution for z given n -with parameters identifying the peak height model and the proportions of DNA from each of the contributors contained in the mixture.
Inference about DNA mixtures usually focusses on the comparison between two hypotheses H p and H 0 concerning the constitution of the mixture, quantified by the likelihood ratio LR for H p versus H 0 . In this article, these hypotheses will be about (arbitrarily complex) relationships between mixture contributors, and between contributors and other typed individuals. Simple examples of tests we can construct are • a paternity test given a child's genotype, where H p and H 0 respectively state that the putative father is a contributor to the mixture, or that no contributor to the mixture is related to the child, and • a test for whether contributors to a mixture, perhaps found at a crime scene, are related in a particular way (H p ) or not at all (H 0 ).
Throughout we use the probabilistic and computational formulation for DNA mixtures of Cowell et al. (2015), and the software implementation of this in the DNAmixtures package of Graversen (2013); the model emulates the PCR process that creates the electropherogram, and recognises artefacts including stutter, drop-out, drop-in and silent alleles. Our new model extensions are largely aimed at modelling the genotype profiles n to express complex relationships; the methods are implemented in a package KinMix (Green 2020a) that supplements DNAmixtures. Early ideas in this direction can be found in Green and Mortera (2017). Although implementation is restricted to this model and this computational environment, the ideas are quite general and could be adapted to other probabilistic genotyping systems.

Encoding relationships via IBD patterns
Under our simplified genetic model of unlinked autosomal STR markers, the sole source of relationship between individuals is identity by descent (IBD). This is the phenomenon that two genes may be identical because they are copies of the same ancestor gene, rather than being independent draws from the gene pool, so that the genotypes of two or more related actors will be positively associated. Given a pedigree, IBD is determined by the meioses generating the genes of each child given those of its parents (by Mendel's first law), and any autozygosity among founding individuals: the actual allelic values are not relevant to this. For two individuals, Table 1 of Thompson (2013) lays out all possible patterns; this table is reproduced below. Thompson credits this formulation to Nadot and Vaysseix (1973), although they do not use a tabular representation.

Coefficient of identity by descent
Two-person relationships are compactly summarised in numerical form using the coefficients of identity by descent (δ i ) and condensed coefficients of identity by descent (∆ i ) of Jacquard (1974) (chapter 6). The δ i are the probabilities for the 15 individual rows of Table 1, the ∆ i those of the 9 subsets of genotypically equivalent states, where we do not keep track of which parent donates which allele. It is these condensed coefficients of identity that we use to characterise and quantify relationships among mixture contributors. Where inbreeding is ruled out, ∆ i = 0 for i = 1, 2, . . . , 6, and we need only the κ coefficients of Cotterman (1940): κ 0 , κ 1 , κ 2 This provides the now generally accepted formulation of the nine IBD states on a pair of relatives due to Jacquard (1974). Despite the simplicity of the law of single-locus Mendelian segregation, computation of the probabilities of these nine state classes on an arbitrary pedigree remains a challenge. Methods based on extensions of Equation 1 to larger numbers of genes were developed by Karigl (1981), and the same approach provides methods for the computation of other probabilities of gene ancestry and gene extinction within defined pedigrees (Thompson 1983). For relationship between a pair of noninbred individuals, the IBD states are much simpler. The two individuals share 2, 1, or 0 genes IBD at any locus, with probabilities k 2 , k 1 , and k 0 , respectively (Table 1).
Inbreeding and kinship coefficients, and more generally probabilities of any IBD state, are expectations of random variables that indicate IBD at a given point in the genome. These random variables also have variance. Conceptually, the pedigree-based inbreeding coefficient of an individual may be thought of as the proportion of between-homolog IBD over descents within the same pedigree at an infinite number of unlinked loci. Different members of the population share some part of their ancestry, with resulting correlations in realized IBD. Within a given pedigree there are both positive and negative correlations affecting the variance of the IBD indicators. For example, consider only the descent from a maternal grandparental couple to a set of siblings. There is positive correlation in the maternal DNA received by the siblings due to their shared descent from grandparents to mother. There is negative correlation between the grandparents and also between the two homologs within each grandparent in their descent to the since each grandchild receives one and only on at a single locus.
As the examples in Sources of variance in scent show, within a finite population there is in the event of IBD (for example, autozygosit the probabilities of such events (for examp coefficients). In addition to the variation resul domness in meiosis and from the different a grees of individuals within a given population consider variation among replicate populatio under a given population process such as ra (Cockerham and Weir 1983). If f is the overal IBD between random gametes in the total col ulation replicates, the total variance is f(1 2 and Weir (1983) partition this total variance in within a population ðs 2 w Þ and that between po cates ðs 2 b Þ. The component s 2 b reflects the va among replicate populations due to genetic d the covariance in IBD within a population relati the larger the variance between, the greater within, relative to the total collection. If a sam viduals is taken from a population, their averag has expectation f and variance s 2 w =n þ s 2 b . A Cockerham and Weir (1983), increasing n d the component of variance due to replication of process.

Coalescent IBD and Ewens' sampling formula
At a point in the genome, IBD among a set relative to time t ago is most easily thought of coalescent ancestry (Kingman 1982). If IBD relative to a time point at which there we lineages, the n gametes are partitioned into As a function of the reference time t, the coale structure on the sequence of IBD partitions, alescent event can only merge two lineages. I of Figure 2A, the n = 6 gametes are partition groups, and the IBD partition is ((g, c, f), ( partition may be characterized by the num groups of size j, where n ¼ P j ja j , and k ¼ example, a 3 = a 2 = a 1 = 1. In terms of the time process, the coalescen backward from the present time, with the n events occurring between a random pair of li proportional to ℓ(ℓ 2 1)/2 when there are ℓ The process may equally be viewed forward coalescent event between a random pair amo ages (backward) corresponds to bifurcation of of the ℓ lineages (forward). The two processe distribution of time between events, but both distribution of tree topologies (Kingman 198 the same distribution of {a j }. The probability tree shapes generated by this random bifurcat process was considered by Harding (1971).
The two gametes of individual B 1 are denoted a and b, and the two gametes of B 2 are c and d. a The pattern is defined by the labeling developed by Nadot and Vayssiex (1973). b The total probability of each subset of genotypically equivalent states is given on the first row. For example, D 3 is the combined probability of states (11 12) and (11 21). c Note that Cotterman (1940) and some later authors use 2k 1 instead of k 1 for this probability.  Table 1 from Thompson (2013) are the probabilities that the two individuals share 0, 1 or 2 alleles by descent. For example, two half-siblings (with unrelated parents) have a relationship summarised by κ 0 = 0.5, κ 1 = 0.5, κ 2 = 0, while two children from an incestuous brother-sister mating are captured by ∆ = (0.06250, 0.03125, 0.12500, 0.03125, 0.12500, 0.03125, 0.21875, 0.31250, 0.06250). The κ and ∆ coefficients can be calculated from a pedigree with the functions kappaIBD and condensedIdentity, respectively, in the package ribd (Vigeland 2019b), part of the pedtools family of packages created by Vigeland (2019a).
For more than 2 individuals, Thompson (1974) seems to have been first to provide a general framework for gene identity given multiple relationships. She provides a rigorous algebraic formalism, with particular attention to enumerating the intrinsic symmetries in the problem, and counts the numbers of possible relationships, which increase very rapidly. For example, for 4 individuals, there are 712 possible (genotypically equivalent) patterns of IBD, reducing to 139 if inbreeding is ruled out. In typical pedigrees, only a very small fraction of these patterns have positive probability, so the vast majority of condensed coefficients of identity are 0. We use what amounts to a sparse representation of such vectors of coefficients, namely a listing of which coefficients are non-zero, and their values. We call this the IBD pattern distribution. These multi-person condensed coefficients of identity can be calculated efficiently from a pedigree using the method of Green and Vigeland (2019); see function pedigreeIBD in the R package KinMix.
We can represent each IBD pattern in various ways. One is by a vector of integer labels, of length twice the number of individuals, n, say, the pair in entries (2i − 1, 2i) representing the genotype of the corresponding individual i = 1, 2, . . . , n. The numerical value of the labels is irrelevant, all that matters is whether two labels are the same or different, so the vector denotes a partition of the 2n genes according to which are identical by descent. Since we are only concerned with unordered pairs of genes, the interpretation of the pattern is unchanged if elements (2i − 1, 2i) are exchanged, and also, of course, unchanged by any 1-1 relabelling. Diagrammatically, the pattern can be represented as a graph with 2n vertices laid out in a n × 2 rectangular array, and vertices connected by an arc if the corresponding genes are identical by descent. Both representations of IBD patterns were used (with n = 2) by Jacquard (1974) (chapter 6). Figure 3 and Table 2 show two examples of IBD pattern distributions in each of these representations.

IBD pattern distribution for a simple pedigree
As an illustration, consider a simple 'triple' of father F, mother M and child C, with the two parents unrelated. If we label the father's genes by (1,2) and those of the mother by (3,4), then the child will have one gene that is either 1 or 2, and another gene that is 3 or 4; thus its genotype is (1,3), (2,3), (1,4) or (3,4) with equal probability. In tabular form, we could write this family's genetic structure at any single autosomal locus in a table as in panel (a) of Table 1, where we have labelled the columns with the individual identities, and the rows with the corresponding probabilities.  Note that separately in each row, we can arbitrarily permute the actual labels, so the first row of Table 1(a) could have been (2,4,1,3,2,1) without changing the meaning; the purpose of the labels is solely to indicate which genes are identical (by descent) and which different. Since a genotype is an unordered pair of genes, the interpretation of the table is also unchanged if any of the individual pairs are transposed, so the first row could equivalently be written (4,2,1,3,1,2), for example. Combining these two rules, and aggregating identical rows, further economy of notation is possible: for example we could simply use the table in panel (b) of Table 1, to represent the same family, saving space and computer time. Effectively 1 and 2 are then being used for the child's paternal and maternal genes respectively.
The example can be extended, by, for example, including also the Father's father, GF. There are two equally likely possibilities: the gene inherited by Father from his father might be that labelled 1 or 3. So the relationships between the 4 individuals can now be represented by panel (c) of Table 1. For the simplest example demonstrating that pairwise relationships do not determine a full description of relatedness among more than two individuals, even in the absence of inbreeding, consider two scenarios in which among three individuals, each pair are full cousins, that is have (κ 0 , κ 1 , κ 2 ) = (0.75, 0.25, 0). This can arise in a 'star' arrangement, where the three have mothers who are full siblings, but unrelated fathers (or vice-versa, of course). In a 'cyclic' arrangement, each pair of cousins have between them parents of the opposite sex who are siblings, with the other parents unrelated. These two families have different overall relatedness, for example in the star arrangement, there is probability 1/16 that the three cousins have a common allele by IBD, while this is impossible in the cyclic arrangement. The two pedigrees are displayed in Figure 2. pr C1 C2 C3 pr C1 C2 C3 0.3750 1 2 3 4 5 6 0.421875 1 2 3 4 5 6 0.1875 1 2 1 3 4 5 0.140625 1 2 1 3 4 5 0.1875 1 2 3 4 1 5 0.140625 1 2 3 4 1 5 0.1875 1 2 3 4 3 5 0.140625 1 2 3 4 3 5 0.0625 1 2 1 3 1 4 0.046875 1 2 1 3 2 4 0.046875 1 2 1 3 3 4 0.046875 1 2 3 4 1 3 0.015625 1 2 1 3 2 3

Pairwise relationships do not determine joint relationships
The respective IBD pattern distributions are shown in Table 2 and visualised in Figure 3; the formats of each are described at the end of Section 2.1; in Section 7, we include a simulation experiment demonstrating the extent to which these pedigrees can be distinguished from DNA mixtures of STR markers with these family members as contributors.

IBD pattern distribution in the Iulius-Claudius pedigree
The Iulius-Claudius dynasty was the first Roman imperial dynasty, consisting of the first five emperors -Augustus, Tiberius, Caligula, Claudius, and Nero -and the family to which they belonged. They ruled the Roman Empire from its formation under Augustus in 27 BCE until 68 CE, when the last of the line, Nero, committed suicide. The name Iulius-Claudius dynasty refers to the two main branches of the imperial family: the gens Julia and the gens Claudia. Figure 4 presents the Iulius-Claudius pedigree. Some of the names have been abbreviated. Figure 5 shows on the left an excerpt (the most probable 9 patterns) of the IBD pattern distribution for 4 components of the pedigree, and on the right, the same for the three emperors Claudius, Caligula and Nero. As one can see from the figure, both Caligula and Nero share many alleles IBD. The probability that Caligula and Nero share two IBD genes, i.e. an IBD pattern (a, b) with a = b is 0.0537.
Caligula and Nero have inbreeding coefficients f C = 0.015625 and f N = 0.0390625. Germanicus and Agrippina Maior, Caligula's parents have probabilities of sharing none and one of their genes IBD equal to κ 0 = 0.9375 and κ 1 = 0.0625.

Likelihoods and Bayes nets
Under our universal assumption that we are using unlinked autosomal markers, and all individuals are drawn from a population in Hardy-Weinberg equilibrium, it is clear that genotypes n and peak heights z are independent across markers. Then the distributions p(n) and p(z|n) are products over markers -and therefore so are likelihood ratios computed from them. For a single marker, the likelihood for observed peak heights z is of course  regarded as a function of the parameters in the two model component distributions. Unless both the number of mixture contributors and the number of alleles are very small, computing this sum is a challenging task, often an intractable one. The key observation in Cowell et al. (2015), exploited in the DNAmixtures package, is that if genotypes n are encoded as allele count arrays (n ia ) giving the number of alleles of type a for individual i, and the joint distribution p(n) factorised into conditional distributions sequentially over a for each i, then p(n) has the structure of a Bayes(ian) net(work) (BN) with considerable sparsity. Computation of n p(n)p(z|n), which is the expectation over the BN distribution for n of the function p(z|n), is then exactly the kind of task performed by a BN probability propagation algorithm (Lauritzen and Spiegelhalter 1988). We follow the DNAmixtures formulation, including the allele count array representation of genotypes; for more computational detail see Graversen and Lauritzen (2015).
The IBD pattern distribution formulation helps to create methodology that considerably extends that described in Green and Mortera (2017). That paper introduced four approaches to adapting the Bayes net computation in DNAmixtures to deal with paternity testing (where the putative father is a contributor to the mixture, and with or without the mother's genotype profile being available in addition to the child's). Three of these: ALN (additional likelihood node), RPT (replace probability tables) and MBN (meiosis Bayesian network) involve modifying the Bayes net and are fast and essentially exact, the other, WLR (weighted likelihood ratio) is slow and approximate but easy to code.
It turns out that the RPT method is easiest to adapt to the case of relationships that are more complex or involve more contributors. All that we need to do is replace the default genotype conditional probability tables (CPTs), representing multinomial draws from the gene pool, by tables that encode assumed relationships and condition on any observed genotypes. Since all genotypes are determined by the values of founding genes and the meiosis pattern, so by including founding genes and the meiosis pattern as nodes in the Bayes net, the CPTs for the individual genotype arrays consist only of 0's and 1's.
Rather than define this process formally in the cumbersome and unilluminating algebra needed for generality, here we work through an example in detail. We begin by showing how the IBD pattern distribution can be used to generate random family genotypes with the required joint distribution, both because this is useful in its own right, for example in simulating cases for testing purposes, and because it helps to motivate how we can construct CPTs for allele count arrays.

Simulation of genotype profiles when contributors are related
Consider the 4-individual family in the last example, and take the array in panel (c) of Table 1. We see that the genotypes of the 4 individuals are determined by 5 founding alleles, labelled 1, 2, . . . ,5, and a binary variable selecting one or other row of the table. For the marker in question, suppose the allele values are denoted a = 1, 2, . . . , A, with frequencies (q a ). In fact, given 5 i.i.d. discrete random variables a 1 , a 2 , . . . , a 5 each with distribution (q a ), and a binary random variable s = 1, 3 with probabilities 1 2 , 1 2 , independent of the (a i ), we can write the 4 required genotypes as Fgt = (a 1 , a 3 ), Mgt = (a 2 , a 4 ), Cgt = (a 1 , a 2 ), and GFgt = (a s , a 5 ).
To see how this can be expressed in allele count array form, note that a variable x taking values in {1, 2, . . . , A} can be represented as a vector of A binary variables (x a ) A a=1 , with x a = δ xa (using the Kronecker delta). Further, we have P (x a = 1|x 1 , x 2 , . . . , x a−1 ) = q a = q a / a−1 b=1 q b if a−1 b=1 x b = 0, 0 otherwise. This describes the way that the founding genes are coded; for the genotypes, where the alleles counts n ia are 0, 1 or 2, we have simply that each count is deterministically the sum of two of the binary founding gene counts, appropriately chosen, rather than the conditional binomial distributions found in Cowell et al. (2015).

CPTs for related contributors
The case where no individuals are genotyped, so we are simply modelling family relationships among some or all of the mixture contributors is very straightforward. Again, we use the IBD pattern distribution directly. Continuing the example above, with the 4 individuals labelled i = 1, 2, 3, 4, the nodes of the required BN represent {n ia , i = 1, 2, 3, 4; a = 1, 2, . . . , A}, {m ja , j = 1, 2, 3, 4, 5; a = 1, 2, . . . , A}, {T ja , j = 1, 2, 3, 4, 5; a = 2, . . . , A − 1}, and s. Here n ia = 0, 1, 2 is the number of alleles a for individual i, m ja = 0, 1 the numbers of a for founding gene j, and the T are cumulative sums of the m (cumulative over a). The algorithm again proceeds recursively down the edges of the Bayes net, but at each node, instead of generating an allele count, we compute its conditional distribution in the form of a conditional probability table.
Figure 6: Graph of Bayes net for a mother/father/child trio, with 3 alleles.

Conditioning on typed relatives
The case where some individuals in the family have been typed is a little more complicated; again similar reasoning applies to both simulation of genotypes, and computations of their CPTs.
Consider the 4-individual family again, and suppose that the Father and Mother are mixture contributors, and the Child and Grandfather are typed, with genotypes (a,b) and (b,c) respectively, where a, b and c are distinct alleles.
Our construction is shown schematically in Table 3. Recalling that genotypes are unordered pairs of genes, we first expand the IBD pattern distribution table by explicitly laying out all possible permutations of the allele labels for the two typed individuals, giving equal probability to each. Matching these four allele labels onto their observed values a,b,b,c respectively allows us to map allele labels onto actual allele values, for each of the permuted patterns. It also reveals that some permuted patterns cannot generate the observed alleles. For example the first row is impossible, given the observed genotypes, since the allele label 1 cannot simultaneously map onto the distinct alleles a and b, while the second row is possible, with the mapping 1 → b, 2 → a, 5 → c. These mappings already determine some of the alleles of the Father and Mother, the mixture contributors. Those that are still not fixed are distinguished in Table 3
We now have all the information needed either to simulate Father and Mother genotypes, or to construct a Bayes net for the genotype allele count arrays of the Father and Mother, in each case conditional on the genotypes of the Child and Grandfather. The Bayes net 'parents' for each count node consist of one node indexing the permuted pattern, together with nodes indicating the values of the founding genes required. The probability distribution over the permuted pattern node is modified from the 'prior' (uniform) distribution by being conditioned on the typed genotypes, that is, it consists of the values of p(Cgt,GFgt) (see Table 3 in the case of our worked example), renormalised to sum to 1.
Pseudo-codes for both the simulation task and the CPT construction task are presented in Appendix 1.

Fgt
Mgt Cgt p(Mgt,Cgt) Fgt 1 2 3 4 1 3 0 1 2 4 3 1 3 0 1 2 3 4 3 1 0 It might be useful to point out that the approach we take to computing conditional genotype probabilities (as a crucial step on the way to delivering likelihood ratios) avoids any manual algebra, which is straightforward in simple cases but can be tedious error-prone otherwise. Of course, it obtains the same answer. To see this, consider the familiar example of paternity testing given both mother's and child's genotypes. For the case where these two genotypes are (a, b) and (b, c) respectively, where again a, b, c are all different, see Table 4, which is in exactly the same format as Table 3. Simple algebra tells us that the Father must donate the c allele, and that his other allele is drawn from the gene pool, and this is exactly the answer that Table 4 provides.

Ambient relatedness and uncertainty in allele frequencies
We have so far focussed on the role in modelling DNA mixtures of close relationships, specified through family structures. In this section, we will briefly touch on the different situation of what we will call ambient relatedness, that is where purportedly unrelated actors in fact have dependent genotypes because the population from which they are drawn exhibits high relatedness, for example through inbreeding. Just as with close relationships, these dependencies are driven by identity by descent, but the impact is somewhat different, both because it applies generally to the whole population, and because the dependence is usually substantially lower in magnitude.
In model-based inference from DNA profiles of STR markers, it has become routine to apply the 'θ correction' of Balding and Nichols (1994). The scalar parameter θ can be identified with Wright's measure of interpopulation variation F ST (Wright 1940;Wright 1951), and informally interpreted as the 'proportion of alleles that share a common ancestor in the same subpopulation'. As discussed by Balding and Nichols (1994), the parameter θ arises in various models for dependent populations, for example the 'island model' of partially-separated sub-populations. Green and Mortera (2009) observed that exactly the same probabilistic model for the joint distribution of multiple genes arises in a simple model for uncertainty in allele frequencies (UAF), in which the true allele frequencies are treated as unknowns with a Dirichlet distribution and the database used for calculation regarded as a multinomial sample from these true frequencies. The parameter α = (1 − θ)/θ is then the effective size of the database. Green and Mortera (2009) also observed that this model is amenable to implementation as a Bayes net, as an alternative to algebraic specification; this Bayes net actually encodes a Pólya urn scheme.

Ambient IBD for allele count arrays
In discussion of Cowell et al. (2015), both Tvedebrink (modelling kinship) and Green (modelling uncertainty in allele frequencies) observe that when genotypes are represented by allele counts arrays n ia , the number of alleles a of individual i, this Pólya urn scheme can be expressed through the recursion where n <i,a = i−1 j=1 n ja (etc.), and DM denotes the Dirichlet-Multinomial distribution. See Green (2015); Tvedebrink (2010). The Dirichlet-multinomial distribution is the straightforward generalisation of the Beta-binomial to more than 2 categories. It is a Dirichlet mixture of multinomial distributions. X ∼ DM(n, (α a ) A a=1 ) means so long as a x a = n.
The KinMix package includes functions for modifying genotype CPTs to model UAF and ambient IBD. Tvedebrink et al. (2015) give a fuller analysis, including a quantitative study of the implications. They show that relatedness/uncertainty in allele frequencies can increase or decrease LR's in identification tasks.

UAF and IBD together
Particular casework may involve both close relationships and uncertainty in allele frequencies (or ambient relatedness). Modelling such a situation combines elements from Sections 3.3 and 4.1. Full details are omitted, but the key algebraic manipulations are given in Appendix 2. The algorithmic implications are that Binomial distributions in the algorithms in Appendix 1 are replaced by Beta-Binomials, with the meiosis pattern as an additional parent at each instance of the Pólya urn. See also Cowell (2016) for a different analysis of this combined model.
At present, simultaneous modelling of close relationships and uncertainty in allele frequencies is not implemented in the KinMix package. Previous work with allele-presence data only (Green and Mortera 2009) and the work of Tvedebrink et al. (2015) show that the numerical difference in log-likelihood ratios due to uncertainty in allele frequencies is rarely important in scientific terms.

Setting parameters
The methodology of this article is based on the joint probability model p(n, z) = p(n)p(z|n) for the genotype profiles and peak heights, and this distribution has a number of parameters, notably the population allele frequencies, the relative proportions of the contributions of the different contributors to the mixture, and the parameters describing the PCR process and the artefacts of measurement embedded in the Cowell et al. (2015) peak height model. We do not prescribe any particular approach to setting these parameters when evaluating the likelihood, as this choice must be strongly influenced by regulation and practice in the judicial regime in which the analysis of the mixture is to be used, and the particular question that is being addressed.
Although Bayes nets are a key concept in the computations we use, this does not mean that a Bayesian formalism is intended. In fact the Cowell et al. (2015) model is presented as entirely frequentist, with the BN computations simply a device to compute an otherwise intractable likelihood function. The DNAmixtures package includes a function for maximising the likelihood as a function of the model parameters (the relative proportions of the contributions of the different contributors to the mixture, and the parameters describing the PCR process and the artefacts of measurement), and that function applies equally to a model modified using KinMix.
In principle a Bayesian analysis could be conducted, and the fast calculations of the likelihood that the packages provide would be an asset in implementing a Markov chain Monte Carlo (MCMC) sampler for posterior simulation, but we have not attempted this.
Considering now only maximum likelihood estimation (MLE), when we are comparing alternative models for mixtures, the question arises under which model(s) should the likelihood by maximised? The answer depends on context and perspective. In a likelihood ratio test of H p against H 0 , the respective likelihoods would each be maximised separately, and the ratio of the maximised values used as the test statistic in the usual practice in statistical methodology. However, when using a ratio of likelihoods as a measure of the weight of the evidence, we are not appealing to conventional statistical testing theory. In a criminal trial, depending on jurisdiction, custom might suggest or dictate choosing parameter values for both numerator and denominator that minimise the ratio, or perhaps maximise the denominator, in line with the presumption of innocence of the defendant (in dubio pro reo). In a civil case, say a paternity suit, the notion of defendant hardly applies.
Our experience has been that in many contexts, choice between these approaches makes little difference to the values of likelihood ratios, or their interpretations; we give a numerical example of this below. However, this cannot be a general conclusion; we anticipate, for example, that in comparing alternative hypotheses about the relatedness of mixture contributors to each other, parameter choice could have more impact.
Our numerical illustration revisits the Italian singer case, used for motivation in Green and Mortera (2017) (sections 4 and 5). This is a case of paternity testing, where the putative father is represented in the evidence only as a contributor to a mixture, assumed to be of 2 contributors. The child's genotype Cgt is available, and we presented weights of evidence for paternity, with and without using also the mother's genotype Mgt. The hypotheses are H p , that the major contributor to the mixture is the father, against H 0 , that neither contributor to the mixture is related to the child (or mother).  The modelling and methods introduced in this paper form the basis for an R package KinMix (Green 2020a), that extends the package DNAmixtures (Graversen 2013). As with DNAmixtures, therefore, the package relies on the Hugin system for probabilistic expert systems, accessed via the Rhugin package. The model for peak heights given genotypes p(z|n), together with the treatment of artifacts such as drop-out and stutter, are exactly as in Cowell et al. (2015). The KinMix package provides functions for constructing IBD pattern distributions from pedigree information, and using these pattern distributions to modify the default DNAmixtures genotype profile distribution p(n) (in which untyped individuals are assumed unrelated draws from a specified gene pool), to allow for related contributors.
KinMix also provides facilities from simulating genotype profiles from groups of individuals with specified joint relationships, making graphical displays of joint relationships, and many other utilities. In addition to the package manual pages in standard R format, a tutorial user guide is available as Green (2020b).
KinMix inherits from DNAmixtures the representation of genotypes via allele count arrays, which is key to saving both computation time and computer memory, and allows the modelling of mixtures with as many as 5 or 6 untyped individuals. Although building in relationships between individuals does increase both time and space requirements, examples in this paper demonstrate that quite complex problems can be considered. Under standard assumptions, once parameters are fixed, likelihoods and likelihood ratios in our mixture models factorise over markers. Further, in the case of related contributors, logarithms of likelihoods and likelihood ratios are weighted averages of those obtained from the individual IBD patterns. The weights are the probabilities of the patterns, the posterior probabilities in the case where some of the relatives have been typed. An option in KinMix allows exploiting these facts, with the effect of considerably reducing the storage requirements when many markers are involved, since separate BNs are used for each marker/IBD pattern combination.

Simulations
In this section, we examine the performance of log 10 -likelihoods, based on the Cowell et al. (2015) model, at discriminating between different joint relationships, in a study using simulated electropherogram (EPG) and genotype data. Each simulated data set consists of genotype profiles generated from a prescribed 'True' model, using the generative model in Section 3.2; from each we generate artificial EPG data using the pcrsim (Hansson 2017) package, which simulates the DNA amplification process. These PCR simulations were based on using the AmpF STR TM SGM Plus TM PCR Amplification Kit, with the Norwegian SGM database, on 10 STR markers, together with Amelogenin, all available in pcrsim. The numbers of cells amplified varied from 200 to 50 1 , The resulting data are analysed using KinMix (Green 2020a), under a variety of assumed models. In each case, we compute the log 10 likelihood ratio against a baseline model that assumes the same number of unrelated contributors. These log-likelihood-ratios are tabulated or graphed for a small number of replicates of the PCR process, for each of a small number of independently generated genotype profiles, thus giving an idea of the variation attributable to these two sources. Parameters for the DNAmixtures peak height model are estimated by maximum likelihood using that package, assuming the appropriate number of unrelated contributors, estimated separately on each simulated EPG.
In some of our experiments, we are illustrating use of these log-ratios as test statistics about hypothesised joint relationships, in others simply as measures of support for the hypothesised model, weights of evidence on a familiar scale.
Each analysis in the following experiments involves specifying the true joint relationship between the actors involved, the hypothesised relationship(s), which of the actors contribute to the mixture, and which if any of the actors are genotyped.

Two-way relationships
In this experiment, we study DNA mixtures with two contributors, and no other typed actors. We consider 5 possible familial relationships between the two contributors; for each relationship, we simulate EPG data and fit mixture models hypothesising each of the 5 relationships in turn. In Table 7, we report the median log 10 LR for each of the hypothesised models, over 4 replicated EPGs for each of 4 replicated genotypes.
The variation in these log 10 LRs across replicate genotype profiles and EPGs is depicted in Figure  7. The rows of the figure correspond to the true relationships, and the columns to the 4 replicate genotype profiles. Within each panel we see a colour-coded diagram showing the variation in log 10 LR over the 4 EPG replicates. The log 10 LRs when parent-child is hypothesised are suppressed from the Figure as they take extreme values, as can be seen in the parent-child column of Table 7. The highest median log 10 LRs are all found down the diagonal, so that if we select a model on the basis of the largest, then on average, we correctly identify the true model in all 5 cases. This is most pronounced when the true model is parent-child, a pattern that is perhaps to be expected given the κ coefficients, also tabulated in Table 7. Some of the other relationships are harder to distinguish. Also, recall that many relationships, like half-sibs, aunt/uncle, grandparent etc. have identical IBD pattern distributions and κ coefficients. Perhaps a more meaningful interpretation is that on an individual-replicate basis, for the 5 true relationships, in 11,11,9,4 and 12 out of the 16 = 4 × 4 replicates respectively, the correct model was identified.
When there is additional information, such as the genotypes of individuals potentially related to mixture contributors the evidence becomes much stronger, as we will see in some of the following examples.

Three-way relationships
This experiment is exactly similar to the previous one, except now we consider relationships between 3 related contributors. The five considered relationships are respectively trio (mother, father and child); a mother and two children; 3 sibs; 3-cousins-cyclic, 3-cousins-star; the last two are defined and illustrated in Section 2.3. Again, there are no typed actors.  Table 8 gives the median log 10 LR relative to the baseline (three unrelated contributors) for comparing each combination of true and hypothesised relationship model, over 4 replicated genotypes by 4 replicated EPGs. The highest median log 10 LRs are again all found down the diagonal, implying that all 5 relationships are correctly identified on average. This effect is strongest when the relationships are mother and 2 children, or 3 sibs. In these two scenarios the 3 contributors all have a close pairwise relationship, whereas in the trio, the mother and father are unrelated. Also, the 3 cousins scenarios have more distant relationships among each other, and of course have identical pairwise relationships, so are harder to distinguish. On a per-replicate basis, for the 5  Figure 7: Variation in the log 10 LRs across replicate genotype profiles and EPGs for the two-way example. The rows of the figure correspond to the true relationships, and the columns to the 4 replicate genotype profiles. Within each panel log 10 LR is plotted against hypothesised relationship (colour-coded: blue, red, green and orange for sibs, half-sibs, cousins and half-cousins, respectively). Lines join values corresponding to the same EPG replicate. The log 10 LRs when parent-child is hypothesised are suppressed from the Figure as they take extreme values, as can be seen in the parent-child column of Table 7. true relationships, in respectively 6, 9, 15, 8, 11 out of the 16 = 4 × 4 replicates, the correct model was identified. Figure 8 shows the variation in log 10 LRs within the genotypes and across the 4 replicated EPGs in each row panel for 5 different 3-way relationships. As in the previous section, each row corresponds to a true relationship, and the four panels to the genotype profile replicates.

Three-way relationships, with a relation genotyped
Here we consider 4 brothers, and DNA mixtures simulated from a true model in which three of the brothers are contributors. We consider testing H p : 3 brothers contributed to the mixture vs. H 0 : the 3 contributors are unrelated, and drawn from the gene pool. We perform this test with and without the assumption that the 4th brother is genotyped, yielding genotype Bgt, and as usual we generate 4 replicate joint genotype profiles, and 4 replicate EPGs for each.
This kind of case can arise when brothers are engaged in a joint criminal activity and DNA might be found on, e.g. a get-away car, balaklava, banknote, crowbar, or gun.
The log 10 LR results are shown by replicate in Table 9 and Figure 9. Note that in most but not all of the 16 = 4 × 4 replicates, there is much greater weight of evidence that the 3 brothers are in the mixture when the 4th brother's genotype is available.

Incestuous sibs
Our remaining examples consider incestuous relationships in two person DNA mixtures of unknown or partly known contributors, in cases of possibly incestuous relationships. Unlike some other software, KinMix does deal correctly with incestuous relationships. This section concerns incest between sibs. The setup we consider is of a father/mother/child trio, and a mixture where the contributors are the mother and child. Cases like this occur when a mother-fetus mixture is found and we wish to test the paternity. The hypotheses entertained are H p : the father and mother are siblings, as in the pedigree in Figure 10(a) and H 0 : the father and mother are unrelated. The EPG data are simulated under H p in this experiment. Table 10 shows the IBD pattern distribution of the three genotypes under H p .
The results for testing whether there was incest or not are shown in Table 11 and Figure 11, the table giving medians over the replicates, the figure showing the variation across 4 × 4 replicates. In this example, there is very little variation across EPGs with genotype profiles.
Some of the dependency visible here on which actors are genotyped may seem counter-intuitive. For example, why does typing both Father and Child give apparently less clear evidence of incest than typing either one of Father and Child or neither separately, and indeed in some replicates give  Figure 8: Variation in the log 10 LRs across replicate genotype profiles and EPGs for the three-way example. The rows of the figure correspond to the true relationships, and the columns to the 4 replicate genotype profiles. Within each panel log 10 LR is plotted against hypothesised relationship (colour-coded: blue, red, green, orange and purple for trio, mother and two children, 3 sibs, 3cousins-cyclic, 3-cousins-star, respectively). Lines join values corresponding to the same EPG replicate. All values are truncated below at −40 before plotting.   1 2 1 2 1 1  0.125 1 2 1 2 1 2  0.125 1 2 1 3 1 1  0.125 1 2 1 3 1 2  0.125 1 2 1 3 1 3  0.125 1 2 1 3 2 3 0.25 1 2 3 4 1 3 evidence against incest? It is important to remember that as in all examples in our simulations, the focus of interest is on the log 10 LR from the peak height data given the stated available genotypes. Taking the peak height data and genotype data together (not shown here), as expected, removes the apparent paradox. A more careful study of the conditional dependencies in this example reveals that the peak heights convey no information about incest if the Mother and one or other or both of the Father and Child are typed.  Figure 11: log 10 LRs for the incestuous sibs example. The panels represent the replicated genotype profiles. In each panel, variation in log 10 LRs is shown across replicated EPGs, colour-coded by which actors are genotyped: dark blue, red, green, orange and purple for F and C, F, M, C and none, respectively.

Incest and rape
Here we consider the horrendous scenario where a child has been raped, and we have a mixed trace from her vagina. The suspected culprit is her maternal grandfather, and additionally there is some suspicion that the grandfather is also her father, that is, that she is the offspring of an incestuous relationship between her mother and maternal grandfather. We assume the child is the major contributor to the mixture, and that there is one other contributor. The pedigree for the case of incest is shown in Figure 10(b), and this pedigree is assumed in simulating our EPG data for this study. In all cases we do not question that M is the mother of C, and that GF is the father of M. Table 12 shows the IBD pattern distribution of the three genotypes for an incestuous family where the maternal grandfather GF of a child C is the father of the mother M. Table 12: IBD pattern distribution for the incestuous case where the grandfather GF of a child C is the father of the mother M.
For brevity, in this example we will use the term rape to mean that the GF is the 2nd contributor to the mixture, and incest to mean that GF is father of C. We wish to examine whether it is possible from the DNA mixture and any typed genotypes to distinguish the possibilities of incest and/or rape. Table 13 reports the log 10 LRs for each of (i) rape assuming incest, (ii) incest assuming rape, (iii) rape assuming no incest, and (iv) incest assuming no rape. In Figure 12 are boxplots showing the variation in the log 10 LRs by replicate.
Note that for the test of incest assuming no rape, when the child's genotype is known, a conditional independence argument confirms that the log 10 LR is identically 0. We can see that there tends to be a stronger signal for rape than for incest, and also that when we have the additional information on the child's genotype the log 10 LR becomes much larger. Having only the mother's genotype does not make a substantial difference.

Real case applications
Here we analyse a real case related to a missing male, provided by the Forensic Sciences Institute, University of Santiago de Compostela. We refer to Green et al. (2020) for more details of the case analysis. Here we will revisit some of the results, and also compare them with an analysis made with different software.
In this case, only a daughter of the missing male was available to donate a DNA sample. This is not the ideal situation since it is known, for example, that false DNA matches can be found after a  Figure 12: Variation in the log 10 LRs across replicate genotype profiles and EPGs for the incest and rape example. The rows of the figure correspond to the various tests, and the columns to the 4 replicate genotype profiles. Within each panel log 10 LR is plotted against which actors are genotyped (colour-coded: blue, red, green, and orange for mother and child, mother alone, child alone, or no one, respectively). Lines join values corresponding to the same EPG replicate. mass comparison of profiles in a database when only one relative is available as a reference sample. In order to augment the reference genetic data, a toothbrush and a razor-blade, presumably used by the missing person, were also collected. DNA from both objects was recovered and analysed on 21 markers included in GlobalFiler TM PCR amplification kit (ThermoFisher). The reference sample from the daughter of the missing male was also genotyped with the same kit. Two different two-person DNA mixtures were detected, one of each object. An excerpt of the data is shown in Table 14, showing the alleles and peak heights in the two DNA mixtures found on the toothbrush T and the razor-blade RB. The DNA profile of the daughter, denoted by D, is also shown.

Results
Here we analyse the DNA mixtures found on the toothbrush T , and a razor-blade RB, presumed used by the missing person. We assume known allele frequencies taken from the Spanish allele frequency database collected on n = 284 individuals (Garcìa et al. 2012). In all the analyses presented we adopt a threshold of 50 rfus (relative florescent units).   Table 15 shows the estimated parameters ψ = (µ, σ, ξ, φ) in the Cowell et al. (2015) model for the analysis of the DNA mixtures found on T and RB. We assume there are 2 unknown contributors U 1 and U 2 to both T and RB; these are not necessarily the same individuals in the two samples. An analysis performed for 3 unknown contributors (not shown here) yielded an almost vanishing proportion for the third contributor. The estimated proportion of DNA for the two contributors for sample T is large for the major contributor U 1 , φ U 1 = 0.93, whereas, for sample RB, U 1 and U 2 contributed in almost equal proportions to the mixture, φ U 1 φ U 2 = 0.5. Note that in these models, the likelihood can have a complicated shape and be difficult to safely maximise numerically, and in the latter case the estimation of the LR and other inference is problematic. The values in Table 15 are the maximum likelihood estimates, as calculated by DNAmixtures. Table 16 shows the LR and log 10 LR for testing H p : D is the child of U 1 (and similarly for U 2 ) vs. H 0 : no unknown contributors are related to D. For item T , log 10 LR = 10.97 is large, pointing to U 1 being the father of D. It is also substantial for the hypothesis that U 2 is the father of D. In this section, we refer only to the DNA mixture on the toothbrush. Table 17 presents a marker-wise comparison between the likelihood LR and the overall log 10 LR, for comparing H p vs. H 0 , when using KinMix with and without the peak height data and using relMix (Hernandis et al. 2019). The R package relMix, like KinMix, analyses DNA mixtures involving relatives, but is based only on allele presence, not considering the peak heights when modelling the DNA mixture. Note that on comparing columns 2 and 3 to column 4 in Table 17, for only 2 markers out of 20 does using the peak heights yield a smaller log 10 LR than using allele-presence data alone. The results obtained with relMix and KinMix when not including the peak height information are quite similar, although based on different models. The overall log 10 LR on all the markers computed by KinMix, with peak heights, is 10.97, compared with the result 9.53 when not using the peak height information, a LR 27.5 times smaller.

Comparison with relMix
It seems that relMix is not able to compute the LR when there are more than 16 different alleles on a marker in the allele frequency database, as is the case for markers D12S391 and SE33. We indicate this in Table 17 by NaN. When using only the markers that relMix is able to compute, the partial log 10 LR obtained with KinMix with peak height information, is 9.94, is again substantantially bigger than that obtained without peak height information (8.35 with relMix, 8.42 with KinMix).
The time to do the computations with relMix is considerably longer; it takes 1,698 seconds (almost half an hour) compared to 2.28 seconds for KinMix without peak height information, and the 1.38 seconds that KinMix takes using peak heights. (These times were obtained with an i7-7600U processor clocked at 2.80GHz.) This discrepancy between computational times is likely to be due to the fact that KinMix represents relationships in a DNA mixture in a compact, time-efficient way by representing genotypes using a Markov structure in a Bayesian network, so that computations can be done linearly in number of alleles. On the other hand, relMix explicitly enumerates all possible combinations of alleles which is a combinatorially intensive computational task.

Discussion
We have shown that the IBD pattern distribution for a collection of related individuals, which extends Jacquard's concept of coefficient of identity by descent beyond pairwise relationships, is an invaluable approach both to encoding relationships and to Bayes net computations for DNA mixture analysis involving family relationships. Implementation of these ideas in the package KinMix, extending DNAmixtures, provides a convenient, powerful and flexible means for delivering the computations needed for DNA mixture analysis, using peak heights, involving arbitrarily complex relationships.

Appendix 1: Algorithms in pseudo-code
See Algorithm 1 for the pseudocode for the generative model. The algorithm for the construction of the CPTs for the Bayes nets is essentially the same except that random variables are replaced by their probability distributions in table form, as in Algorithm 2.
The notation used is as follows. The input variables are a coding of some of the information illustrated in Table 3: π is one of the permuted patterns that can generate the typed individual genotypes (in this example, 5 in number); ncontr is the number of mixture contributors whose genotypes's conditional distributions we are modelling (in the example, 2: Fgt and Mgt), these are numbered 1 = 1, 2, . . . , ncontr in the pseudocode; g = 1, 2 indexes an individual's paternal and maternal genes; draw πig is a boolean variable, true if the gth gene of individual i has to be drawn from the gene pool under permuted pattern π, false if it is determined by the typed genotypes; par πig is an integer parameter, the allele index if draw πig is false, a running count of genes to be drawn under this pattern if draw πig is true; p(π) is the probability distribution of π given the observed genotypes (in this example, given Cgt and GFgt); ndraws is the maximum over π of the number of draws needed under pattern π; (q a ) are the allele frequencies . As in our example so in general, all of these input values are easily determined from the specified IBD pattern distribution and the allele frequencies.
The output variables are the genotype allele count arrays (n ia ) for the mixture contributors {i}, whose distribution conditions on the observed typed genotypes of their relatives, together with the latent variables π, the permuted pattern, and (m ja ), the allele count arrays for the founding genes.

Appendix 2: UAF in complex problems involving relationships
Probably omit this appendix, moving it to supplementary material? Let T be the typed genotypes, C the genotypes of the contributors to the mixture, M the meiosis pattern, f the founding genes, and q the allele frequencies.
Then we want P (C|T ). Clearly Similarly, if f C|T,M has n a copies of allele a, then p(f T,M |q)p(f C|T,M |q)p(q)dq = Γ( a δ a ) a Γ(δ a + n a + n a ) Γ( a (δ a + n a + n a )) a Γ(δ a ) = DM (n + n ; δ).
(These Dirichlet-Multinomial distributions are what is evaluated in the Pólya urn BN.)

Computation of P (C|T )
T is observed, so we only need P (T |M, f T,M ) for a single value of T , and therefore P (T ) becomes an explicit sum over M and f T,M , with only a small number of non-zero terms. We need P (C|T ) for all possible values of C, so to compute P (C|T ) we use a BN whose nodes include f C|T,M and M (T being now fixed and f T,M a fixed function of M ). In this BN, if the evidence and joint probability has product P (C, T )/P (T ), then the normalising constant will evaluate P (C|T ). However, note that n are the allele counts corresponding to f T,M , so for fixed T vary with M . This means that the Pólya urn component of the BN will have the node corresponding to M as an additional parent. Alternatively, we have to loop over M , running a separate BN for each value (for which p (M ) is non-zero), and combining them afterwards.