Allele frequency‐free inference of close familial relationships from genotypes or low‐depth sequencing data

Abstract Knowledge of how individuals are related is important in many areas of research, and numerous methods for inferring pairwise relatedness from genetic data have been developed. However, the majority of these methods were not developed for situations where data are limited. Specifically, most methods rely on the availability of population allele frequencies, the relative genomic position of variants and accurate genotype data. But in studies of non‐model organisms or ancient samples, such data are not always available. Motivated by this, we present a new method for pairwise relatedness inference, which requires neither allele frequency information nor information on genomic position. Furthermore, it can be applied not only to accurate genotype data but also to low‐depth sequencing data from which genotypes cannot be accurately called. We evaluate it using data from a range of human populations and show that it can be used to infer close familial relationships with a similar accuracy as a widely used method that relies on population allele frequencies. Additionally, we show that our method is robust to SNP ascertainment and applicable to low‐depth sequencing data generated using different strategies, including resequencing and RADseq, which is important for application to a diverse range of populations and species.

: Scatterplot of the two relatedness coefficients k 1 and k 2 (denoted Z1 and Z2 in the output of PLINK) for each relationship category in the HGDP data. Estimates of the two relatedness coefficients are from the allele frequency-based approach implemented in PLINK. Each pair of individuals within each population is represented by a point, here they are paneled by the inferred relationship category: PO = parent-offspring, FS = full sibling, HS = half sibling, C1 = first cousin, DR = unknown relationship / distantly related, UN = unrelated. Also see figure 3 in the main text.  Figure S4: Pairwise scatterplots of R1, R0, and KING-robust kinship, and also k 1 vs k 2 when difficult to call relationships are excluded. Each pair of individuals within each population is represented by a point, which is colored according to the relationship category inferred using an allele frequency-based approach. Shaded areas and lines show the expected ranges for specific relationship categories, as in Figures 2 and 3 in the main text. PO = parent-offspring, FS = full sibling, HS = half sibling, C1 = first cousin, DR = unknown relationship / distantly related, UN = unrelated. Relationships were deemed difficult to call when the PLINK kinship was within 0.02 of the cutoff between two relationship categories.   2 Supplemental Texts 2.1 Text S1: Derivations of the expectations of R0, R1, and KINGrobust kinship We will here derive expressions for R0, R1 and KING-robust kinship for a range of different pairwise familial relationships. Based on these expressions we will then determine the joint range of expected values for R0 and R1 as well as R0 and KING robust kinship shown in figure 2 in the main text.

Assumptions and notation
In the below derivations will assume that we are analyzing data from two individuals, 1 and 2 that are not inbred and that are from the same homogeneous population. Additionally, we will assume that we have genotype data for n sites from both individuals and that all sites n sites are in Hardy-Weinberg equilibrium. In terms of notation, we will denote the two individuals' genotypes at given site s as g 1 and g 2 , with g 1 ,g 2 ∈ {0, 1, 2} corresponding to the number of copies of a specified allele (e.g., the derived allele) carried by individual 1 and 2, respectively. Also, we will denote the population frequency of the specified allele at site s as f . Furthermore, we will use the capital letters A through I to denote the probability of each of the nine different genotype pairs as shown in figure 1 in the main text. So e.g., A denotes the probability that both individuals have the genotype 0 at a site. Finally, we will use k 0 , k 1 , and k 2 to denote the probability that the two individuals share 0, 1 or 2 alleles identical-by-descent (IBD), respectively. The expected values of K = (k 0 , k 1 , k 2 ) for different familial relationship can be seen in table S1. Half siblings/avuncular/grandparent-grandchild (HS)

Derivations of A through I
To derive the expected value of R0, R1 and KING-robust kinship for different familial relationship pairs, we first derive expressions for A to I, see also Toro et al. (2011) for a similar set of derivations, but notice they have a slightly different definition of k 1 . We note that in general it must hold that in site s: where Z is an indicator of whether the two individuals share 0, 1 or 2 alleles IBD in a given site.
Since we are assuming that the two individuals 1 and 2 are not inbred and that all sites are in Hardy-Weinberg equilibrium, the values of P (g 1 |f ), P (g 2 |g 1 , Z = 0, f ), P (g 2 |g 1 , Z = 1, f ) and P (g 2 |g 1 , Z = 2, f ) must be those given in tables S2 to S5.

Derivation of the expected values of R0
With the above expressions for A through I, we can derive an expectation of R0 for different relationships. We do this by first noting that: Inserting the expected values of k 0 , k 1 and k 2 from table S1 into this formula we get that if individuals 1 and 2 are a PO pair R0 is expected to be Similarly, for monozygotic twins (MZ), full siblings (FS), half siblings/avuncular/grandparentgrandchild (HS), first cousins (C1), second cousins (C2) and unrelated (UR) we expect R0 to be: which for f ∈]0, 1[ has the range ]0, 1 10 ].
which is constant independent of the value for f ∈]0, 1[.

Derivation of the expected values of R1
With the above expressions for A through I we also get that Inserting the expected values of k 0 , k 1 and k 2 from table S1 into this formula we get that if individuals i1 and i2 are a PO pair R1 is expected to be Similarly, for MZ, FS, HS, C1, C2 and UR we expects R1 to be: which for f ∈]0, 1[ has the range ] 1 2 , 10 13 ].

Derivation of the expected values of the KING-robust kinship estimator
Using the above expressions for A through I, the KING-robust kinship estimator (Manichaikul et al., 2010) can be re-written as: Hence, as expected, the expectation of the KING-robust kinship estimator is k1 4 + k2 2 (which is the definition of kinship) regardless of the allele frequencies. Thus using the values in table S1 this means that the expected KING-robust kinship estimate is 1 2 for MZ, 1 4 for both PO pairs and full siblings, 1 8 for HS, 1 16 for C1, 1 64 for C2 and 0 for unrelated pairs.

Joint ranges of R1 and R0
Above we derived the ranges of the expectation of each of R1 and R0 for different relationships.
To get the joint ranges of the two, we note that the two ratios are not independent, because E is a part of both ratios. More specifically, (R1,R0) as a function of f ∈]0, 1[ for each of the different relationships considered here is shown by the solid lines in figure 2A in the main text. As this figure reveals, these are either single points (for PO) or concave, which means that for a combination of frequencies -and thus when more sites than one is considered -the ranges will be in the colored ranges inside the solid lines. It is important to note that these are simply ranges of expectations, because they are based on expected values of k 0 , k 1 and k 2 for the different relationship. Hence, the realized values for a given pair will not necessarily lie inside the ranges shown as the realized values of k 0 , k 1 and k 2 may differ from the expected values because the realized values for any related pair -except for parent off-spring and monozygotic twins -will vary around the expected values of k 0 , k 1 , and k 2 due to the randomness in the recombination process. E.g. a pair of half siblings are expected to have (k 0 , k 1 , k 2 )=(0.5, 0.5, 0) but can in practice end up with e.g. (k 0 , k 1 , k 2 )=(0.55, 0.45, 0) or (k 0 , k 1 , k 2 )=(0.45, 0.55, 0). This will lead to values outside the expected range. In other words, the expectations derived here are expected values for our statistics in the same way as the values in table S1 are the expected values for K=(k 0 , k 1 , k 2 ).

Joint ranges of R1 and the KING-robust kinship estimator
Above we also derived the ranges of the expectation of each R1 and KING-robust for different relationships. We get the joint ranges from figure 2B in the main text by simply combining these. Again, it is important to note that these are simply ranges of expectations, because they are based on expected values of k 0 , k 1 and k 2 for the different relationship. Hence, the realized values for a given pair will not necessarily lie inside the ranges shown as the realized values of k 0 , k 1 and k 2 may differ from the expected values.

Text S2: The IBS method
In this section, we will describe technical details of the IBS method introduced in the main text. Specifically, this method aims to infer the frequency of different genotype combinations, p, for a pair of individuals, 1 and 2 from directly from sequencing read data. This is done using genotype likelihoods, i.e. probabilities of the read data given different genotypes, for each of the two individuals. These genotype likelihoods better reflect the uncertainty of the true genotypes that is inherent to low depth sequencing data. Briefly, this is accomplished by summing over all possible genotypes of the two individuals and weighting the probabilities using the corresponding genotype likelihoods. To fully describe the IBS method, we introduce the following notation: • S denotes the number of sites • p is the vector of the frequencies of the genotype combinations that we aim to estimate With this notation it must hold that: where the genotype likelihoods for the two individuals can be calculated using tools like ANGSD (Korneliussen et al., 2014) and where the probability of each possible genotype combination, P ((Q s 1 , Q s 2 ) = (q 1 , q 2 )|p), is simply the element of p that corresponds to this genotype combination. This equation provides us with a likelihood function for the parameter, p, and we use this likelihood function to perform maximum likelihood estimation of p. In practice, this is done using an EMalgorithm which we have added to the software tool ANGSD with the name IBS.
We note that we have also added other similar models to ANGSD. The only difference between these and the model presented here is that fewer parameters are estimated, i.e. p is a shorter vector, and that P ((Q s 1 , Q s 2 ) = (q 1 , q 2 )|p) is defined differently as a consequence. Those other models are not used in this paper.

Individuals excluded due to signs of admixture or inbreeding
We ran ADMIXTURE (Alexander et al., 2009) separately for each of the seven target populations. In each ADMIXTURE analysis we also include French and Han samples, to aid in identifying European or East Asian admixture, respectively. For the non-African target populations we also include Yoruban samples to identify African admixture. We excluded 16 samples with >5% contribution from more than once ancestry component. We estimated the inbreeding coefficient, F, for each of the remaining individuals using PLINK and excluded two individuals with f> 0.0625. This left us with a total of 142 individuals from the seven populations: Surui N=20, Pima = 20, Karitiana N=21, Maya N=16, Melanesian N=19, Biaka Pygmies N=31 and Mbuti Pygmies N=15.