Graphics for relatedness research

Abstract Studies of relatedness have been crucial in molecular ecology over the last decades. Good evidence of this is the fact that studies of population structure, evolution of social behaviours, genetic diversity and quantitative genetics all involve relatedness research. The main aim of this article was to review the most common graphical methods used in allele sharing studies for detecting and identifying family relationships. Both IBS‐ and IBD‐based allele sharing studies are considered. Furthermore, we propose two additional graphical methods from the field of compositional data analysis: the ternary diagram and scatterplots of isometric log‐ratios of IBS and IBD probabilities. We illustrate all graphical tools with genetic data from the HGDP‐CEPH diversity panel, using mainly 377 microsatellites genotyped for 25 individuals from the Maya population of this panel. We enhance all graphics with convex hulls obtained by simulation and use these to confirm the documented relationships. The proposed compositional graphics are shown to be useful in relatedness research, as they also single out the most prominent related pairs. The ternary diagram is advocated for its ability to display all three allele sharing probabilities simultaneously. The log‐ratio plots are advocated as an attempt to overcome the problems with the Euclidean distance interpretation in the classical graphics.

Relatedness investigations can be carried out in an entirely numerical manner by inspecting estimated IBS (identity by state) and IBD (identity by descent) probabilities, likelihood ratios or confusion matrices (Boehnke & Cox, 1997; Epstein, Duren, & Boehnke, 2000).
Graphics greatly facilitate the interpretation of the results of relatedness studies and are increasingly being used (Abecasis, Cherny, Cookson, & Cardon, 2001; Pemberton, Wang, Li, & Rosenberg, 2010; Rosenberg, 2006). The main aim of this article was to summarize the state of the art of the graphical methods used in relatedness research. Relatedness investigations are based on allele sharing, and we will consider techniques that use IBS alleles as well as those using IBD alleles. A plot of the means against the standard deviations of the IBS counts is a powerful tool to detect relatedness (Abecasis et al., 2001). We explore this tool in detail and establish the domain of this graphic from a mathematical point of view. Plots of the proportions of markers with 0, 1 or 2 IBS counts ($p_0$, $p_1$ or $p_2$) are often used to assess the existence of family relationships (Rosenberg, 2006). Nevertheless, if the researcher is interested in identifying the degree of relatedness, plotting the probabilities of sharing 0, 1 or 2 IBD alleles ($k_0$, $k_1$ or $k_2$) is the best strategy. The IBD probabilities depend directly on relatedness and enable us to accurately infer the type of relationship. In addition to the former graphical methods, we propose to use graphics from compositional data analysis (CoDA) for both IBS and IBD allele sharing studies. Because the proportions $(p_0, p_1, p_2)$ and the probabilities $(k_0, k_1, k_2)$ are constrained to sum to one, it is possible to apply all the graphical and analytical CoDA techniques introduced by Aitchison (1986) and further developed by Pawlowsky-Glahn and Buccianti (2011).
Two graphics commonly used in CoDA are of particular relevance for relatedness studies: the ternary diagram (also known as a de Finetti diagram in genetics) and a scatterplot of log-ratios. We show the ternary diagram to be useful for plotting the proportions of the IBS counts and for plotting the estimated Cotterman coefficients (IBD probabilities). Moreover, the theoretical IBD sharing probabilities for the standard family relationships can be used as reference points in the ternary diagram (Thompson, 2000). Furthermore, the CoDA techniques allow us to introduce the isometric log-ratio coordinates (ilr-coordinates) of the vectors $p = (p_0, p_1, p_2)$ and $k = (k_0, k_1, k_2)$, which we can represent in a scatterplot. These ilr-coordinates allow us to measure the degree of similarity between two vectors of IBS proportions or IBD probabilities. The graphics we propose are of universal value and can be used in any relatedness study that concerns diploid individuals.
The remainder of this article is organized as follows. Section 2 gives an overview of the IBS allele sharing analysis and the graphical methods used to detect family relationships. Section 3 presents the basic principles of IBD estimation and the most common graphics used for relatedness estimation in the IBD context. These two sections also detail the graphical methods from the field of CoDA used in IBS-IBD approaches: the ternary diagram and the scatterplot of log-ratios. Section 4 presents a way to enhance IBS and IBD graphics with convex hulls that express the degree of uncertainty about a relationship. Section 5 presents a case study with individuals from the Maya population. Finally, Section 6 summarizes the principal conclusions of this article and discusses the pros and cons of each graphical method.

2 | IBS STUDIES
IBS studies disregard whether the alleles of the diploid individuals are derived from a common ancestor. IBS allele sharing concerns the number of matches between the alleles of the genotypes of two individuals. Two diploid individuals can share 0 (e.g., A1/A1 and A2/A2 or A1/A2 and A3/A3), 1 (e.g., A1/A1 and A1/A2 or A1/A2 and A1/A3) or 2 (e.g., A1/A1 and A1/A1) IBS alleles for a specific genetic marker, and we will refer to these as IBS counts. To detect family relationships in a given population of n individuals and m genetic markers, the number of matches between IBS alleles (the IBS counts) is considered for each pair of individuals across genetic markers.
That is, we move from a data set of n individuals and m genetic markers to a data set of $\binom{n}{2}$ pairs of individuals with the information of the IBS counts for m genetic markers. There are different ways to deal with this type of data, as described below. First, we focus on the plot of means and standard deviations of the IBS counts (Abecasis et al., 2001). Second, we detail the plot of the proportions of the IBS counts (Rosenberg, 2006). To conclude this section, graphics from CoDA (Aitchison, 1986; Pawlowsky-Glahn & Buccianti, 2011) are presented.
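The step from individuals to pairs can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the genotype representation (a tuple of two allele labels per marker) and the names `ibs_count`, `pairwise_ibs` and `toy` are assumptions made for this example.

```python
from itertools import combinations


def ibs_count(g1, g2):
    """Number of IBS alleles (0, 1 or 2) shared by two diploid genotypes.

    Each genotype is a pair of allele labels, e.g. ("A1", "A2").
    """
    a, b = g1
    c, d = g2
    # Alleles are unordered, so try both pairings and keep the best match.
    direct = (a == c) + (b == d)
    crossed = (a == d) + (b == c)
    return max(direct, crossed)


def pairwise_ibs(genotypes):
    """Map each unordered pair of individuals to its IBS counts per marker.

    `genotypes` maps an individual label to a list of genotypes,
    one per marker, in the same marker order for every individual.
    """
    return {
        (i, j): [ibs_count(g1, g2)
                 for g1, g2 in zip(genotypes[i], genotypes[j])]
        for i, j in combinations(sorted(genotypes), 2)
    }


# Hypothetical toy data: 3 individuals, 2 markers.
toy = {
    "ind1": [("A1", "A1"), ("A1", "A2")],
    "ind2": [("A1", "A2"), ("A1", "A3")],
    "ind3": [("A2", "A2"), ("A3", "A3")],
}
counts = pairwise_ibs(toy)  # 3 individuals give 3 pairs
```

With n individuals the resulting dictionary has n(n - 1)/2 entries, each holding m IBS counts.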
To illustrate the different IBS graphics that are introduced in this section, we use five pairs of individuals with the information of IBS counts and IBS proportions for 377 microsatellites (see Table 1). The individuals are from the Maya population, which we will analyse in Section 5. We consider a parent-offspring (PO) pair, a full-sib (FS) pair, a half-sib (HS), avuncular (AV) or grandparent-grandchild (GG) pair, a pair of first cousins (FC) and a pair of unrelated individuals (UN). We discuss the different graphics in the sections below.

2.1 | ($\bar{x}$, $s$)-plot
Let $x_{ijk}$ be the number (0, 1 or 2) of shared IBS alleles between individuals i and j for the genetic marker k. Abecasis et al. (2001) proposed to compute the mean ($\bar{x}_{ij}$) and variance ($s^2_{ij}$) over K genetic markers. The plot of $\bar{x}_{ij}$ versus $s_{ij}$ reveals characteristic clusters that correspond to the different family relationships for a given population.
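As a small illustration of the two statistics, the sketch below computes the mean and standard deviation of a list of IBS counts for one pair. The function name `mean_sd` is hypothetical, and the use of the sample standard deviation (denominator K - 1) is an assumption, since the text does not state which denominator is used.

```python
from statistics import mean, stdev


def mean_sd(ibs_counts):
    """Mean and sample standard deviation of the IBS counts of one pair.

    Assumption: the standard deviation uses the K - 1 denominator.
    Requires at least two markers.
    """
    return mean(ibs_counts), stdev(ibs_counts)


# A pair sharing exactly 1 IBS allele at each of 100 markers.
all_ones = mean_sd([1] * 100)   # mean 1, standard deviation 0.0

# A duplicated sample or monozygotic twin pair: 2 IBS alleles everywhere.
all_twos = mean_sd([2] * 100)   # mean 2, standard deviation 0.0
```

Applying `mean_sd` to every pair gives the coordinates of one point per pair in the plot.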
The statistics $\bar{x}_{ij}$ and $s^2_{ij}$ are constrained due to the limited number of outcomes (0, 1 or 2), and we proceed to derive their range of variation (Figure 1a). As an example, we consider a table with all possible outcomes of the allele sharing counts (0, 1 or 2) for a set of 100 markers. The rows of this table represent possible pairs of individuals. There are $3^{100}$ combinations (rows) if the order of the outcomes is considered relevant. However, in terms of means or standard deviations, the order of the IBS counts (0, 1 or 2) over the different markers is irrelevant; only their multiplicity matters. For example, a pair of individuals sharing 1 IBS allele for the first marker and 0 for all other markers will have the same mean and variance as a pair of individuals sharing 1 IBS allele for the k-th marker and 0 for all others. Mathematically, the combinations of the IBS counts for a pair of individuals form a multiset (Stanley, 1997, Section 1.2) of cardinality m (the number of markers) drawn from a basic set of cardinality k = 3 (the outcomes 0, 1 and 2). The possible number of ($\bar{x}$, $s$) pairs in the plot can be no larger than the number of multisets of cardinality m, where the latter is given by the multiset coefficient
$$\left(\!\!\binom{k}{m}\!\!\right) = \binom{k + m - 1}{m}.$$
Thus, for 100 genetic markers there will be at most $\binom{102}{100} = 5151$ pairs. The red points on the right-hand curve of the "umbrella" presumably correspond to parent-offspring relationships, as they have a mean larger than 1 and a low variance. The first point of the curve, with mean equal to 1 IBS allele and standard deviation equal to 0 IBS alleles, corresponds to an array of one hundred ones. The second point of the curve corresponds to an array of 99 markers with 1 IBS allele and one marker with 2 IBS alleles, and so on. In other words, this red curve represents the pairs of individuals who have a mean larger than or equal to 1 and the smallest standard deviation of all possible IBS counts. This can be related to the fact that the probability of sharing 1 IBD allele for a parent-offspring pair equals 1, as we will see in the next section (Table 2). For parent-offspring pairs, we have $\bar{x}_{ij} \geq 1$ because children share at least 1 IBS allele with each of their parents. For monozygotic twins (MZ) or duplicated individuals, we have $\bar{x}_{ij} = 2$ and $s_{ij} = 0$ (green point in Figure 1a). Figure 1b shows the ($\bar{x}$, $s$)-plot for the five Maya pairs in Table 1.
TABLE 1 Computations for five pairs of individuals from the Maya population. Mean and standard deviation of IBS counts, proportion of sharing 0, 1 and 2 IBS alleles ($p_0$, $p_1$, $p_2$) and estimated Cotterman coefficients ($\hat{k}_0$, $\hat{k}_1$, $\hat{k}_2$) are shown
The larger the mean of the IBS counts for any pair of individuals, the more likely they are to be closely related. The PO pair (red point) is located on the right-hand curve of the umbrella; the FS pair (blue point), with mean larger than 1, is separated from second- and third-degree family relationships (violet and gold points, respectively), whereas the unrelated individuals have the smallest mean (green point).
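The domain of the plot can be explored numerically by enumerating the multisets of IBS counts rather than all ordered arrays. The sketch below is only an illustration under stated assumptions: the function name `umbrella_points` is hypothetical, and the sample standard deviation (denominator m - 1) is assumed.

```python
from math import comb, sqrt


def umbrella_points(m):
    """Distinct (mean, sd) pairs reachable with m markers (m >= 2).

    A multiset of IBS counts is fully described by how many markers
    have count 0, 1 and 2 (n0 + n1 + n2 = m).
    Assumption: sd is the sample standard deviation (denominator m - 1).
    """
    points = set()
    for n2 in range(m + 1):
        for n1 in range(m + 1 - n2):
            n0 = m - n1 - n2
            mean = (n1 + 2 * n2) / m
            ss = (n0 * (0 - mean) ** 2
                  + n1 * (1 - mean) ** 2
                  + n2 * (2 - mean) ** 2)
            sd = sqrt(ss / (m - 1))
            # Round to suppress floating-point noise before deduplicating.
            points.add((round(mean, 12), round(sd, 12)))
    return points


m = 100
upper_bound = comb(m + 2, 2)   # multiset coefficient for k = 3 outcomes
points = umbrella_points(m)    # never more points than upper_bound
```

Plotting `points` as a scatterplot should reproduce the umbrella-shaped region in which every observed pair must fall.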

2.2 | ($p_i$, $p_j$)-plots
Let $x_{ij}$ be the vector of the IBS counts between individuals i and j, with length equal to the number of genetic markers in the data set. Let $p_0$, $p_1$ and $p_2$ be the proportions of 0, 1 and 2 IBS alleles, respectively, for each pair of individuals. Rosenberg (2006) proposed a graphical method for relatedness research by plotting the proportion of sharing 2 IBS alleles ($p_2$) versus the proportion of sharing 0 IBS alleles ($p_0$) for all pairs of individuals from a given population. Similarly, Sun (2012) uses IBS proportions for relatedness research by plotting $p_1$ versus $p_0$. In fact, any combination of the three proportions could be plotted for relatedness research. We refer to these graphics as ($p_i$, $p_j$)-plots (for i, j = 0, 1, 2 and i < j), where $p_i$ corresponds to the X-axis of the plot and $p_j$ to the Y-axis.
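The proportions follow directly from the vector of IBS counts. A minimal sketch, with the hypothetical function name `ibs_proportions`:

```python
from collections import Counter


def ibs_proportions(ibs_counts):
    """Return (p0, p1, p2): proportions of markers with 0, 1, 2 IBS alleles."""
    tally = Counter(ibs_counts)
    m = len(ibs_counts)
    return tuple(tally[c] / m for c in (0, 1, 2))


# Hypothetical IBS counts for one pair over four markers.
p0, p1, p2 = ibs_proportions([0, 1, 1, 2])   # (0.25, 0.5, 0.25)
```

Because the counts only take the values 0, 1 and 2, the mean of the IBS counts equals $p_1 + 2p_2$, which links these proportions to the horizontal axis of the ($\bar{x}$, $s$)-plot.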

2.3 | Ternary diagrams
Let $p$ be the vector $(p_0, p_1, p_2)$ of proportions of the IBS counts.
Because the three components of $p$ sum to one ($p_0 + p_1 + p_2 = 1$), we can plot the vector $p$ in a ternary diagram. Mathematically, the set of the vectors of proportions $p = (p_0, p_1, p_2)$ forms the simplex, $S^3$. Figure 3 shows the ternary diagram for the vectors of proportions for the five Maya pairs (Table 1).

2.4 | ilr-plots
Aitchison (1986) [...] used ilr-coordinates $z_0$, $z_1$ and $z_2$ of a vector of proportions $(p_0, p_1, p_2)$ are given by
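As a rough illustration of how ilr-coordinates of a three-part composition can be computed, the sketch below uses one common orthonormal basis built from a sequential binary partition. The choice of basis, the function names `ilr` and `aitchison_distance`, and the ordering of the coordinates are assumptions for this example and may differ from the definition used in this article.

```python
from math import dist, log, sqrt


def ilr(p):
    """ilr-coordinates of a 3-part composition with strictly positive parts.

    Assumed basis (one of several valid orthonormal choices):
      first coordinate  = ln(p1 * p2 / p0**2) / sqrt(6)
      second coordinate = ln(p1 / p2) / sqrt(2)
    """
    p0, p1, p2 = p
    if min(p) <= 0:
        raise ValueError("log-ratios need strictly positive parts")
    first = (log(p1) + log(p2) - 2 * log(p0)) / sqrt(6)
    second = (log(p1) - log(p2)) / sqrt(2)
    return first, second


def aitchison_distance(p, q):
    """Euclidean distance between the ilr-coordinates of p and q."""
    return dist(ilr(p), ilr(q))


# Hypothetical IBS proportions for two pairs.
close_pair = ilr((0.01, 0.50, 0.49))   # p0 near 0 inflates the first coordinate
other_pair = ilr((0.30, 0.40, 0.30))
```

Whatever orthonormal basis is chosen, the distance between two compositions in ilr-coordinates is the same, which is the property that motivates the log-ratio scatterplot. Parts equal to zero must be handled before taking logarithms; this sketch simply rejects them.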

3 | IBD STUDIES
Studies of relatedness based on IBD alleles rely on the probabilities that a pair of individuals shares 0, 1 or 2 IBD alleles. These probabilities are commonly referred to as Cotterman's coefficients (Cotterman, 1941) and denoted by the vector of proportions $k = (k_0, k_1, k_2)$. Table 2 shows the values of the Cotterman coefficients for some standard relationships. Cotterman's coefficients can be estimated by the maximum-likelihood method (Milligan, 2003; Weir, Anderson, & Hepler, 2006). The maximum-likelihood estimates reveal the most likely relationship for a pair given the observed genotype data. Let R represent a possible relationship between two individuals with genotypes $G_1$ and $G_2$, respectively. The likelihood of R is defined by the probability of observing $G_1$ and $G_2$ given relationship R. Further details are given by Wagner, Creel, and Kalinowski (2006). Under the assumption of absence of inbreeding, the inequality $k_1^2 \geq 4 k_0 k_2$ applies and constrains the Cotterman coefficients (Thompson, 1991). Analogously to the vector of proportions $p = (p_0, p_1, p_2)$ of the IBS counts, Cotterman's coefficients also satisfy $k_0 + k_1 + k_2 = 1$.
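The two constraints on the Cotterman coefficients are easy to check numerically. The sketch below lists the textbook values of $(k_0, k_1, k_2)$ for the standard relationships used in this article and tests them against both constraints; the names `REFERENCE` and `is_feasible` and the tolerance are assumptions made for this example.

```python
# Textbook Cotterman coefficients (k0, k1, k2) for non-inbred relatives.
REFERENCE = {
    "MZ": (0.0, 0.0, 1.0),        # monozygotic twins or duplicates
    "PO": (0.0, 1.0, 0.0),        # parent-offspring
    "FS": (0.25, 0.5, 0.25),      # full sibs
    "HS/AV/GG": (0.5, 0.5, 0.0),  # second-degree relatives
    "FC": (0.75, 0.25, 0.0),      # first cousins
    "UN": (1.0, 0.0, 0.0),        # unrelated
}


def is_feasible(k, tol=1e-9):
    """True if k lies on the simplex and satisfies k1**2 >= 4 * k0 * k2."""
    k0, k1, k2 = k
    on_simplex = min(k) >= -tol and abs(k0 + k1 + k2 - 1.0) <= tol
    return on_simplex and k1 ** 2 >= 4 * k0 * k2 - tol


all_feasible = all(is_feasible(k) for k in REFERENCE.values())

# A vector on the simplex that violates the no-inbreeding inequality.
outside = is_feasible((0.4, 0.2, 0.4))
```

Full sibs sit exactly on the boundary $k_1^2 = 4 k_0 k_2$, which is why a small tolerance is used in the comparison.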
We can use the same graphical techniques described for $p = (p_0, p_1, p_2)$ to identify relatedness from the estimated Cotterman coefficients $\hat{k}$. The Cotterman coefficients can be represented in a ($\hat{k}_i$, $\hat{k}_j$)-plot, in a ternary diagram or in an ilr-plot with the ilr-coordinates $z_0$, $z_1$ and $z_2$ defined in Equation (2) [...] Albrechtsen (2014) use the ($\hat{k}_1$, $\hat{k}_2$)-plot. The remaining possibility, the ($\hat{k}_0$, $\hat{k}_2$)-plot, could also be considered. Figure 5a shows the plot for the five Maya pairs (Table 1). The grey curve in the ($\hat{k}_0$, $\hat{k}_1$)-plot corresponds to the equation $k_1^2 = 4 k_0 k_2$. This curve, together with the hypotenuse and the vertical axis, delimits the feasible region $k_1^2 \geq 4 k_0 k_2$. PO pairs are points located on the $k_1$-axis with values close to 1; FS pairs are located close to the centre of the grey curve, according to the theoretical IBD probabilities (Table 2); and second- and third-degree pairs are located around the centre of the hypotenuse. UN pairs theoretically have $k_0 = 1$ and are located between the hypotenuse and the grey curve, near the vertex $\hat{k}_0 = 1$.
Finally, the origin of the ($\hat{k}_0$, $\hat{k}_1$)-plot is the position for any MZ pair.
As previously shown for IBS studies with the ($p_i$, $p_j$)-plots, only two of the three Cotterman coefficients are plotted, and the relative positions and distances between points vary depending on the ($\hat{k}_i$, $\hat{k}_j$)-plot used. For this reason, we propose graphics from CoDA.
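The dependence of distances on the chosen pair of coefficients can be verified with a short calculation. The sketch below uses the theoretical coefficients of a PO pair and an FS pair; the function name `projected_distance` is hypothetical.

```python
from math import dist

PO = (0.0, 1.0, 0.0)     # theoretical (k0, k1, k2) for parent-offspring
FS = (0.25, 0.5, 0.25)   # theoretical (k0, k1, k2) for full sibs


def projected_distance(a, b, i, j):
    """Euclidean distance between a and b as drawn in the (k_i, k_j)-plot."""
    return dist((a[i], a[j]), (b[i], b[j]))


d01 = projected_distance(PO, FS, 0, 1)   # sqrt(0.3125), about 0.559
d02 = projected_distance(PO, FS, 0, 2)   # sqrt(0.125), about 0.354
d12 = projected_distance(PO, FS, 1, 2)   # sqrt(0.3125), about 0.559
```

The same two relationships therefore appear noticeably closer together in the $(k_0, k_2)$-plot than in the other two plots, although the underlying coefficient vectors are unchanged.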

| Ternary diagrams
The theoretical IBD probabilities for the standard family relationships can be represented in a ternary diagram (Thompson, 2000). These probabilities form reference points against which the empirical estimates can be compared. Figure

| ilr-plots
It has been shown that the maximum-likelihood estimates of the

5 | CASE STUDY
We applied all the graphical methods detailed in the previous sections using empirical data extracted from a world-wide data set from the Noah A. Rosenberg Research lab at Stanford University (Rosenberg et al., 2002). This world-wide database is derived from the Human Genome Diversity Cell Line Panel (HGDP, Cavalli-Sforza, 2005). The genetic information is given by 377 microsatellites genotyped [...]. The graphics presented throughout this article were made with the R software (R Core Team, 2015) using the R packages ggplot2 (Wickham, 2009) and ggtern (Hamilton, 2015). Figure 7 shows all IBS graphics for all pairs of the Maya population.

5.1 | IBS graphics
In the ($\bar{x}$, $s$)-plot (Figure 7a), the points with the smallest standard deviation, close to the grey curve, are two PO pairs. The relationships of first and second degree are the points with a mean above 1. Note that some pairs of FC are mixed with UN pairs. Figure 7b (the ($p_0$, $p_2$)-plot) clearly separates the family relationships of first and second degree from the UN pairs. In the ternary diagram (Figure 7c), PO pairs lie on the side opposite the vertex $p_0$, meaning that $p_0$ is close to 0. The FS pair is the point closest to the vertex $p_2$, having the largest $p_2$; the violet points, which represent the family relationships of second degree, are separated from the green points representing UN pairs. In Figure 7d, the first ilr-coordinate ($z_{11}$) clearly discriminates first-degree relatives from UN pairs. Pairs with larger values for $z_{11}$ are more likely to correspond to related individuals. PO pairs are extreme outliers because they have $p_0$ values close to 0, which increases the first coordinate of the corresponding log-ratio. The scatterplot of the log-ratios is seen to produce a larger degree of separation between FS and PO pairs, and between first-degree relationship pairs and all other pairs. The convex hulls for the simulated related pairs in Figure 7 are seen to enclose the sample estimates of the PO, FS, HS and FC pairs and so confirm the assigned relationships.
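The convex hull check can be sketched as follows. This is not the authors' simulation procedure: the simulated points here are hypothetical two-dimensional coordinates standing in for simulated pairs of one relationship, and the hull is computed with the standard monotone chain algorithm. All function names are assumptions for this example.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))


# Hypothetical simulated coordinates for one relationship type.
simulated = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
hull = convex_hull(simulated)

observed_in = inside_hull(hull, (0.4, 0.6))    # observed pair inside the hull
observed_out = inside_hull(hull, (2.0, 2.0))   # observed pair outside the hull
```

An observed pair that falls inside the hull of the pairs simulated under a given relationship is consistent with that relationship; falling outside every hull suggests the pair does not match any of the simulated relationship types.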

5.2 | IBD graphics
We estimated IBD probabilities for all pairs of the Maya population.
All IBD graphics are shown in Figure 8. The ($\hat{k}_0$, $\hat{k}_1$)-plot (Figure 8a) separates the first-degree, second-degree and some third-degree pairs. In the ternary diagram of $\hat{k}$ (Figure 8b) [...] relationships. Moreover, as noted in Section 2, the Euclidean distance between two pairs in a ($p_i$, $p_j$)-plot is not invariant with respect to the chosen indices (0, 1 or 2); for example, it is not the same in a ($p_0$, $p_1$)-plot and a ($p_0$, $p_2$)-plot. ($\hat{k}_i$, $\hat{k}_j$)-plots have, in comparison with ($p_i$, $p_j$)-plots, the advantage that fixed reference positions for the standard relationships exist, as given in Table 2. This is of great practical value when inferring relationships. Moreover, IBD plots are more reliable for classifying relationships because they show a larger degree of separation between the different relationships than their IBS counterparts. This is clearly visible when one compares Figure 2 with 5a, 3 with 5b, 7b with 8a and 7c with 8b. However, the IBD-based ($\hat{k}_i$, $\hat{k}_j$)-plots suffer from the same problem as their IBS counterparts: the Euclidean distances between pairs (and reference points) depend on the indices (0, 1 or 2) that are used.
We comment on some peculiarities of the HGDP-CEPH database analysed in the article. We found the high estimate of k 1 (0.27) in