Web-based supplementary materials for “Computation of ancestry scores with mixed families and unrelated individuals”

Bioinformatics Research Center Department of Statistics and Biological Sciences North Carolina State University, Raleigh, North Carolina, U.S.A. J.S. Marron Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, USA. Fred A. Wright Bioinformatics Research Center, Department of Statistics and Biological Sciences, North Carolina State University, Raleigh, USA. *email: yihui zhou@ncsu.edu


Family member identification
Standard results for shared genotype probabilities for related individuals are expressed in terms of kinship coefficients and identity-by-descent probabilities. Here we clarify, as is needed for this paper, the correlation of genotypes between (say first-degree) relatives. We focus on siblings, although a slight modification of the argument applies to parent-child relationships. Let q denote the minor allele frequency, and a pair of siblings have random genotypes g 1 and g 2 , with means 2q and variances 2q(1 − q). We have corr(g 1 , g 2 ) = E(g 1 g 2 ) − E(g 1 )E(g 2 ) SD(g 1 )SD(g 2 ) = E(g 1 g 2 ) − (2q) 2 2q(1 − q) .
If IBD=1, without loss of generality, we assume the shared allele comes from the mother.
All of our proposed methods can use families identified by external software such as KING (Manichaikul et al., 2010), but we found it convenient to utilize a method within our same R environment. For row-scaled genotype matrix X, we expect family members to have high correlation. However, population stratification distorts interpretation, so we first compute the n × n correlation matrix of X, then forcing the maximum off-diagonal value to be 1/8. The spacing of successive eigenvalues from the resulting clipped correlation matrix are examined on the log-scale, identifying the matrix P of the top k eigenvectors that exceed 4 standard deviations from the mean spacing. The matrix X resid = X − XP P T is residualized for gross population stratification, and then X resid is again row-scaled to obtain X , and families identified using entries for which the n × n correlation matrix of X exceed η. We use η = 0.1 in this paper, effectively identifying first-and second-degree relatives. More careful investigation of higher-degree or cryptic relationships can use software for this purpose, beyond our intended scope.

Covariance-Preserving Whitening (CPW)
Here we describe an approach to modify the genotypes within families so that the final covariance matrix equals the modified covariance matrix M used for the matrix substitution approach, and families are orthogonalized while retaining their covariance with the remaining sample. The goal here is to find an n × n matrix B such that Y = XB T and 1 where the entire sample, including all families, is handled at once. There are multiple possible solutions, but it is appealing to add the constraint that only family members be modified, as singletons do not contribute to the problem of "spurious" ancestry scores. We assume that the columns of X are arranged with singletons S followed by families F. We then divide M (defined above) and B T into submatrices as follows, where n F individuals belong to the all-families set F, and I n−n F denotes an (n−n F )×(n−n F ) identity matrix. Note that M differs from M in only the co-family pairs of the lower right submatrix, and we will use M 22 to denote the corresponding n F × n F lower right submatrix of M . The form of B T , with the identity submatrix operating on the singletons in Y = XB T , achieves the desired constraint that singletons be unchanged. C and D are unknown matrices, to be solved for. We show below that the solution for full-rank X is , with a slight modification to account for our situation that X has rank n − 1. For Y = XB T , ancestry scores are obtained as V CP W in the SVD Y = U D V T CP W . Figure S1 (right panel) shows the result of applying covariancepreserving whitening to the CF data. The plot shows that, for the new matrix Y , crosscorrelations of families F to singletons S have indeed been preserved from the original X.
In fact even the correlations between members of different families have been preserved (not shown).

The Covariance-Preserving Whitening Solution
We have Comparing the last two expressions provides two equations in the two unknowns C and D.
From the upper right, we have M 11 C + M 12 D = M 12 , which implies C = M −1 11 M 12 (I 2 − D) (the lower left is the same equation written in transpose form). The lower right requires a bit more effort. We consider each of the four terms separately, plugging in the solution for C from above, giving and d does not simplify. We have Thus, finally, we have our solution The final expression is as desired, preserving singletons while rotating only the family members. The solution is unique if X T X is of full rank n. However, in our treatment, X has been row-centered, so no exact solution exists. To prove this by contradiction, suppose A exists such that AM A T =M . When X is row-centered, M has rank n − 1, and the rank of the left-hand side cannot exceed n − 1. However, when matrix substitution is implemented in practice, the resultingM typically has rank n, creating a contradiction. In practice, when X has been row-centered, we add a small value δ = 0.001 to the diagonal of M before proceeding, which provides similar results to using a Moore-Penrose generalized inverse when solving C and D. Either approach results in 1 p−1 Y T Y as a close approximation to M in simulations and for the real CF data.

Geometric Rotation / Family Whitening (FW)
We first describe the problem in geometric terms, to gain an understanding of the nature of the modification, and follow with the simple matrix operation analogue. Our solution is to rotate the data to make individuals within a family orthogonal, performed within a plane such that the impact of the data rotation is otherwise minimal. The approach is easiest to explain for a family of size 2, and the data for each individual is the scaled genotype p-vector. Data vectors for first-degree relatives are expected to have a 60 • angle, corresponding to a genotype correlation of 0.5 (Appendix A). We first find the mean vector of the two members, and then rotate each member away from the mean vector to a target angle of 45 • . This operation makes the new vectors orthogonal, which is approximately true for unrelated individuals.
In general, a family f consists of n f individuals indexed by the set F f . The target rotation angle θ f is the same as the angle in R n f between each coordinate unit vector and the In the plane determined byz F and z j , the unit vector with angle The vector x .j = µ j ||x .j || is the natural rescaling of µ j , and used as a replacement data vector for x .j . Finally the data vector for each family member is centered and rescaled to match the mean and variance of the original data. This rotation operation is conducted in succession for each family f = 1, ..., F , and SVD is applied to the new whitened data matrix.
Geometric rotation has a matrix operation interpretation, de-correlating the members of a family f by an operation similar to classical multivariate sphering. Let Z F f be the p × n f submatrix of scaled family genotype data, and R F f the corresponding (positive definite) identity correlation, and a final X In practice, geometric rotation and matrix whitening of the family are nearly identical, with slight differences due to handling of column centering, and the matrix approach is used subsequently. Figure S1 (left panel) shows the result of family whitening in the CF dataset, in terms of the correlation of columns of X F compared to those of X S . This shows that the family whitening operation introduces some perturbation of the correlation structure. We will return to this issue below.

Rationale for predicting true ancestry from ancestry scores for association covariate control
We suppose Z is an n × p contains the genotype data at a SNP, as well as any important covariates, such as age, sex, etc., for a total of p covariates including the intercept. Let A be the (unobserved) true ancestry matrix for q strata, with each entry representing the proportion of each individual's genome from each of the q strata. In the special case of no admixture, then A has binary entries. A standard linear model would be where β and α are coefficients and E are random errors. Now, A is unobserved, but we believe is represented by the q-d subspace generated by the ancestry scores V , then where C is a coefficient matrix representing "regression" of each of the q strata on the q ancestry scores. Following this assumption, where δ = Cα is a coefficient vector. An investigator has access to observed Z and V , and fits the final model. As long as (5.1) holds, then the regression is valid, even though we don't observe A directly.
The rationale above holds for generalized linear models, and thus includes logistic modeling, for example for case-control modeling. In practice, the mapping of V to A is not perfect, and even a relatively small error in prediction of A from V can degrade the covariate control effectiveness. However, (5.1) does serve as a rationale that the ability to linearly A from V is key, which can be summarized by multiple regression R 2 for each column of A from matrix V .

Individual scree plot illustration
The individual scree plots use loess smoothing of the individual scree values. Figure  [ Figure S2 about here.]

Gaussian simulations
We first illustrate the principles for our procedures using simple Gaussian simulations, so that the pure effect of individual-individual covariance structures can be demonstrated.
Each simulated dataset consists of p = 10, 000 markers and n = 800 individuals, including n S = 200 singletons, and n F = 600 individuals arising from 300 family pairs. An initial p × n error matrix E was first generated, reflecting family correlation structure but not population structure. For each singleton, the corresponding length-p column was generated as independent N (0, 1) entries. For the (arbitrary) first member of a family, the corresponding column e .1 was generated in the same manner, for each marker i as e i1 ∼ N (0, 1). Then the second member of the same family was generated for each i as e i2 = 0.5e i1 + i , where i ∼ N (0, .75). Thus finally all entries were marginally N (0, 1) and family pairs had correlation 0.5, corresponding to "first-degree" relatives.
Population structure was generated by simulating p-vectors P k ∼ N (0, .01), where k = 1, 2, 3, and randomly assigning each singleton and each family to a subpopulation with probability 1/3. A final matrix X was generated by adding, for each individual, the corresponding column of E to the appropriate subpopulation column vector from P , and X was again rowscaled.
We divided the family pairs into an arbitrary partition (set 1 vs. set 2), such that no members were related within a set. The various methods were represented as follows, and shown in Figure S3  Similar results can be observed for an unbalanced simulation, in which the total sample size is fixed at 800, but 100 families all belong to a single population stratum ( Figure S4).
Here the shrinkage is not as strong, as the number of family members is a smaller proportion of the total sample size.

Idealized simulations
We simulated genotype data in a manner that respected local correlation structure, which is present but typically modest in SNPs used for stratification control, and reflected population ancestry. A SNP "block size" of 20 was chosen. An autoregressive normal model was used to simulate a set of modestly underlying correlated values, e.g. for one individual the value for the ith SNP is Z i = ρZ i−1 + , where ∼ N (0, 1 − ρ 2 ), followed by reversal of sign of ρ with probability 0.5. Marginally, each Z i ∼ N (0, 1), and a modest ρ = 0.2 was used within each block and ρ = 0 at block boundaries, so that values across different blocks were uncorrelated. To convert the values to genotypes, we first generated random minor allele frequencies by drawing "ancestral" allele frequencies from the half-triangular distribution each SNP and individuals in subpopulation k with allele frequency q k , an inverse quantile of ranked z-values such that the lowest z values were converted to genotype 0, the largest to genotype 2, and genotypes 0, 1, and 2 occurred with frequencies (1 − q k ) 2 , 2q k (1 − q k ), and q 2 k (i.e. Hardy-Weinberg equilibrium within each subpopulation k).
[ Figure  For each subpopulation the number of haploid genomes is twice the number if individuals.
To create realistic "pure" subpopulations, we followed the basic approach of HAP-SAMPLE Wright et al. (2007), in which phased autosomal haplotypes were selected from the pool of 4000 haploid genomes, and subjected to an artificial recombination process to create new haploid genomes. The new haploid genomes were then used as unrelated individuals, or as founder individuals in each 7-individual family as described in the main text.
The artificial recombination process followed a simple 1Mb=1cM assumption, for an ap-cessive SNPs, crossover events are rare enough that this simple genetic map approach was effective and preserved the linkage disequilibrium structure, while adding novelty in terms of the genomes that could be generated. For each genome to be simulated, a target probability profile was selected, representing the proportion of the genome to be selected from each of the K = 20 subpopulations. For each simulated crossover event, the appropriate subpopulation for that portion of a simulated chromosome was chosen as a multinomial outcome, according to the target probability profile. Once the subpopulation was chosen, the particular haploid segment was chosen at random from among all haploids in the corresponding subpopulation pool.
For the pure subpopulation scenario, each unrelated individual and family founder was given a target probability of 1 for once of the K = 20 subpopulations, with a probability of 1/K for each subpopulation. The original individuals in the pool show some withinsubpopulation variation, to varying degrees depending on each subpopulation. However, the simulation process produces individuals with no such variation, except for stochastic sampling variability due to the haploid segments.
For the admixture scenario, a single subpopulation was chosen for each individual as above, and then a secondary subpopulation at random from among the remaining K − 1 subpopulations. The proportion of the genome from the first and second subpopulations was then assigned as 1 − p, p, respectively, where p was chosen from a beta(1,10) density (i.e. mean admixture =1/11, and about 10% of the individuals showed admixture greater than 20%).
The above approach was used to simulate all unrelated individuals, as well as the three founders in each 7-individual family. The remaining members of each family were simulated using the same artificial recombination process, but based on the haplotypes of their parents. Figure S1: Cross-correlations between genotype vectors of set S vs. F for the cystic fibrosis data. On each axis, a point represents a correlation between an individual in S to an individual in F, for a total of 2546× 898 points. Left panel: Cross correlations of X S × X F vs. cross correlations of X S × X F show modest deviation. Right panel: Cross correlations of X S × Y F vs. cross correlations of X S × X F show that the goal of covariance-preserving whitening is achieved. Figure S2: Several representative individuals, with noisy individual scree curves (grey), and corresponding loess fits with colors following the same scheme as in Figure 6 of the main manuscript. Figure S3: Illustration of family shrinkage for various methods, idealized normal data with 200 singletons (black) and 300 family pairs, divided into set 1 (gold) and set 2 (cyan). The first two eigenvectors are shown, with empirical 95% normal confidence ellipses for each subpopulation.