Gene‐based and pathway‐based testing for rare‐variant association in affected sib pairs

Abstract Next generation sequencing technologies have made it possible to investigate the role of rare variants (RVs) in disease etiology. Because RVs associated with disease susceptibility tend to be enriched in families with affected individuals, study designs based on affected sib pairs (ASP) can be more powerful than case–control studies. We construct tests of RV‐set association in ASPs for single genomic regions as well as for multiple regions. Single‐region tests can efficiently detect a gene region harboring susceptibility variants, while multiple‐region extensions are meant to capture signals dispersed across a biological pathway, potentially as a result of locus heterogeneity. Within ascertained ASPs, the test statistics contrast the frequencies of duplicate rare alleles (usually appearing on a shared haplotype) against frequencies of a single rare allele copy (appearing on a nonshared haplotype); we call these allelic parity tests. Incorporation of minor allele frequency estimates from reference populations can markedly improve test efficiency. Under various genetic penetrance models, application of the tests in simulated ASP data sets demonstrates good type I error properties as well as power gains over approaches that regress ASP rare allele counts on sharing state, especially in small samples. We discuss robustness of the allelic parity methods to the presence of genetic linkage, misspecification of reference population allele frequencies, sequencing error and de novo mutations, and population stratification. As proof of principle, we apply single‐ and multiple‐region tests in a motivating study data set consisting of whole exome sequencing of sisters ascertained with early onset breast cancer.

because tests involving sib pairs have been shown to be more powerful than testing an equivalent number of cases and controls (Epstein et al., 2015;Sha & Zhang, 2015;Teng & Risch, 1999;Zöllner, 2012). From a design perspective, comparisons using siblings provide a natural way to control for many potentially confounding covariates, both genetic and environmental.
Tests for association of rare variants (RVs) with binary traits using affected sib pairs (ASPs) treat the count of RV alleles as the outcome variable. The idea developed by Epstein et al. (2015) is that rare susceptibility alleles will appear more frequently on haplotypes shared identical by descent (IBD), compared to those not shared IBD. Thus, regressing the rare allele count in a region on the corresponding IBD information for that sib pair is one way of testing for association within the region. While this approach to analyzing the ASP design is shown to have good properties in reasonably large samples, our investigations of relationships between rare allele counts and haplotype sharing have led us to alternate, more refined test statistics. Rare alleles appearing in duplicate in a sib pair will very likely be shared IBD; single rare alleles, that is appearing only once, will certainly be nonshared. Similar reasoning to that above suggests that duplicate alleles should be enriched in susceptibility regions. In this report we demonstrate through extensive simulation studies that this alternative counting method leads to more powerful tests of association than regression on IBD. We develop two tests at the region level, and extend them to test at the pathway level. Overall, the aim of our approach is to increase power to detect weaker signals, such as medium to low penetrance variants clustered in a region; or very rare, family-specific mutations that operate through a shared disease mechanism (a pathway).

| METHODS
Assume we are testing a genomic region which has been filtered on minor allele frequency (MAF) information from population reference panels (e.g., 1000 Genomes, Exome Sequencing Project [ESP 6500], UK Biobank). This produces j = 1,2, …, R loci with rare alleles (e.g., defined as MAF < 0.1%). For a study of N families each with two affected siblings, define Q ij to be the number of copies of the rare allele at locus j for sibpair i, so that ∈ Q {0, …, 4} ij ; and define ∑ . Also, let Z ij denote the number of alleles shared IBD for sibpair i at locus j Z ( = 0, 1, 2) ij . We assume no recombination within a region, and for ease of notation, drop the subscript j from Z ij unless otherwise specified. Although the method we develop specifies families with two affected siblings, the analysis can accommodate families with more affected sibs, by including all pairs of siblings as separate ASPs. For application to datasets with many large sibships, valid variance estimation might entail an adjustment for familial correlation.
Initially, we are interested in testing for a signal in a single, contiguous genetic region. This case is most commonly assumed in the RV association literature, and often corresponds to testing at the gene level (Derkach, Lawless, & Sun, 2014;S. Lee, Abecasis, Boehnke, & Lin, 2014;Wu et al., 2011, and others). Gene-level testing reduces the multiple testing burden of marginal testing at each SNP. This benefit can be further extended if multiple genetic signals are captured in a pathway, that is a collection of genetic regions related by biological role or function; we discuss this subsequently. Epstein et al. (2015) model the dependence of • Q i on Z i , as summarized (in our notation) via the following regression equations:

| Epstein's test
which assume that rare allele counts have a different mean μ μ ( , ) 0 1 and variance σ σ . A brief summary of the derivation is provided in Appendix A.

| Allelic parity test
At the sib-pair level, we define , i N = 1, …, , which sum rare allele counts across a haplotype for single copy and duplicate variants, respectively. Here, we let μ j be the frequency of the rare alleles at locus j j R ( = 1, …, ) in the source population (assumed known, for now). Further, denote ∑ μ μ = j . We express the means and variances of ROMANESCU ET AL.
| 369 S i and D i in terms of the μ j , conditional on haplotype sharing, under the null hypothesis that there are no susceptibility variants in the testing region. These derivations are presented in Appendix B, and results are summarized in Table 1; here, τ τ l , , = 0,1,2 l D l S denote the conditional means of D i and S i , given etc.). Parameter k, which will be estimated from study sample data, is introduced to account for within-region linkage disequilibrium (LD) in the variance computation and acts as an overdispersion factor, that is arising from positive correlations between RVs within the region.
Under the null we expect no systematic differences between the MAFs in affected versus source populations; and μ j aff is the frequency of the rare allele at locus j in the affected population. Under the alternative of some variants in the region being penetrant, the ascertainment of the study sample will be reflected in a higher count of rare alleles in the region, that is ⋅ Although an exact quantification of such increase will depend on the genetic model-which is assumed unknown-it is nevertheless possible to make qualitative observations. In particular, while we expect an enrichment in both single and duplicate counts, the frequency of duplicate alleles will increase proportionately more than the frequency of single alleles. This occurs because siblings that share a susceptibility allele are more likely to be both affected, and hence ascertained into the study, compared to pairs where one sib is an affected carrier and the other is an environmental case, or where siblings carry different susceptibility alleles. Table 2 illustrates that the increase in D from the null to the alternative is greater than the increase in S. The numbers in each cell are expected sums of S i 's and D i 's over the entire sample under the null, stratified by IBD state. These are obtained by multiplying the means in Table 1 by the expected number of samples in each Z i category, that is by The shading in each cell signifies the expected increase in that count under the alternative compared to the null (darker shading means a higher proportional increase). With this setup, a test statistic for evidence of association has the general form are contrast weights in the comparison of the different Z i strata. As discussed above, one version of this test statistic contrasts the columns of Table 2 , which has a mean of zero at first order of μ j , assuming no RV association and no excess IBD sharing (Appendix B). Standardizing this expression leads to under the null. We call this the allelic parity statistic, because it is based on the parity relation for RVs under the null that expected counts of duplicates are half the counts for singles. The overdispersion k is estimated from data as where s S 2 and s D 2 are the sample variances of S i and D i , which reflect covariances among the RV loci, and provide robustness to within-region LD. Appendix B provides detailed derivations.
T A B L E 1 Means and variances for counts S i and D i conditional on identical by descent sharing (Z i ) Note: Expressions are accurate to second order in μ j . The μ j parameter is the MAF of the variant at locus j, which we assume to be known. Values can be determined from external reference population panels, and ideally the genomic panel closely matches the genetic characteristics of the source population for the ascertained ASPs. If, however, the general level of enrichment in all RVs across the genome is consistently and substantially elevated in the sample compared to reference panels, then a genome-wide correction may be necessary to account for systematic differences; we discuss what such a correction might be when we consider robustness to misspecification of MAFs.
We also formulate a version of T ap that is self-contained, in that it does not use externally supplied μ j . This follows from estimating the sum of μ j empirically by

| Weighted allelic parity test
Based on Table 2, it is possible to distill a more powerful test by contrasting only the strongest signal, that is the D i for Z = 1 i or 2, with the corresponding mean under the null of no RV association. Because the variances of D i in the two strata are different, to increase efficiency we apply inverse variance weights to the contribution of the strata (i.e., by the inverse standard deviation of D i given Z i ). The test statistic we obtain is where n Z =1,2 i stands for the number of sib pairs with Z = 1 i and 2, and k is computed as above.

| Pathway extensions
In the multiple region case, assume we have a collection of p different genetic regions (e.g., comprising a pathway), and each of these has R q rare variants after filtering, q p = 1,2, …, . Quantities S i and D i are defined in a similar way as above, but are now specific to a genomic region denoted by an extra subscript q q p , = 1, …, , that is S qi and D qi . Also let , that is the sums of these quantities across the entire pathway, for one sib pair. The multiple region allelic parity test statistic has a similar form as in the single region case, namely , obtained by similar reasoning (the notation Σ j q , is shorthand for the double summation in the previous formula). Note that s S 2 q and s D 2 q are computed as in the single region case, using all S D i N and , = 1, …, qi qi from region q. From this, the empirical version can be obtained similarly as above, For the weighted test, a multiple region statistic can be obtained by adding the contributions across regions as well as across families. Keeping in mind that the observed IBD sharing of a sibpair can change from one region to the next, the derivation is similar to the single region test, leading to the expression where the k used for computing T q p , = 1, …, , ap w q − is the one given immediately above.
We note that there is no pathway extension for Epstein's test; however, a simple approximation can be constructed by regressing allele counts on IBD state-we call this the "regression test"-and it can be easily extended to test a pathway. See Appendix A for details.

| Robustness of allelic parity statistics to linkage and LD
Under the null hypothesis of no RV association within a region, we expect evidence for excess IBD sharing in the region to be unusual, although it is possible that excess sharing might be observed when the test region is close enough to be in linkage with a common variant susceptibility locus, but far enough away that the RVs are not in LD with it. All three test statistics use the k parameter to account for within-region LD. In Appendix B, we show that linkage can inflate the allelic parity comparison but the bias will be negligible unless the set of RVs is exceptionally large. We conclude that T ap and T ap emp − are reasonably robust to linkage, but as a precaution, we recommend that IBD sharing estimates be examined for regions suspected to harbor susceptibility genes. On the other hand, the T ap w − statistic derives from conditional means and variances of D i given Z i , so does not depend on IBD sharing values and thus is fully robust to the presence of linkage. This advantage, however, may be countered by lack of relevance or imprecision of the external population frequency μ j values that can introduce bias into T ap w − . In a series of simulation studies reported in the next section, we compare validity and power of the affected sibpair RV test statistics of interest under various design parameters, and investigate robustness of T ap w − to MAF misspecification. Because all test statistics based on observed allele counts may be adversely affected by sequencing errors or the occurrence of de novo mutations, we also investigate robustness of methods to these practical issues. Finally, we evaluate the consequences of defining sets of RV with less rare MAF.

| Design
Starting with 594 European haplotypes from the 1000 Genomes Project, we simulate a genetic region to be tested for association; the region, of length 13.6 kb, is taken arbitrarily from chromosome 1. Because the minimum MAF that can be simulated using the samples from the 1000 Genomes European haplotypes is 1/594 = 0.17%, to generate variants that are more rare, we first filter variants on MAF < 0.2%, and then add a sufficient number of "noncarrier" families (i.e., families with haplotypes containing the wild type variant at all of the rare loci) to bring the MAF of the entire pool of parental haplotypes below the desired threshold of 0.1% for all variants. We generate families of parents with two offspring using R package "sim1000G," which assumes random haplotype pairing, random mating, and Mendelian inheritance (Dimitromanolakis, Xu, Krol, & Briollais, 2019).
Under the alternative hypothesis of RV association we generate age at onset for each individual offspring via a proportional hazards model with rate where h 0 (t) is the baseline hazard function, which we specify as Weibull, and t 0 is a minimum age of disease onset set to age 20. X is the individual-level genotype vector indicating carrier (1) or noncarrier (0) of the rare allele at each of the R rare loci. Among these, there are C susceptibility loci, where C represents 15% of all RVs in the region, chosen at random. The parameters β j , j = 1, …, R correspond to effect sizes for RVs in the region (so that β j > 0 for all of the C susceptibility variants, and β j = 0 for the R − C nonrisk variants). We draw a family from the population pool of size 500,000 families, and apply the PH model (1) to generate the age at onset for each of the siblings. The model (1) is implemented in R package "FamEvent" (Choi, Kopciuk, He, & Briollais, 2017), and returns the cumulative distribution function (cdf) of the age at onset for one individual. The onset age is simulated for each of the siblings independently as the inverse cdf computed at a uniform random variate, under the individual PH model. We define "affected" as disease onset before age 50, and ascertain a pair into the study if both siblings are affected. The procedure of drawing from the pool and ascertaining is repeated until the target sample size N is obtained. The distribution of observed rare allele counts per sib pair in a single region, which is heavily weighted toward counts of zero, becomes visibly heavier in the right tail following ascertainment under the alternative ( Figures S1 and S2).
Under the null scenario of no association, genotypes are simulated before ascertainment using "sim1000G" as detailed above, but none of the RVs are designated to be susceptibility loci in the region; effectively β = 0, j for all j R = 1, …, . This null is region-specific, not global; in practical application, affected families without any susceptibility alleles in the region could be environmental, or genetic cases arising at some other region. To improve computational efficiency for the intensive null simulations, we do not generate age at onset for the offspring, but instead automatically ascertain families into the study. This is correct because phenotype is independent of genotype under the null hypothesis of no RV association in the region.
Under a pathway scenario, we generate RV genotypes in two genetically independent regions 1  and 2  on chromosomes 1 and 3, which are assumed to form a functional pathway. Extending the single region approach, there are C 1 and C 2 different risk variants in each region, representing 15% of RVs in each region, respectively. Their joint effect is captured through the function • g ( ) in the genetic model: . Under the null hypothesis, there is no RV association in either region. Under the alternative, we consider two different genetic architectures. In an additive model suggested in P. I. Lin, Vance, Pericak-Vance, and Martin (2007), the effects of deleterious alleles are added across regions, although it is rare for one family to carry more than one such RV. In the epistatic model of Marchini, Donnelly, and Cardon (2005), rare susceptibility alleles are required at both genes for loss of function to occur (in this case, one gene acts as a "modifier" to the other). Since this scenario occurs rarely, we are more permissive with the MAF filtering to obtain a visible effect.
To compare performance of the association tests, we apply them in each data set generated under a null or an alternative genetic model; replicated datasets are drawn independently under single-region and multiple region mechanisms, for combinations of four study sizes N (20, 100, 500, and 1,000 families), and various effect sizes (Table 3). Going forward we drop the original version of the allelic parity test and only include the empirical and weighted versions. We found the original version to have similar performance characteristics to the empirical version, but the additional requirement to specify allele frequencies, as well as lower power compared to the weighted version makes its use less appealing.

| Validity and power
To assess type I error control, the observed p values (−log 10 transformed) for each test are plotted in Figure 1 versus those expected under the null, for N = 100 and 1,000, using 100,000 replicates. We see that the test size is well controlled. Plots for N = 500 show similar behavior; for N = 20 the empirical allelic parity test is conservative in the tail, however, the weighted version works well ( Figure S3). For power calculations, we employ 10,000 replications, and estimate power as the fraction of tests that reject the null at level α, for data sets generated under the alternative models specified in Table 3. Power curves for a sample size of 500, evaluated at significance criteria α = .05 and .0005 ( Figure 2) show that the allelic parity test-especially the weighted version-is more powerful by a factor of 2-10 compared to regressionbased tests. A more dramatic display of power differentials occurs at stricter significance levels (Figure 3), where the ratio increases with decreasing α. We observe that test rankings according to power do not depend on sample size ( Figure S5).
Results for two-region pathway testing are similar to single-region testing under the null and additive models ( Figures S4 and S6). Unsurprisingly, the power is higher in general for pathway testing compared to region testing T A B L E 3 Genetic models used to generate ascertained datasets in the power simulations

Model
Description Simulation settings Population MAF

<0.001
Multiple region, Epistatic model | 373 ( Figure 2). In particular, it is encouraging to see that a pathway with highly penetrant variants (HR = 8) can be detected in a sample as small as 20 sib pairs, with power just above 50% ( Figure S6). Under the epistasis model, power is generally low, as expected. Still, the weighted allelic parity test performs visibly better than the other tests ( Figure S7). Finally, all results shown in the text refer to one-sided tests. This is sensible at the genome-wide level when testing single regions for deleterious variants. It is also possible to perform two-sided tests, if we have reason to believe that in certain regions, RVs could be primarily protective.

| Robustness
We perform additional simulation studies to evaluate practical consequences of misspecification of reference population parameters, sequencing errors and de novo mutations, as well as sensitivity to rare variant criteria.
F I G U R E 2 Power curves for testing at α = .0005 (left) and α = .05 (right) for a sample size N = 500, and 10,000 replicated datasets.
Results for single region (top panels) and two-region pathway under the additive model (bottom panels). The horizontal black lines represent the significance threshold α F I G U R E 3 Power of single region testing versus significance threshold for medium penetrance variants (HR = 4) for sample size N = 500 sib pairs, and 100,000 replicated data sets (1) Misspecified external MAF estimates (the μ j 's) in the weighted allelic parity test. These evaluations generate random errors for the reference population MAFs, with random μ j 's drawn independently from an exponential distribution under three scenarios. The exponential mean is taken to be, in turn, underestimated (by a factor of 2) compared to the true MAF used in the prior simulations, equal, or overestimated (by a factor of 2). As might be expected, the test is liberal for under-estimated MAFs and conservative for over-estimated MAFs ( Figure S8 and Table S1). For unbiased MAFs, type I error is well controlled for sample size N = 100, but becomes liberal with larger N. We suggest two relatively simple remedies to deal with misspecified μ j 's. The first is to use an adjusted null distribution, which is similar to the concept of an "empirical null" distribution from Efron (2004) (See Supporting Information Methods for details). Application of this strategy to the simulated misspecified test statistics yields an obvious improvement in performance. Type I errors become well controlled using the empirical null approach in the under-estimated and unbiased scenarios, and only slightly conservative for over-estimated μ j , and power estimates also become close in all three scenarios ( Figure S8 and Table S1). An alternative remedy is to scale the μ j estimates by a common factor so that they become unbiased in distribution. This approach is suitable when an empirical null distribution may not be available, such as when performing only one or a few tests. We show an example of using a scaling factor in the Application section.
(2) Sequencing errors in next generation sequencing and de novo mutations. Sequencing errors can cause base substitutions, which will appear as rare variants in the data used for analysis, and induce inaccuracies in RV tests if the error rate is high enough. We simulate sequencing errors and add these extra "rare alleles" to the genetic data, for prespecified error rates of 10%, 25%, and 50% among the observed rare variants. As a quality control step, we flag as errors and remove all alleles that contradict the sharing state for that particular locus and sib pair; this is only possible for an IBD state of 2 and an odd valued Q ij . De novo variants similarly add rare alleles that are not part of Mendelian inheritance. The same simulation setup is applicable to de novo mutations, except that, because they are quite rare (perhaps only 30 per genome), they have a less material impact. For all tests, Type I error and power (Table S2) decrease with increasing error rate. The empirical allelic parity test is the most affected by errors and the weighted test is least affected. This is sensible, since errors will inflate the S i counts of single alleles, whereas the D i 's would require an error to occur at the site of an pre-existing rare allele, which is less likely. We note that even though all tests become conservative, the weighted allelic parity test retains high power (above 95%) even for an error rate of 50%. We recommend this test for use in the presence of sequencing errors. We also note that bioinformatics tools may be able to weed out common sequencing errors; for instance, Ma et al. (2019) report an approach that can dramatically reduce the A > T substitution error rate in deep sequencing data.
(3) Comparative test performance for low frequency variants. The development of our methods was motivated by the aim of uncovering very rare variants (MAF < 0.1%), but the tests can be applied with more common variants as well (e.g., low frequency variants). Type I error simulations for MAF < 3% and MAF < 5% show that, compared to the regression tests which retain close to nominal type I error control, the allelic parity tests are conservative, but not extremely so. All tests tend to have higher power for low frequency than for rare variants, as expected under a simplistic simulation model, where the HR is constant at all MAFs (Table S3). The allelic parity tests have the highest power, with the weighted version being the most powerful. It may be possible to improve type I error control for the weighted version by including higher order terms (O μ ( ) j 3 and higher) in the expressions for mean and variance of D i ; this investigation is reserved for future work.

| Software and code
A function that implements the tests with an example data set, as well as the code files used in the simulation studies are included as Supporting Information Material. The function is for general use, and can be run in R. The simulation code is intended only for the purpose of replicating the results in this paper, and runs in a multicore Unix environment. member is diagnosed at a young age (before 45 years). However, known genes with variants predisposing individuals to hereditary BC explain less than 50% of disease clustering within families (Easton et al., 2015;A. Lee et al., 2019). The motivating data set is a pilot study of whole exome sequencing (WES) in ASPs with a family history of cancer and early-onset in at least one sibling.
The median age at diagnosis is 45 years, and all but one family have one sib diagnosed before age 45. The ASPs, recruited from the Ontario Familial Breast Cancer Registry (John et al., 2004;Terry et al., 2015), had been screened negative for known mutations in susceptibility genes (including BRCA1/2 and CHEK2*1100delC variants), thereby increasing the chances of finding rare familial mutations; all families except for one were classified as Caucasian. The pilot data set included 37 individuals from 17 families (14 pairs and three triplets). We count triplets as three pairs, yielding N = 23 observations at the ASP level.
In total, 251,931 variants were annotated with MAF information obtained from three reference panels: 1000 Genomes Project n ( = 1,092; all populations), Exome Sequencing Project (n = 6,500), and UK Biobank (n = 500,000). Variants were deemed rare if they appeared in at least one of the panels (using the entire populations for improved precision), and if the maximum MAF from these references was no greater than 0.5%. As a QC step, rare variant loci that were missing in more than a few (4) families were excluded, otherwise missing genotypes were imputed to be the rare allele. Other standard QC procedures followed Genome Analysis Toolkit Best Practice recommendations, and included haplotype calling, variant recalibration, conversion to human genome version hg19, annotation by ANNOVAR, and filtering on read depth and quality. This resulted in 18,035 rare variants that passed quality control, annotated to 9,572 genes (the number of RVs per gene ranged from 1 to 69, with mean 1.9).
We specified the population parameters μ j , used in the weighted allelic parity test, as the median MAF at locus j across the three reference panels. However, we observed that the samples were enriched in rare variants across the exome, in comparison to the allele frequencies in the panels. Therefore, we applied a simple multiplicative genome-wide adjustment factor of 10.1 chosen to match the panel frequencies cumulated at the gene level to the observed frequencies, (see Figure S8 for the details of the calculation). This rescaling amounts to converting an over or underestimated scenario to an unbiased one, which is closest to nominal performance, as per the simulations in Section 3.3, part 1. We note that we used the entire UK Biobank data which includes RVs imputed from genotype data, and that a similar enrichment in exome RVs was found in an exome sequenced subset of the UK Biobank, compared with the entire panel MAFs (imputed from genotype data). Van Hout et al. (2019) report a >fourfold increase in coding variants, and >10-fold increase in loss-of-function variants identified in WES compared with imputed data, with rare variants accounting for the vast majority of this increase.
To determine the IBD sharing in each sib pair, we analyzed 102,322 common autosomal variants (MAF > 0.10) using the multipoint algorithm implemented in MER-LIN (Abecasis, Cherny, Cookson, & Cardon, 2001). Sex-averaged linkage map positions were downloaded from Rutgers University's Map Interpolator. IBD estimates were obtained on genomic segments ("clusters") defined adaptively so that R 2 among any two SNPs in a cluster is more than 0.1 (Abecasis & Wigginton, 2005;Abecasis, n.d.); this improves stability and accuracy of IBD sharing estimates in the absence of parental data. Finally, pairwise IBD sharing estimates for ASPs in each family were obtained on 6,899 clusters spanning chromosomes 1-22.
To illustrate single gene and pathway testing, we aimed to validate a known BC-related functional pathway-DNA repair. If successful, this might help identify previously unreported variants within this pathway as potential hereditary mutations. Pathway information was taken from Dexheimer (2013) and includes 84 genes known to be involved in the various mechanisms of molecular DNA repair; 41 of these genes had at least one RV, hence could be tested. Table 4 reports seven genes with the top p values for the weighted allelic parity tests. This test has the smallest p values among the tests considered, and the top two genes, BLM and MLH1, reach significance accounting for multiple testing (at level 0.05/41 = 0.0012). For pathway level analysis, we first tested the whole DNA repair pathway, and then we tested its component pathways, each having a different biological role in DNA repair (Table 5). The p value for testing the entire DNA repair pathway (significant at the 5% level) is smaller than the p values for each of the component sub-pathways; it is also smaller than the p value of the top gene (MLH1), suggesting aggregate testing can be effective. This confirms our intuition that the signal is dispersed throughout the pathway, and shows that multiple region testing can provide information not captured with gene-level testing.

| DISCUSSION
In this communication, we consider the problem of discovery of rare variants in a sample of ASPs. Our methodological findings make headway in two directions: first, we develop powerful testing methods for this particular study design at the region level. Second, we extend these methods for use at the genetic pathway level. The allelic parity test is novel, to our knowledge, and offers important advantages compared to the other methods considered. It has good type I error properties, and the weighted version can be more powerful than the other tests as evident in all simulation scenarios considered. The power advantage comes at the price of sensitivity to the accuracy of the external RV frequency values, but we propose that this can be remediated by use of an empirical null distribution method. Moreover, we find good robustness to sequencing errors and de novo mutations, as well as to rare variant criteria.
The performance of the allelic parity methods over tests that regress allele count on IBD state can be explained by the fact that allele parity counting (whether alleles appear as singles or duplicates) is a better discriminator between susceptibility and null regions at the sib pair level, compared to IBD state. This is illustrated graphically in Figure S2 in the Supporting Information, which plots allele enrichment under the null and alternative. The regression of counts versus IBD goes from a slope of zero (under the null) to a positive slope (under the alternative), and this is captured by the regression tests. However, a simple linear regression cannot capture the fact that, when IBD is 1, the ratio of duplicate alleles to single copies (i.e., D S 2 : i i ) also increases (>1), which is extra information used by the allelic parity test. Also notable is the general enrichment in rare alleles for all IBD sharing states. This is missed by all tests except for the weighted allelic parity, which compares counts against a baseline level, supplied externally.
We expect that the allelic parity tests we propose will be robust to confounding by population structure or environmental factors, with some caveats. The empirical test compares double and single allele counts within each sibpair and sums up this difference, which is strictly a within-family comparison and therefore robust to population stratification. For environmental exposures shared by the sibpair, the empirical version will be similarly robust. However, a need remains for evaluation of T A B L E 4 Top hits for genes in DNA repair pathways (p value (ap-w) < 0.1), rows are ordered by p value of the allelic parity-weighted test extensions that can account for individual-specific risk factors such as age at menarche. For the weighted version which incorporates a population comparison, robustness to population stratification requires that the external allele frequencies accurately reflect the population structure of the sample families. This means that frequencies should be obtained for each population group, after which a pooled μ j estimate would be computed with weights chosen to match the genetic diversity represented in the sample. As larger more accurate reference population panels are becoming available, it is increasingly feasible to closely match samples to their background population MAFs. With this setup, the denominators in T ap w − could be expressed as aggregate differences within ancestry groups, provided that the same variance in the denominators can be used across groups. With a large enough sample, one could relax this assumption and attempt to standardize the D i 's using different variance estimates for different ancestry groups. The weighted version would likely not be robust to other confounders, but it may be possible to incorporate relevant covariates into this and other test statistics, and further work to investigate such extensions is warranted.
Testing at the pathway level can be informative, especially when small to moderate effects are distributed across functional pathways, a setting in which it would be impossible to detect association at the single region level without a very large sample. Because testing at the pathway level will inevitably include a large number of null variants in the statistics, the signal in a pathway should be rich enough overall, and distributed broadly enough for the tests to be successful at detecting it. Besides power, the other benefit of pathway testing is that it can offer functional insight into the etiology of disease, beyond what a single gene might indicate. Once a pathway has been identified and validated, it follows naturally to examine each component gene (or RV) separately, to gain a deeper understanding of how the pathway operates as a network.
Beyond methodological improvements, implications for study design deserve to be brought to the forefront. Previous authors have reported that the affected sibling design is more cost effective than case/control studies (Epstein et al., 2015;K. H. Lin & Zöllner, 2015;Zöllner, 2012). In particular, for single gene testing, Epstein et al. (2015) demonstrate a twofold power gain for sib pair testing (500 pairs) compared to case-control comparisons (500 each), on average over different effect sizes, and assuming that shared environmental and other genetic factors between sibs do not have a very strong effect on diagnosis. It stands to reason then, since our best test is routinely 2-10 times more powerful than Epstein's, that even under conservative scenarios applying it with an ASP design is likely to compound the power gains compared with case-control. A practical limitation of the ASP design is the availability of ASPs for sequencing. However, at least for studies in which the barrier is cost of sequencing rather than availability of subjects, the proposed test should be of significant interest to investigators looking to detect novel rare variants.   V 0 , V 1 , and V 2 are the sample variances for the counts of rare alleles possessed by affected sib pairs (ASPs) sharing 0, 1, or 2 alleles IBD, and are computed directly from data.

Regression test
We also propose a simpler version of Epstein's test that assumes only one variance parameter (instead of two). This corresponds more closely to standard regression, and is preferable when data are insufficient to estimate sample variances V 0 , V 1 , and V 2 , for example, when N is small, and MAF is low. For a single (contiguous) region, consider the regression The test statistic for β = 0 versus ≠ β 0 is asymptotically normal