Summary
 Top of page
 Summary
 Introduction
 Methods
 Simulation Implementation and Results
 Example from the Hispanic Health and Nutrition Examination Survey with Generated Genotype Data
 Discussion and Conclusion
 Acknowledgements
 References
In population-based household surveys, for example, the National Health and Nutrition Examination Survey (NHANES), blood-related individuals are often sampled from the same household. Genetic data collected from national household surveys are therefore often correlated through two levels of clustering: one induced by the multistage geographical cluster sampling, and the other induced by biological inheritance among multiple participants within the same sampled household. In this paper, we develop efficient statistical methods that account for the weighting effect induced by the differential selection probabilities in complex sample designs, as well as the two clustering (correlation) effects described above. We examine and compare the magnitude of each level of clustering effect under different scenarios and identify the scenarios under which the clustering effect induced by one level dominates the other. The proposed method is evaluated via Monte Carlo simulation studies and illustrated using the Hispanic Health and Nutrition Examination Survey (HHANES) with simulated genotype data.
Introduction
Household medical examination surveys from various countries, such as the Canadian Health Measures Survey (Tremblay et al., 2007), the Health 2000 Survey from Finland (Heistaro, 2008), and the US National Health and Nutrition Examination Survey (NHANES, 2011a), have collected blood from which DNA can be extracted for genetic analyses. For example, the Third National Health and Nutrition Examination Survey (NHANES III), a nationally representative survey of the U.S. population conducted by the National Center for Health Statistics (NCHS), has genotyped candidate genes for participants 12 years and older. NHANES III employed a complex sample design using stratified multistage cluster sampling to select participants (Moonesinghe et al., 2010; NHANES, 2011b), and blood-related individuals are often sampled from the same household. On average, 1.6 persons were sampled per household, and sampled persons living in the same household may be blood related (Katki et al., 2010).
Testing Hardy-Weinberg Equilibrium (HWE) of marker genotype frequencies has been widely recommended as a crucial step in genetic association studies. Several methods have been developed to test HWE in simple random samples (SRS) (Guo & Thompson, 1992; Weir, 1996; Ayres & Balding, 1998; Shoemaker et al., 1998; Montoya-Delgado et al., 2001; Wigginton et al., 2005). In genetic studies with SRS of families, a method for testing HWE has been developed that accounts for genetic correlation between related family members (Bourgain et al., 2004) to improve statistical power. Genetic data collected from national household surveys are often correlated due to multistage geographical cluster sampling, as well as possible biological inheritance within the household. For example, in NHANES, counties or cities are sampled at the first stage, with nested cluster sampling at the later stages. In addition, sample weights that incorporate differential sampling rates (e.g., minorities are sampled at higher rates than Caucasians in NHANES III), adjustments for nonresponse, and calibration to census totals are used to obtain unbiased estimates of allele frequencies. Moonesinghe et al. (2010) and Li & Graubard (2009) proposed several tests that take account of the sample weighting and geographical clustering resulting from complex sample designs. However, they did not specifically consider the clustering effect due to biological inheritance within the family. She et al. (2009) took account of both levels of clustering for testing HWE. However, their methods do not allow for differential selection probabilities within the family, which commonly occur in population-based household surveys. For example, in NHANES III, adolescents aged 12–19 years and persons aged 60 years and older are sampled at higher rates within a sampled household. In addition, only one type of familial relationship, i.e., two parents and one offspring, was considered by She et al. (2009).
In this paper, we provide efficient statistical methods for addressing two levels of clustering (correlation) effects: (1) the correlation due to the multistage geographic cluster sampling induced by complex sample designs and (2) the correlation due to biological inheritance among individuals sampled in the same household. These methods also address the sample weighting that reflects the differential probabilities of selecting households and persons within the sampled households. We examine and compare the relative magnitudes of the correlation effects from the geographical clustering and from the related family members under different scenarios, and further identify the scenarios under which the correlation effect induced by one dominates the other. The statistical methodology is described in the next section. We evaluate the performance of the new tests in terms of type I error rate and power via Monte Carlo simulation studies. We illustrate our methods using the Hispanic Health and Nutrition Examination Survey (HHANES) (NCHS, 1985) with simulated genotype data. Even though HHANES did not genotype its blood samples, this survey, unlike NHANES III, provides documentation about within-family sample weights, which allowed us to apply one of our proposed methods.
Methods
We consider household surveys with stratified multistage cluster sample designs such as that used by NHANES. These sample designs can be described briefly as follows. The population of individuals is subdivided into disjoint primary sampling units (PSUs), usually based on the geographic location of residence. For example, PSUs can be small cities, counties, or contiguous cities/counties. The PSUs are grouped into strata so that they are approximately homogeneous with respect to certain demographic and geographic characteristics. At the first stage of sampling, a random sample of PSUs is selected from each stratum. At the second stage, smaller geographical units, so-called secondary sampling units (SSUs), are randomly sampled from the sampled PSUs. Households/families are then randomly selected from the sampled SSUs, and at the final stage, individuals are randomly selected from the sampled households/families. For each sampled individual, the inclusion probability is the product of the probabilities of sampling the units at each stage, and the corresponding sample weight is defined as the inverse of the inclusion probability. In most surveys the sample weights also involve adjustments for nonresponse and poststratification. The sample weight can be interpreted as the number of people in the population represented by the sampled individual.
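As a small illustration of the weight definition above, the following Python sketch computes a base sample weight as the inverse of the product of stage-wise selection probabilities. The function names and the four stage probabilities are hypothetical, and the sketch ignores the nonresponse and poststratification adjustments that real survey weights include.

```python
from math import prod


def inclusion_probability(stage_probs):
    """Overall inclusion probability: product of the stage-wise selection probabilities."""
    if not all(0 < p <= 1 for p in stage_probs):
        raise ValueError("each stage probability must lie in (0, 1]")
    return prod(stage_probs)


def base_weight(stage_probs):
    """Base sample weight: inverse of the overall inclusion probability."""
    return 1.0 / inclusion_probability(stage_probs)


# Hypothetical selection probabilities for PSU, SSU, household, and person.
stages = [0.04, 0.25, 0.10, 0.50]
print(base_weight(stages))  # roughly the number of people this respondent represents
```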
Let there be H strata, with I_{h} PSUs in the hth stratum. Within PSU hi, for i = 1, 2, …, I_{h}, data are collected on J_{hi} families, with K_{hij} individuals selected in the jth family for j = 1, 2, …, J_{hi}. Consider a locus with a distinct alleles. Define M = a(a+1)/2, the number of possible distinct genotypes. The data are collected on the vector of variables y_{hijk} = (y_{hijk.1}, …, y_{hijk.g}, …, y_{hijk.M−1}) for each sampled individual, where y_{hijk.g} equals 1 if individual hijk has genotype g and 0 otherwise, for g = 1, …, M−1.
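The genotype coding just described can be sketched in Python as follows. The ordering of the M genotypes (and hence which genotype is the omitted last category) is an assumption made here for illustration; the text does not fix an ordering.

```python
from itertools import combinations_with_replacement


def genotypes(a):
    """All M = a(a+1)/2 unordered genotypes for a locus with alleles 1..a."""
    return list(combinations_with_replacement(range(1, a + 1), 2))


def indicator_vector(genotype, a):
    """Length M-1 indicator vector y; all zeros when the genotype is the last one.

    Assumption: genotypes are ordered as returned by genotypes(a).
    """
    g = tuple(sorted(genotype))
    return [1 if g == ref else 0 for ref in genotypes(a)[:-1]]


print(len(genotypes(3)))            # M for a = 3 alleles
print(indicator_vector((2, 1), 2))  # heterozygote at a biallelic locus
```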
To characterize the departure from HWE, we follow Bourgain et al. (2004) and consider a model based on a fixation coefficient (the correlation between any pair of alleles within individuals). This model has a distinct parameters: the (a−1) independent allele frequencies p = (p_{1}, …, p_{l}, …, p_{a−1}) and the fixation coefficient r with the following constraints on these parameters,
Please note that the frequency of the last allele is p_{a} = 1 − (p_{1} + … + p_{a−1}). Define a parameter vector θ = (p, r). Under the null hypothesis of HWE, the fixation coefficient is r = 0, and p is a nuisance parameter to be estimated.
Define E(y_{hijk}) = μ_{hijk} with μ_{hijk} = (μ_{hijk.1}, …, μ_{hijk.g}, …, μ_{hijk.M−1}). If genotype g is homozygous (e.g., l/l), we have μ_{hijk.g} = (1−r) p_{l}^{2} + r p_{l}; if genotype g is heterozygous (e.g., l/l′, where allele l differs from allele l′), μ_{hijk.g} = 2(1−r) p_{l} p_{l′}. The estimating equations for the parameters θ, assuming independence of the family members within each family, are given by
 (1)
where w_{hijk} is an (M−1) by (M−1) diagonal matrix whose diagonal elements are the sampling weight associated with individual hijk. Estimating equation (1) assumes that the genotypes of sampled individuals are independent and does not take into account that the genetic information conveyed by family members is correlated. Accordingly, estimating equation (1) can be modified as
 (2)
where and are vectors representing values over family members k = 1, 2, …, K_{hij}, and represents a block-diagonal matrix whose block matrices are .
The first derivatives of with respect to and r are, respectively,
and
In practice, relationships between two individuals in the family can vary, e.g., parent-offspring, half siblings, full siblings, etc. Families with the same familial relationships among sampled individuals will have the same covariance matrices. For example, all the families consisting of one parent and two full siblings will have the same covariance matrix (3).
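To make the mean model concrete, the following Python sketch evaluates genotype probabilities under the standard fixation-coefficient parameterization, in which a homozygote l/l has probability (1−r) p_l^2 + r p_l and a heterozygote l/l′ has probability 2(1−r) p_l p_l′. The function name and the 0-based allele indexing are illustrative choices, not notation from the paper.

```python
def genotype_probs(p, r):
    """Genotype probabilities under a fixation-coefficient model.

    p: list of all allele frequencies (must sum to 1); r: fixation coefficient.
    Returns {(l, m): probability} with l <= m, using 0-based allele indices.
    """
    a = len(p)
    probs = {}
    for l in range(a):
        # Homozygote l/l: HWE term shrunk by (1 - r) plus excess homozygosity r * p_l.
        probs[(l, l)] = (1 - r) * p[l] ** 2 + r * p[l]
        for m in range(l + 1, a):
            # Heterozygote l/m: HWE term reduced by the factor (1 - r).
            probs[(l, m)] = 2 * (1 - r) * p[l] * p[m]
    return probs


hwe = genotype_probs([0.5, 0.5], 0.0)        # r = 0 recovers HWE proportions
inbred = genotype_probs([0.5, 0.3, 0.2], 0.1)
print(hwe, sum(inbred.values()))
```

Setting r = 0 gives the usual Hardy-Weinberg proportions, and for any valid r the probabilities sum to one, which is a quick sanity check on the parameterization.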
To test the null hypothesis H_{0} of r = 0, we propose a quasi-score test statistic. Let denote the solution to , where is the first vector in the estimating equations (2) and is partitioned in the same way as . Under suitable conditions (Binder, 1983; Rao et al., 1998), a quasi-score test statistic,
 (7)
is asymptotically a χ^{2} distributed variable with (U−H) denominator degrees of freedom, where is a consistent estimator of , i.e., the covariance of , and U = I_{1} + … + I_{H}, the total number of sampled PSUs.
To estimate a resampling method, such as jackknife or balanced repeated replication (BRR), is particularly attractive in the case of stratified multistage sampling because poststratification and unit nonresponse adjustment can be automatically taken into account through the use of appropriate replicate weights (Rust & Rao, 1996). For example, a jackknife estimator of under stratified multistage sampling with I_{h} PSUs in the h^{th} stratum is given by
Here is obtained in the same manner as when the data from the cluster(hi) are deleted, but using jackknife weight (Korn & Graubard, 1999). Alternatively, along the similar derivation discussed in Rao et al. (1998), we can derive a Taylor linearization variance estimator
where is the PSU(hi) total with
and
evaluated at , and is mean of the PSU totals in stratum h.
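The delete-one-PSU jackknife mentioned above can be sketched in Python as follows. This is a generic textbook form, not the authors' implementation: it assumes the usual replicate weights (weights in the deleted PSU set to zero, weights of the remaining PSUs in the same stratum multiplied by I_h/(I_h − 1)) and, as a simple stand-in for the estimating-equation solution, uses a weighted mean of a per-person value such as the proportion of A alleles.

```python
def weighted_mean(records):
    """records: list of (stratum, psu, weight, value) tuples."""
    total_w = sum(w for _, _, w, _ in records)
    return sum(w * v for _, _, w, v in records) / total_w


def jackknife_variance(records, estimator=weighted_mean):
    """Delete-one-PSU jackknife variance for a stratified multistage sample."""
    psus = {}
    for h, i, _, _ in records:
        psus.setdefault(h, set()).add(i)
    theta_hat = estimator(records)
    var = 0.0
    for h, ids in psus.items():
        n_h = len(ids)
        if n_h < 2:
            raise ValueError("each stratum needs at least two sampled PSUs")
        for i in ids:
            replicate = []
            for hh, ii, w, v in records:
                if hh == h and ii == i:
                    continue  # drop every record from the deleted PSU
                if hh == h:
                    w = w * n_h / (n_h - 1)  # reweight the rest of the stratum
                replicate.append((hh, ii, w, v))
            var += (n_h - 1) / n_h * (estimator(replicate) - theta_hat) ** 2
    return theta_hat, var


# Toy data: one stratum, two PSUs, one person each (value = proportion of A alleles).
toy = [(1, 1, 1.0, 0.0), (1, 2, 1.0, 1.0)]
print(jackknife_variance(toy))
```

Any estimator that accepts the same record list can be passed in place of `weighted_mean`, which mirrors how replicate weights let the whole estimation procedure be repeated on each replicate.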
Simulation Implementation and Results
We employ two sampling designs for randomly selecting 100 PSUs from the total of 2500 PSUs: (1) simple random sampling (SRS) and (2) probability proportional to size (PPS) sampling. For the selection of family members within the family, stratified SRS (SSRS) is used with strata defined by (1) familial relationship (parent or offspring), denoted by SSRS(F), or (2) both genotype (aa or non-aa) and familial relationship, denoted by SSRS(GF). Specifically, under SSRS(F) we select three family members from each of the 40 families in the sampled cluster. The family relationships among the three members are varied: one parent and two offspring (1P2O), two parents and one offspring (2P1O), and three offspring (3O).
For purposes of comparison, in addition to our test statistic proposed in (7), denoted by TS_{1}, we consider five tests, denoted by TS_{2}–TS_{6}, with varying estimators of the covariance for family hij and of the covariance of the score with respect to the fixation coefficient r, . Specifically, in TS_{2} ignores the covariances between any pair of family members, for example, the off-diagonal blocks in matrix (3) are replaced by zero matrices; while in TS_{3} ignores the covariances between any pair of family members as well as within each of the family members, for example, matrix (3) is an identity matrix. Compared to TS_{1}–TS_{3}, tests TS_{4}–TS_{6} estimate without considering the correlation from the selection of families. In other words, is estimated by the last component of the diagonal of the inverse of the information matrix. Note that TS_{4} can be regarded as the weighted version of the test proposed by Bourgain et al. (2004), which considers the correlation from the genetic inheritance within the family but ignores the correlation induced by hierarchical clustered sampling of families. Under designs without differential sample weighting, e.g., common weights of one under SRS, TS_{4} is equivalent to the test by Bourgain et al. (2004).
The working covariance matrix for each family depends on both the family size and the familial relationship of the selected family members. For example, if family size K = 1, then the covariance matrix will be matrix (4); if K = 2, there are three possible relationships: (a) two parents, (b) one parent and one offspring, and (c) two full siblings. All three covariance matrices have common diagonal blocks (4). The off-diagonal blocks for the three covariance matrices are, respectively, zero matrices, (5), and (6). For example, for relationship (b), consisting of one parent and one offspring, the covariance matrix is given by
If K = 3, there are three possible relationships: (a) Parent1-Parent2-Offspring, (b) Parent-Offspring1-Offspring2, and (c) Offspring1-Offspring2-Offspring3. The diagonal blocks for all three covariance matrices will be matrix (4), while the off-diagonal blocks will depend on the relationships, being zero matrices, (5), and (6), respectively. For example, for the relationship Parent-Offspring1-Offspring2 the covariance matrix is given by (3).
In the first simulation study, we select 100 PSUs by SRS. Within the family, SSRS(F) is applied for the selection of the family members, who have the familial relationship of 2P1O, 1P2O, or 3O. Table 1 presents the results with varying familial relationships at nominal levels of 5% and 1%. We observe that the sizes of TS_{1}–TS_{5} maintain the nominal level across different familial relationships. Since genotypes among families are independent, there is no first-level clustering effect from sampling families. Accordingly, in the variance estimation, only the second level of correlation, from genetic inheritance within the family, needs to be considered. As a result, TS_{4} maintains the nominal level. TS_{3}, which considers only the first-level correlation, also maintains the nominal level. TS_{6}, which ignores both levels of correlation, has inflated type I error rates across different familial relationships.
Table 1. Results of six HWE tests for SRS sampling of independent families with varying familial relationships under H_{0}: r = 0 at nominal levels of 5% (1%).
Test statistic  |  mean(p̂_{A})  |  VarRatio  |  Rejection rate at nominal level of 5% (1%)
2 parents and 1 offspring (2P1O)
TS_{1}  |  0.50  |  0.98  |  0.045 (0.010)
TS_{2}  |  0.50  |  0.98  |  0.045 (0.010)
TS_{3}  |  0.50  |  0.98  |  0.045 (0.011)
TS_{4}  |  0.50  |  0.99  |  0.044 (0.009)
TS_{5}  |  0.50  |  0.99  |  0.044 (0.009)
TS_{6}  |  0.50  |  0.56  |  0.085 (0.021)
1 parent and 2 offspring (1P2O)
TS_{1}  |  0.50  |  0.94  |  0.048 (0.008)
TS_{2}  |  0.50  |  0.94  |  0.045 (0.008)
TS_{3}  |  0.50  |  0.94  |  0.045 (0.009)
TS_{4}  |  0.50  |  0.93  |  0.049 (0.007)
TS_{5}  |  0.50  |  0.94  |  0.046 (0.006)
TS_{6}  |  0.50  |  0.53  |  0.081 (0.021)
3 offspring (3O)
TS_{1}  |  0.50  |  0.99  |  0.049 (0.012)
TS_{2}  |  0.50  |  0.99  |  0.051 (0.012)
TS_{3}  |  0.50  |  0.99  |  0.049 (0.012)
TS_{4}  |  0.50  |  0.90  |  0.056 (0.011)
TS_{5}  |  0.50  |  0.99  |  0.049 (0.009)
TS_{6}  |  0.50  |  0.57  |  0.100 (0.028)
In the second simulation study, in order to introduce the first-level clustering effect, we generate a clustered finite population. Specifically, we sort all of the 100,000 families by the number of individuals with genotype aa within the family. The 2500 PSUs are then formed by grouping every 40 families sequentially. One hundred PSUs are selected by two sampling schemes: (1) SRS and (2) PPS sampling, where the sizes for Clusters 1–500, 501–1000, 1001–1500, 1501–2000, and 2001–2500 are set to 1, 2, 3, 4, and 5, respectively. As a result, in the PPS sampling design, the sample weights depend on the genotypes, and families with a larger number of aa genotypes are oversampled. Table 2 presents the size and the power of TS_{1}–TS_{6} under SRS and PPS sampling of clustered families with 1P2O at a nominal level of 5%. Results for the 2P1O and 3O familial relationships and at a nominal level of 1% show a similar pattern and are therefore not shown. Under both designs, TS_{1}–TS_{3}, which consider the first level of clustering, maintain the nominal level. Among these, TS_{1}, which considers both levels of correlation, is the most powerful when the fixation coefficient r is 0.2 or 0.3. The variance is underestimated, especially under SRS, for TS_{4}–TS_{6}, whose variance ratios are much larger than one.
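The construction of the clustered population can be sketched in Python as follows. The helper names are hypothetical, the ascending sort order is an assumption (it is consistent with families carrying more aa genotypes receiving larger size measures), and the PPS draw is done with replacement purely for simplicity, since the text does not state the exact PPS scheme.

```python
import random


def form_psus(aa_counts, psu_size=40):
    """Sort family indices by number of aa members, then cut into consecutive PSUs."""
    order = sorted(range(len(aa_counts)), key=lambda f: aa_counts[f])
    return [order[s:s + psu_size] for s in range(0, len(order), psu_size)]


def size_measures(n_psus, n_groups=5):
    """Size measures 1..n_groups assigned to consecutive equal blocks of PSUs."""
    block = n_psus // n_groups
    return [min(c // block, n_groups - 1) + 1 for c in range(n_psus)]


def pps_draw(sizes, n_draws, seed=None):
    """PPS selection of PSU indices (with replacement: a simplifying assumption)."""
    rng = random.Random(seed)
    return rng.choices(range(len(sizes)), weights=sizes, k=n_draws)


sizes = size_measures(2500)          # 1 for PSUs 1-500, ..., 5 for PSUs 2001-2500
sampled = pps_draw(sizes, 100, seed=1)
print(sizes[0], sizes[500], sizes[2499], len(sampled))
```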
Table 2. Results of six HWE tests for SRS and PPS sampling of clustered families with 1P2O at the 5% nominal level.
Test statistic  |  SRS: mean(p̂_{A})  |  SRS: VarRatio  |  SRS: Rejection rate  |  PPS: mean(p̂_{A})  |  PPS: VarRatio  |  PPS: Rejection rate
Fixation coefficient r = 0
TS_{1}  |  0.50  |  1.02  |  0.064  |  0.50  |  0.98  |  0.058
TS_{2}  |  0.50  |  1.01  |  0.066  |  0.50  |  0.98  |  0.057
TS_{3}  |  0.50  |  1.02  |  0.067  |  0.50  |  0.97  |  0.055
TS_{4}  |  0.50  |  12.21  |  0.571  |  0.50  |  6.28  |  0.448
TS_{5}  |  0.50  |  12.71  |  0.577  |  0.50  |  6.47  |  0.466
TS_{6}  |  0.50  |  7.11  |  0.619  |  0.50  |  4.55  |  0.456
Fixation coefficient r = 0.2
TS_{1}  |  0.50  |  0.98  |  0.501  |  0.50  |  0.93  |  0.581
TS_{2}  |  0.50  |  0.97  |  0.373  |  0.50  |  0.94  |  0.408
TS_{3}  |  0.52  |  0.83  |  0.423  |  0.52  |  0.92  |  0.413
TS_{4}  |  0.50  |  13.6  |  0.935  |  0.50  |  7.17  |  0.936
TS_{5}  |  0.50  |  14.47  |  0.891  |  0.50  |  7.72  |  0.886
TS_{6}  |  0.52  |  5.72  |  0.902  |  0.52  |  4.99  |  0.880
Fixation coefficient r = 0.3
TS_{1}  |  0.50  |  0.97  |  0.817  |  0.50  |  0.97  |  0.874
TS_{2}  |  0.50  |  0.97  |  0.667  |  0.50  |  0.98  |  0.714
TS_{3}  |  0.54  |  0.80  |  0.705  |  0.53  |  1.02  |  0.708
TS_{4}  |  0.50  |  14.55  |  0.989  |  0.50  |  8.66  |  0.994
TS_{5}  |  0.50  |  15.85  |  0.969  |  0.50  |  9.4  |  0.975
TS_{6}  |  0.54  |  5.50  |  0.972  |  0.53  |  6.36  |  0.972
For Tables 1 and 2, the results are produced under the design where the family members are sampled by SSRS(F) within the family. We implement the third simulation study under the design where family members are sampled by SSRS(GF), i.e., the within-family sampling depends on both genotypes and family relationships. As observed in Table 3, TS_{1} produced biased estimates of allele frequency p_{A}, and its type I error rate is highly inflated due to the numerical problems induced by the weighted inverse of the covariance matrix, i.e., , in the estimating equation (2). With SSRS(GF), the sample weights depend on both genotypes and family relationship, whereas the inverse of the covariance matrix for the family (hij) only considers the correlation from familial relationship. As we weight the inverse of the covariance matrix using weights reflecting the familial relationship as well as the genotypes, the resulting inverse of the covariance matrix will not be able to approximate the among all the family members within the family.
Table 3. Results of six HWE tests for two-stage clustered samples with PPS sampling of clustered families and SSRS(GF) sampling of family members at the 5% nominal level.
Test statistic  |  mean(p̂_{A})  |  VarRatio  |  Rejection rate
Fixation coefficient r = 0
TS_{1}  |  0.39  |  > 1000  |  0.988
TS_{1}_epnd  |  0.50  |  1.00  |  0.050
TS_{2}  |  0.50  |  0.99  |  0.051
TS_{3}  |  0.50  |  0.99  |  0.051
TS_{4}  |  0.50  |  5.07  |  0.454
TS_{5}  |  0.50  |  5.22  |  0.448
TS_{6}  |  0.50  |  4.10  |  0.402
Fixation coefficient r = 0.1
TS_{1}_epnd  |  0.50  |  1.02  |  0.344
TS_{2}  |  0.50  |  1.02  |  0.228
TS_{3}  |  0.51  |  1.01  |  0.240
TS_{4}  |  0.50  |  5.28  |  0.803
TS_{5}  |  0.50  |  5.86  |  0.724
TS_{6}  |  0.51  |  4.19  |  0.679
Fixation coefficient r = 0.2
TS_{1}_epnd  |  0.50  |  0.99  |  0.879
TS_{2}  |  0.50  |  0.99  |  0.648
TS_{3}  |  0.53  |  1.04  |  0.653
TS_{4}  |  0.50  |  5.48  |  0.992
TS_{5}  |  0.50  |  6.36  |  0.962
TS_{6}  |  0.53  |  4.66  |  0.955
To examine the problem with TS_{1} further, we design an SSRS(GF) with specified selection probabilities so that the within-family sample weights are integers. We then expand the data by the within-family weights to avoid using within-family weights in TS_{1}. In particular, if both parents are in the same genotypic stratum, we select one parent; otherwise, we select both parents. If all three offspring are in the same genotypic stratum, we select one offspring; if one offspring is in the aa stratum, we select one offspring from each stratum; if two offspring are in the aa stratum, we select all three offspring. As a result, the sample consists of families with differing family sizes and family relationships, and offspring with genotype aa are oversampled. The test statistic that considers both levels of correlation but works with the expanded data is referred to as TS_{1}_epnd. It can be observed in Table 3 that TS_{1}_epnd, TS_{2}, and TS_{3} maintain the nominal level and that, among them, TS_{1}_epnd achieves the highest power when r = 0.1 or 0.2.
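One plausible reading of the expansion step is that each sampled family member is replicated as many times as its integer within-family weight, so that the expanded records can then be analyzed without within-family weights. The Python sketch below illustrates that reading only; the record layout and function name are hypothetical, and the paper does not spell out the mechanics.

```python
def expand_by_weight(members):
    """Replicate each family member according to an integer within-family weight.

    members: list of (member_id, genotype, within_family_weight) tuples.
    Returns a list of (member_id, genotype) records, one per replicate.
    """
    expanded = []
    for member_id, genotype, w in members:
        if w != int(w) or w < 1:
            raise ValueError("expansion needs positive integer within-family weights")
        expanded.extend((member_id, genotype) for _ in range(int(w)))
    return expanded


# Hypothetical family: one parent with weight 2 and one offspring with weight 1.
family = [("parent1", "Aa", 2), ("offspring1", "aa", 1)]
print(expand_by_weight(family))
```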
Table 4. Results from the Hispanic Health and Nutrition Examination Survey data analysis where the genotype data are generated with allele frequency p_{A} = 0.5 and varying fixation coefficient r.
Test statistic  |  p̂_{A} (r = 0)  |  p̂_{A} (r = 0.05)  |  p̂_{A} (r = 0.1)  |  p̂_{A} (r = 0.2)  |  p-value (r = 0)  |  p-value (r = 0.05)  |  p-value (r = 0.1)  |  p-value (r = 0.2)
TS_{1}  |  0.50  |  0.51  |  0.51  |  0.50  |  0.64  |  0.19  |  0  |  0
TS_{2}  |  0.50  |  0.50  |  0.51  |  0.50  |  0.29  |  0.04  |  0.01  |  0
TS_{3}  |  0.50  |  0.51  |  0.52  |  0.54  |  0.28  |  0.03  |  0.01  |  0
Example from the Hispanic Health and Nutrition Examination Survey with Generated Genotype Data
We use data from the Mexican-American part of the HHANES (NCHS, 1985), which consisted of household interviews and physical examinations conducted between 1982 and 1984 for a random sample of noninstitutionalized Mexican Americans aged 6 months to 74 years residing in selected areas in the southwestern part of the US.
The HHANES had a multistage stratified cluster sample design; see Gonzalez et al. (1985) and NCHS (1985) for further details about the sample design. At the last stage of sampling, individuals were sampled from participating households with rates based on age: 75% for 2–19 years, 50% for 20–44 years, and 100% for 45–74 years. This within-household sampling was carried out systematically from a household roster obtained by the interviewer. Within a sampled household, all household members related by blood, marriage, or adoption were considered to be a family. To illustrate our methods, we restricted our analyses to families with a single adult or with married adults, with or without children. The number of sampled members from these sampled families ranged from 1 to 10 with a median of 2. Since our programs were set up for nuclear families with no more than five sampled members, we randomly selected five individuals from each of the 67 families with more than five members sampled. We adjusted their final weights by multiplying by the number of sampled family members divided by five. There were 2666 sampled families, of which 683, 761, 545, 416, and 261 had 1, 2, 3, 4, and 5 persons sampled, respectively. Thus, a total of 6809 sampled individuals are involved in the data analysis.
We applied the final sample weight for each sampled individual, w_{hijk}, which reflects the inclusion probability at each stage of sampling as well as nonresponse and poststratification adjustments. Since the HHANES did not genotype its sampled individuals, we generated genotype data using a two-step procedure. In step 1, the allele A frequency p_{A} was generated from the Beta distribution p_{A} ∼ Beta((1−r) p̄_{A}/r, (1−r)(1−p̄_{A})/r) when r ≠ 0, where r is the fixation coefficient and p̄_{A} is the mean allele A frequency. When r = 0, p_{A} takes the value p̄_{A}. In step 2, for each parent, two alleles were drawn at random from the binomial distribution Bin(2, p_{A}). Given the parental genotypes, the genotype of a child was randomly generated according to Mendel's laws.
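The two-step generation procedure can be sketched in Python using only the standard library. This is an illustrative sketch rather than the authors' code: it assumes the Beta distribution is parameterized so that its mean equals the target allele frequency (here the argument `p_bar`), that a new p_A is drawn for each family, and that genotypes are coded as the number of A alleles (0, 1, or 2).

```python
import random


def draw_allele_freq(p_bar, r, rng):
    """Step 1: draw p_A from a Beta distribution with mean p_bar (p_bar itself if r = 0)."""
    if r == 0:
        return p_bar
    return rng.betavariate((1 - r) * p_bar / r, (1 - r) * (1 - p_bar) / r)


def draw_parent(p_a, rng):
    """Step 2: parental genotype as the number of A alleles, distributed Bin(2, p_A)."""
    return sum(rng.random() < p_a for _ in range(2))


def draw_child(mother, father, rng):
    """Mendelian transmission: each parent passes on one of its two alleles at random."""
    def transmit(n_a):
        return 1 if rng.random() < n_a / 2 else 0
    return transmit(mother) + transmit(father)


def simulate_family(p_bar, r, n_children, seed=None):
    """Generate (mother, father, [children]) genotypes for one family."""
    rng = random.Random(seed)
    p_a = draw_allele_freq(p_bar, r, rng)  # assumption: one draw per family
    mother, father = draw_parent(p_a, rng), draw_parent(p_a, rng)
    children = [draw_child(mother, father, rng) for _ in range(n_children)]
    return mother, father, children


print(simulate_family(0.5, 0.1, n_children=2, seed=42))
```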
We varied the fixation coefficient r over the values 0, 0.05, 0.1, and 0.2. Recall from the simulation studies in the previous section that results for allele A frequencies of 0.3 and 0.5 have similar patterns; we therefore set the mean allele A frequency to 0.5. Table 4 presents the estimates of allele frequency and the p-values for testing the null hypothesis of Hardy-Weinberg Equilibrium (HWE). The three test statistics (TS_{1}–TS_{3}) that take account of the correlation induced by the selection of the families are applied to test HWE; recall that TS_{1}–TS_{3} correspond to full, partial, and no consideration of the correlation from the genetic inheritance within the family, respectively. Consistent with results from the simulation studies, the allele frequency estimates are approximately unbiased across varying fixation coefficients r when we fully or partially consider the within-family correlation, whereas the estimates become biased as r increases if the within-family correlation is not accounted for. All the test statistics fail to reject HWE when r = 0 and reject HWE when r ≥ 0.05, except for TS_{1} when r = 0.05 (p-value = 0.19), which could be due to random chance. We check this by running an extra five sets of independently generated genotype data; TS_{1} rejects HWE with p-values of (0.04, 0.01, 0.00, 0.00, 0.04) when r = 0.05.
We know the putative within-family inclusion probabilities are 0.5, 0.75, or 1.0 in HHANES if the k^{th} individual in family (hij) is aged <20 years, 20–44 years, or ≥45 years, respectively. Therefore, we set the within-family weights to be 2 if the individual is <20 years old; 2 if the person is 20–44 years old but only one parent is selected in the associated family; 1 if the person is 20–44 years old and both parents are selected in the associated family; and 1 if the person is ≥45 years old. The within-family weights for 20–44 years only approximate the actual sample weights, but they were used here to illustrate the expansion method, which requires integer within-family weights. Moreover, when both parents are selected in the same family, their within-family weights are assigned the value of 1. For the construction of family-level weights, we follow Korn & Graubard (2003). As expected, TS_{1}_epnd and TS_{1} produce similar results in this example (results are not shown).
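The within-family weight assignment described above amounts to a small decision rule, sketched here in Python. The function name and arguments are hypothetical, and the rule simply transcribes the age bands and the both-parents condition stated in the text.

```python
def within_family_weight(age_years, both_parents_selected):
    """Integer within-family weight used for the expansion method.

    age_years: age of the sampled person in years.
    both_parents_selected: True if both parents in the family were sampled.
    """
    if age_years < 20:
        return 2
    if age_years < 45:
        # Ages 20-44: the integer weight approximates the actual sample weight.
        return 1 if both_parents_selected else 2
    return 1


for age, both in [(12, False), (30, False), (30, True), (60, False)]:
    print(age, both, within_family_weight(age, both))
```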
Discussion and Conclusion
Most conventional genetic studies can be classified into two categories: (1) population-based studies, in which samples of unrelated individuals are selected, and (2) family-based studies, in which samples of families are selected and individuals within each family are genetically related. In most genetic studies the sampling design for either category is specified and considered in the analysis, e.g., in testing HWE. In contrast to these two categories, this paper is motivated by population-based family data collected from National Household Genetic Surveys, in which families are selected with a well-specified complex sampling design, usually involving stratification and hierarchical geographic clustered sampling. As a result, there are two levels of clustering of the collected genetic data in National Household Genetic Surveys. The first level of correlation is induced by the hierarchical geographic clustered sampling of households; the second level is induced by biological inheritance among individuals sampled in the same household. The genetic frequencies can be correlated within geographic clusters, which can inflate the variances of the estimates of these frequencies. This paper also shows that proper consideration of the genetic correlation within families can improve efficiency for testing HWE. Most national household surveys apply differential probabilities of selecting households and individuals within the sampled households across strata, usually defined by demographic or phenotypic characteristics, which can be correlated with the genetic markers of interest. Ignoring these selection probabilities in testing HWE can lead to biased estimates of genetic frequencies (Li & Graubard, 2009). Therefore, these key sampling features need to be considered in our test methods in order to draw valid and efficient inferences for genetic data collected in National Household Genetic Surveys.
In this paper, we develop efficient statistical methods that consider two levels of clustering (correlation) effects, with the first level induced by the hierarchical cluster sampling of families and the second level induced by biological inheritance within the family. By examining the magnitude of each of the two levels of clustering effects under different scenarios, we make the following observations. First, when genotypes among families are independent, i.e., there is no first-level clustering effect from sampling families, the test statistics considering the first-level correlation, the second-level correlation, or both maintain the nominal level. Second, when genotypes among families are correlated, only test statistics considering the first-level correlation maintain the nominal level. The highest power is achieved if the second-level correlation is also taken into account. Third, when within-family sampling is related to the genotypes, the test statistic that considers both levels of correlation but works with expanded data maintains the nominal level and achieves the highest power.
The proposed test is developed for testing HWE in the general population from which the sample is randomly selected. Therefore, the proposed test would be applied to all the data regardless of the trait. In conventional case-control studies, where the sampling fractions for the cases and controls are not used to weight the observations, the cases would be overrepresented in the combined sample of unweighted cases and controls, which would affect testing for HWE if the genes are related to case status. Thus, using only the controls to test HWE, as is often done in conventional case-control association studies, would closely reflect testing HWE in the population at risk. In population-based surveys, where our approach weights the observations to reflect the underlying population, using the entire sample of weighted cases and controls to test HWE is appropriate.
In typical surveys, sampling is first conducted to select individuals, and the sampled individuals are then genotyped. Therefore, the scenario in which individuals are sampled according to their genotypes is not practical. Instead, individuals could be selected by their phenotypic characteristics, e.g., diseased or disease-free, which are often correlated with certain susceptible genetic variations. The magnitude of this correlation will differ depending on the susceptible genetic variations of interest. In the simulation, we assumed a complete correlation of ρ = 1 to study how our proposed test performs under this extreme case. Table 3 shows that TS_{1} produced biased estimates of allele frequency p_{A} and that the type I error rate is inflated when within-family selection is highly related to the genotypes. Under more moderate cases in which ρ < 1, the inflation of the type I error rate of TS_{1} would be less than when ρ = 1. As a remedy, the proposed TS_{1}_epnd maintains the nominal level and achieves the highest power.
Testing HWE has been widely recommended as a crucial step in genetic association studies. Two-stage testing procedures are frequently proposed, e.g., Jahnes et al. (2002), where testing HWE is the preliminary step before testing for association between alleles and disease, with the caution that the type I error can be distorted (Salanti et al., 2005; Zou, 2006; Zou & Donner, 2006).
In conclusion, based on our small-sample empirical results, the test statistics considering both levels of correlation should be used in population-based household surveys with multiple blood-related individuals sampled from the same household. However, if within-family sampling is related to the genotypes, then the test statistic considering only the first level of correlation should be employed.