Keywords:

  • Complex sampling;
  • condensed coefficients of identity;
  • quasi-score test;
  • survey data;
  • Taylor linearization

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Methods
  5. Simulation Implementation and Results
  6. Example from the Hispanic Health and Nutrition Examination Survey with Generated Genotype Data
  7. Discussion and Conclusion
  8. Acknowledgements
  9. References

In population-based household surveys, for example, the National Health and Nutrition Examination Survey (NHANES), blood-related individuals are often sampled from the same household. Genetic data collected from national household surveys are therefore often correlated through two levels of clustering: one induced by the multistage geographical cluster sampling, the other by biological inheritance among multiple participants within the same sampled household. In this paper, we develop efficient statistical methods that account for the weighting effect induced by the differential selection probabilities in complex sample designs, as well as the two clustering (correlation) effects described above. We examine and compare the magnitude of each level of clustering effect under different scenarios and identify the scenario under which the clustering effect induced by one level dominates the other. The proposed method is evaluated via Monte Carlo simulation studies and illustrated using the Hispanic Health and Nutrition Examination Survey (HHANES) with simulated genotype data.


Introduction

Household medical examination surveys from various countries, such as the Canadian Health Measures Survey (Tremblay et al., 2007), the Health 2000 Survey from Finland (Heistaro, 2008), and the US National Health and Nutrition Examination Surveys (NHANES, 2011a), have collected blood from which DNA can be extracted for genetic analyses. For example, the Third National Health and Nutrition Examination Survey (NHANES III), a nationally representative survey of the U.S. population conducted by the National Center for Health Statistics (NCHS), has genotyped candidate genes for participants 12 years and older. NHANES III employed a complex sample design using stratified multistage cluster sampling to select participants (Moonesinghe et al., 2010; NHANES, 2011b), and blood-related individuals are often sampled from the same household. On average, 1.6 persons were sampled per household, and sampled persons living in the same household may be blood related (Katki et al., 2010).

Testing Hardy-Weinberg Equilibrium (HWE) of marker genotype frequencies has been widely recommended as a crucial step in genetic association studies. Several methods have been developed to test HWE in simple random samples (SRS) (Guo & Thompson, 1992; Weir, 1996; Ayres & Balding, 1998; Shoemaker et al., 1998; Montoya-Delgado et al., 2001; Wigginton et al., 2005). For genetic studies with SRS of families, a method for testing HWE has been developed that accounts for the genetic correlation between related family members (Bourgain et al., 2004) to improve statistical power. Genetic data collected from national household surveys are often correlated because of multistage geographical cluster sampling, as well as possible biological inheritance within the household. For example, in NHANES, counties or cities are sampled at the first stage, with nested cluster sampling at the later stages. In addition, sample weights that incorporate differential sampling rates (e.g., minorities are sampled at higher rates than Caucasians in NHANES III), adjustments for non-response, and calibration to census totals are used to obtain unbiased estimates of allele frequencies. Moonesinghe et al. (2010) and Li & Graubard (2009) proposed several tests that account for the sample weighting and geographical clustering resulting from complex sample designs. However, they did not specifically consider the clustering effect due to biological inheritance within the family. She et al. (2009) took account of both levels of clustering for testing HWE. However, their methods do not allow for differential selection probabilities within the family, which commonly occur in population-based household surveys. For example, in NHANES III, adolescents aged 12–19 years and persons aged 60 years and older are sampled at higher rates within a sampled household. In addition, only one type of familial relationship, i.e., two parents and one offspring, was considered by She et al. (2009).

In this paper, we provide efficient statistical methods for addressing two levels of clustering (correlation) effects: (1) the correlation due to the multistage geographic cluster sampling induced by complex sample designs and (2) the correlation due to biological inheritance among individuals sampled in the same household. These methods also address the sample weighting that reflects the differential probabilities of selecting households and of selecting persons within the sampled households. We examine and compare the relative magnitudes of the correlation effects from the geographical clustering and from the related family members under different scenarios, and further identify the scenario under which the correlation effect induced by one dominates the other. The statistical methodology is described in the next section. We evaluate the performance of the new tests in terms of type I error rate and power via Monte Carlo simulation studies. We illustrate our methods using the Hispanic Health and Nutrition Examination Survey (HHANES) (NCHS, 1985) with simulated genotype data. Even though HHANES did not genotype its blood samples, this survey, unlike NHANES III, provides documentation about within-family sample weights, which allowed us to apply one of our proposed methods.

Methods

We consider household surveys with stratified multistage cluster sample designs such as that used by NHANES. These types of sample designs are described briefly as follows: The population of individuals is subdivided into disjoint primary sampling units (PSUs) usually based on the geographic locations of residence. For example, PSUs can be small cities or counties or contiguous cities/counties. The PSUs are grouped into strata so that they are approximately homogeneous with respect to certain demographic and geographic characteristics. At the first stage of sampling, a random sample of PSUs is selected from each stratum. At the second stage, smaller geographical units, so called secondary sampling units (SSUs), are randomly sampled from the sampled PSUs. Households/families are further randomly selected from the sampled SSUs, and at the final stage, individuals are randomly selected from sampled households/families. For each sampled individual, the inclusion probability is the product of the probabilities of sampling the units at each stage of sampling, and the corresponding sample weight is defined as the inverse of the inclusion probability. In most surveys the sample weights also involve adjustments for nonresponse and poststratification. The sample weight can be considered as the number of people in the population represented by the sampled individual.
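The weight construction described above can be sketched in a few lines of Python. This is only an illustration: the function name, the adjustment-factor arguments, and the numeric selection probabilities are hypothetical and are not taken from NHANES.

```python
def sample_weight(stage_probs, nonresponse_adj=1.0, poststrat_adj=1.0):
    """Base weight = 1 / (product of stage-wise selection probabilities),
    optionally scaled by hypothetical nonresponse and poststratification factors."""
    inclusion = 1.0
    for p in stage_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("selection probabilities must lie in (0, 1]")
        inclusion *= p
    return nonresponse_adj * poststrat_adj / inclusion


# Hypothetical probabilities for the PSU, SSU, household, and person stages.
w = sample_weight([0.1, 0.2, 0.05, 0.5])
print(w)  # roughly the number of people this sampled person represents
```

With these made-up probabilities the inclusion probability is 0.0005, so the sampled person stands for about 2000 people in the population.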

Let there be $H$ strata, with $I_h$ PSUs in the $h$th stratum. Within PSU $hi$, for $i = 1, 2, \ldots, I_h$, data are collected on $J_{hi}$ families, with $K_{hij}$ individuals selected in the $j$th family for $j = 1, 2, \ldots, J_{hi}$. Consider a locus with $a$ distinct alleles, and define $M = a(a+1)/2$, the number of possible distinct genotypes. The data are collected on the vector of variables $y_{hijk} = (y_{hijk.1}, \ldots, y_{hijk.g}, \ldots, y_{hijk.M-1})$ for each sampled individual, where $y_{hijk.g}$ equals 1 if individual $hijk$ has genotype $g$ and 0 otherwise, for $g = 1, \ldots, M-1$.
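As a small illustration of this coding, the sketch below enumerates the M = a(a+1)/2 unordered genotypes and builds the (M-1)-dimensional indicator vector for one individual. The genotype ordering and the function names are assumptions made for the example; the text does not fix an ordering.

```python
from itertools import combinations_with_replacement


def genotypes(a):
    """All unordered genotypes for alleles 1..a, in an assumed lexicographic order."""
    return list(combinations_with_replacement(range(1, a + 1), 2))


def indicator(genotype, a):
    """(M-1)-vector: entry g is 1 if the individual has genotype g, else 0.
    The last genotype is the reference and is coded as all zeros."""
    g = tuple(sorted(genotype))
    return [1 if g == ref else 0 for ref in genotypes(a)[:-1]]


print(len(genotypes(3)))     # M for a locus with three alleles
print(indicator((2, 1), 2))  # heterozygote at a biallelic locus
```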

To characterize the departure from HWE, we follow Bourgain et al. (2004) and consider a model based on a fixation coefficient (the correlation between any pair of alleles within an individual). This model has $a$ distinct parameters: the $(a-1)$ independent allele frequencies $p = (p_1, \ldots, p_l, \ldots, p_{a-1})$ and the fixation coefficient $r$, with the following constraints on these parameters,

  • image

Note that the frequency of the last allele is $p_a = 1 - \sum_{l=1}^{a-1} p_l$. Define a parameter vector inline image. Under the null hypothesis of HWE, the fixation coefficient is $r = 0$, and $p$ is a nuisance parameter to be estimated.

Define $E(y_{hijk}) = \mu_{hijk}$ with $\mu_{hijk} = (\mu_{hijk.1}, \ldots, \mu_{hijk.g}, \ldots, \mu_{hijk.M-1})$. If genotype $g$ is homozygous (e.g., $l/l$), we have $\mu_{hijk.g} = (1-r)p_l^2 + r p_l$; if genotype $g$ is heterozygous (e.g., $l/l'$ with $l \neq l'$), $\mu_{hijk.g} = 2(1-r)p_l p_{l'}$. The estimating equations for the parameters inline image, assuming independence of the family members within each family, are given by
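The genotype means under the fixation-coefficient model can be checked numerically. The sketch below assumes the usual form of the model, (1-r)p_l^2 + r p_l for homozygotes and 2(1-r)p_l p_l' for heterozygotes, and takes the full vector of allele frequencies (including the last allele) as input; the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def genotype_means(p, r):
    """Genotype probabilities for allele frequencies p (summing to 1)
    and fixation coefficient r, keyed by 0-based allele index pairs."""
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    mu = {}
    for l, m in combinations_with_replacement(range(len(p)), 2):
        if l == m:
            mu[(l, m)] = (1 - r) * p[l] ** 2 + r * p[l]   # homozygote l/l
        else:
            mu[(l, m)] = 2 * (1 - r) * p[l] * p[m]        # heterozygote l/m
    return mu


# r = 0 recovers Hardy-Weinberg proportions; r > 0 shifts mass to homozygotes.
print(genotype_means([0.5, 0.5], 0.0))
print(genotype_means([0.5, 0.5], 0.2))
```

For any valid p and r the probabilities sum to one, since (1-r)(sum of p)^2 + r(sum of p) = 1.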

  • image(1)

where $w_{hijk}$ is an $(M-1)$ by $(M-1)$ diagonal matrix whose diagonal elements are the sampling weight associated with individual $hijk$. Estimating equation (1) assumes that the genotypes of sampled individuals are independent and does not take into account that the genetic information conveyed by family members is correlated. Accordingly, estimating equation (1) can be modified as

  • image(2)

where inline image and inline image are vectors representing values over family members k = 1, 2, …, Khij, and inline image represents a block-diagonal matrix whose block matrices are inline image.

To simplify the notation, we drop the index (hij) of the jth family in the ith PSU in the hth stratum, and use inline image to denote inline image and inline image, respectively. Without loss of generality, consider a biallelic locus (allele A and allele a, resulting in three genotypes AA, Aa, and aa) and a family structure of one parent and two offspring. We then have inline image, and its marginal mean is

  • image

The first derivatives of inline image with respect to inline image and r are, respectively,

  • image

and

  • image

For the construction of the covariance matrix inline image in estimating equation (2), we take account of the genetic correlation within the family. The covariance matrix for the (hij)th family can then be constructed as

  • image(3)

where SYS denotes that inline image is symmetric, and inline image's are 2 by 2 covariance matrices between the indicators of genotypes in the same individual (inline image) or between different individuals in the family (inline image). Following Bourgain et al. (2004), the inline image's are functions of the Condensed Coefficients of Identity (CCI) Δ= (Δ1, Δ2…,Δ9) for the nine condensed identity states and allele frequencies of p under H0 of r= 0. Here CCI is determined by the family relationship between two individuals (see Lange, 2002, page 82). For example, if the sampled family consists of one parent (P) and two full siblings (O1 and O2), then CCI between Parent-Offspring inline image and between the two full siblings inline image. Further calculation produces

  • image(4)
  • image(5)
  • image(6)

In practice, relationships between two individuals in the family can vary, e.g., parent-offspring, half siblings, full siblings, etc. Families with common familial relationships among sampled individuals will have the same covariance matrices. For example, all the families consisting of one parent and two full siblings will have the same covariance matrix (3).
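To make the role of the condensed coefficients of identity concrete, the following sketch computes the 2 by 2 cross-covariance between the genotype indicators of two non-inbred relatives at a biallelic locus under HWE. It is an illustration under stated assumptions, not the closed-form expressions of the paper: only Δ7, Δ8, and Δ9 (two, one, or zero alleles shared identical by descent) are taken to be non-zero, the indicators are assumed to code AA and Aa, and the function names are hypothetical.

```python
from itertools import product

GENOS = ("AA", "Aa", "aa")


def geno(x, y):
    """Unordered genotype label from two alleles coded 'A' or 'a'."""
    return {2: "AA", 1: "Aa", 0: "aa"}[(x == "A") + (y == "A")]


def pair_joint(p_A, d7, d8, d9):
    """Joint genotype distribution of a relative pair under HWE (r = 0)."""
    freq = {"A": p_A, "a": 1.0 - p_A}
    joint = {(g1, g2): 0.0 for g1 in GENOS for g2 in GENOS}
    for s, t in product("Aa", repeat=2):        # both alleles shared IBD
        joint[geno(s, t), geno(s, t)] += d7 * freq[s] * freq[t]
    for s, u, v in product("Aa", repeat=3):     # one allele s shared IBD
        joint[geno(s, u), geno(s, v)] += d8 * freq[s] * freq[u] * freq[v]
    for s, t, u, v in product("Aa", repeat=4):  # no alleles shared IBD
        joint[geno(s, t), geno(u, v)] += d9 * freq[s] * freq[t] * freq[u] * freq[v]
    return joint


def cross_cov(p_A, d7, d8, d9):
    """Covariance between (1[AA], 1[Aa]) of relative 1 and of relative 2."""
    joint = pair_joint(p_A, d7, d8, d9)
    mu = {"AA": p_A ** 2, "Aa": 2 * p_A * (1 - p_A)}
    return [[joint[g1, g2] - mu[g1] * mu[g2] for g2 in ("AA", "Aa")]
            for g1 in ("AA", "Aa")]


print(cross_cov(0.5, 0.0, 1.0, 0.0))    # parent-offspring
print(cross_cov(0.5, 0.25, 0.5, 0.25))  # full siblings
print(cross_cov(0.5, 0.0, 0.0, 1.0))    # unrelated pair
```

For a parent-offspring pair the AA/AA entry equals p^3(1-p), and for an unrelated pair every entry is zero, which corresponds to the zero off-diagonal blocks used for two parents.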

To test the null hypothesis H0 of r= 0, we propose a quasi-score test statistic. Let inline image denote the solution to inline image, where inline image is the first vector in the estimating equations (2) and inline image is partitioned in the same way as inline image. Under suitable conditions (Binder, 1983; Rao et al., 1998), a quasi-score test statistic,

  • image(7)

is asymptotically a χ2 distributed variable with (U-H) denominator degrees of freedom, where inline image is a consistent estimator of inline image, i.e., the covariance of inline image, and $U = \sum_{h=1}^{H} I_h$ is the total number of sampled PSUs.

To estimate inline image a resampling method, such as jackknife or balanced repeated replication (BRR), is particularly attractive in the case of stratified multistage sampling because post-stratification and unit non-response adjustment can be automatically taken into account through the use of appropriate replicate weights (Rust & Rao, 1996). For example, a jackknife estimator of inline image under stratified multistage sampling with Ih PSUs in the hth stratum is given by

  • image

Here inline image is obtained in the same manner as inline image, but with the data from cluster (hi) deleted and with jackknife weights in place of the original weights (Korn & Graubard, 1999). Alternatively, following a derivation similar to that in Rao et al. (1998), we can derive a Taylor linearization variance estimator

  • image

where inline image is the PSU-(hi) total with

  • image

and

  • image

evaluated at inline image, and inline image is the mean of the PSU totals in stratum h.

By assuming the sample of PSUs is selected without replacement, the Taylor linearization variance estimator inline image is expressed as a function of the sample variances of the PSU totals of Taylor deviates for each stratum. As a result, the correlations at the later sampling stages within the PSU, including the correlation among members of the same family, are automatically captured in inline image. Therefore, even if we do not account for the biological correlation within the family, i.e., we use an independence working correlation matrix in the Taylor deviate inline image, the PSU-level variance estimator remains robust and provides consistent variance estimates that capture the correlations at the PSU level and at later stages of clustering. However, our quasi-score test statistic can achieve higher power by incorporating appropriate biological correlations within the family through a working correlation matrix. This property is examined in the simulation studies in the next section.
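The two PSU-level variance estimators discussed in this section can be illustrated on a scalar estimator. The sketch below is not the estimator for the quasi-score in (7); it assumes the common textbook forms (a delete-one-PSU jackknife with reweighting inside the affected stratum, and a between-PSU variance of PSU totals of linearized deviates) and applies them to a weighted total and a weighted mean. The record layout, function names, and toy data are hypothetical, and every stratum is assumed to contain at least two PSUs.

```python
from collections import defaultdict


def linearization_variance(records, deviate):
    """Sum over strata of I_h/(I_h-1) * sum_i (z_hi - zbar_h)^2,
    where z_hi is the PSU total of deviate(record)."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["stratum"], rec["psu"])] += deviate(rec)
    by_stratum = defaultdict(list)
    for (h, _), z in totals.items():
        by_stratum[h].append(z)
    var = 0.0
    for zs in by_stratum.values():
        n = len(zs)
        zbar = sum(zs) / n
        var += n / (n - 1) * sum((z - zbar) ** 2 for z in zs)
    return var


def jackknife_variance(records, estimator):
    """Delete one PSU at a time; inflate the remaining weights in its stratum."""
    full = estimator(records)
    psus = defaultdict(set)
    for rec in records:
        psus[rec["stratum"]].add(rec["psu"])
    var = 0.0
    for h, ids in psus.items():
        n = len(ids)
        for dropped in ids:
            replicate = []
            for rec in records:
                if rec["stratum"] == h:
                    if rec["psu"] == dropped:
                        continue
                    rec = dict(rec, w=rec["w"] * n / (n - 1))  # jackknife weight
                replicate.append(rec)
            var += (n - 1) / n * (estimator(replicate) - full) ** 2
    return var


def weighted_total(records):
    return sum(r["w"] * r["y"] for r in records)


def weighted_mean(records):
    return weighted_total(records) / sum(r["w"] for r in records)


# Toy data: y is the proportion of A alleles carried by each person.
records = [
    {"stratum": h, "psu": i, "w": w, "y": y}
    for h, i, w, y in [
        (1, 1, 10.0, 1.0), (1, 1, 12.0, 0.5), (1, 2, 8.0, 0.0),
        (1, 2, 9.0, 0.5), (1, 3, 11.0, 1.0), (2, 1, 20.0, 0.5),
        (2, 1, 18.0, 0.0), (2, 2, 22.0, 1.0), (2, 2, 19.0, 0.5),
    ]
]
p_hat = weighted_mean(records)
w_sum = sum(r["w"] for r in records)
v_lin = linearization_variance(records, lambda r: r["w"] * (r["y"] - p_hat) / w_sum)
v_jk = jackknife_variance(records, weighted_mean)
print(p_hat, v_lin, v_jk)
```

For a linear estimator such as the weighted total, the two forms agree exactly; for the weighted mean they differ slightly because the jackknife recomputes the ratio in every replicate. Because both work with PSU-level quantities, any within-PSU correlation, including that among family members, is absorbed without being modeled.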

Simulation Implementation and Results

We let the finite population consist of N = 500,000 individuals in M = 2500 PSUs, with each PSU composed of 40 families of five members (two parents and three offspring). Considering a biallelic locus (allele A and allele a), the parental genotypes are generated independently according to a multinomial distribution with frequencies $p(aa) = (1-r)p_a^2 + r p_a$, $p(Aa) = 2(1-r)p_A p_a$, and $p(AA) = (1-r)p_A^2 + r p_A$. Given the parental genotypes, the genotypes of the three children are randomly generated according to Mendelian law.
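A minimal sketch of this family-generation step is shown below. It assumes the fixation-coefficient genotype frequencies (1-r)p^2 + rp for homozygotes and 2(1-r)p_A p_a for heterozygotes; the function names and the seed are illustrative and not taken from the original simulation code.

```python
import random

GENOTYPES = ("AA", "Aa", "aa")


def parent_genotype(p_A, r, rng):
    """Draw one parental genotype from the fixation-coefficient model."""
    p_a = 1.0 - p_A
    probs = [(1 - r) * p_A ** 2 + r * p_A,
             2 * (1 - r) * p_A * p_a,
             (1 - r) * p_a ** 2 + r * p_a]
    return rng.choices(GENOTYPES, weights=probs)[0]


def child_genotype(mother, father, rng):
    """Mendelian transmission: one randomly chosen allele from each parent."""
    pair = rng.choice(mother) + rng.choice(father)
    return "".join(sorted(pair))  # 'A' sorts before 'a', giving "AA", "Aa", or "aa"


def simulate_family(p_A, r, rng, n_children=3):
    """Two independently drawn parents followed by their offspring."""
    mother = parent_genotype(p_A, r, rng)
    father = parent_genotype(p_A, r, rng)
    children = [child_genotype(mother, father, rng) for _ in range(n_children)]
    return [mother, father] + children


rng = random.Random(2024)  # arbitrary seed for reproducibility
print(simulate_family(0.5, 0.2, rng))
```

Repeating `simulate_family` 100,000 times would give a population of the size described above.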

We employ two sampling designs for randomly selecting 100 PSUs from the total of 2500 PSUs: (1) simple random sampling (SRS) and (2) probability proportional to size (PPS) sampling. For the selection of family members within the family, stratified SRS (SSRS) is used with strata defined by (1) familial relationship (parent or offspring), denoted by SSRS(F), or (2) both genotype (aa or non-aa) and familial relationship, denoted by SSRS(GF). Specifically, under SSRS(F) we select three family members from each of the 40 families in the sampled cluster. The family relationships among the three members are varied: one parent and two offspring (1P2O), two parents and one offspring (2P1O), and three offspring (3O).

For purposes of comparison, in addition to our test statistic proposed in (7), denoted by TS1, we consider five tests, denoted by TS2–TS6, with varying estimators of the covariance inline image for family hij and of the covariance of the score with respect to the fixation coefficient r, inline image. Specifically, inline image in TS2 ignores the covariances between any pair of family members, that is, the off-diagonal blocks in matrix (3) are replaced by zero matrices, while inline image in TS3 ignores the covariances between any pair of family members as well as within each family member, that is, matrix (3) is an identity matrix. In contrast to TS1–TS3, tests TS4–TS6 estimate inline image without considering the correlation from the selection of families. In other words, inline image is estimated by the last diagonal component of the inverse of the information matrix. Note that TS4 can be regarded as the weighted version of the test proposed by Bourgain et al. (2004), which considers the correlation from the genetic inheritance within the family but ignores the correlation induced by hierarchical clustered sampling of families. Under designs without differential sample weighting, e.g., common weights of one under SRS, TS4 is equivalent to the test of Bourgain et al. (2004).

The working covariance matrix for each family inline image depends on both the family size and the familial relationship of the selected family members. For example, if the family size is K = 1, then the covariance matrix is matrix (4); if K = 2, there are three possible relationships: (a) two parents, (b) one parent and one offspring, and (c) two full siblings. All three covariance matrices have the same diagonal blocks, given by (4). The off-diagonal blocks for the three covariance matrices are, respectively, zero matrices, (5), and (6). For example, for relationship (b), consisting of one parent and one offspring, the covariance matrix is given by

  • image

If K = 3, there are three possible relationships of (a) Parent1-Parent2-Offspring, (b) Parent-Offspring1-Offspring2, and (c) Offspring1-Offspring2-Offspring3. The diagonal blocks for all three covariance matrices will be matrix (4); while the off-diagonal blocks will depend on the relationships, being zero matrices, (5) and (6), respectively. For example, for the relationship of Parent-Offspring1-Offspring2 the covariance matrix is given by (3).

To evaluate the performance of TS1–TS6, we calculate the rejection rate, i.e., the proportion of the 1000 simulation runs for which the p-value is less than the significance level α. The rejection rate is the test size when the fixation coefficient r = 0 and the power when r > 0. In addition, two evaluation quantities are measured: (1) mean($\hat{p}_A$), i.e., the mean of the 1000 parameter estimates $\hat{p}_A$; and (2) VarRatio = var1(inline image)/var2(inline image), where var1(inline image) is estimated by the variance of the 1000 estimates inline image and var2(inline image) is the mean of the 1000 estimated variances of inline image. Throughout the simulation studies, we set the allele A frequency $p_A$ to 0.3 and 0.5. Only results with $p_A$ = 0.5 are shown; a similar pattern of results was produced when $p_A$ = 0.3.
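These summary measures are simple to compute from the per-run output. The sketch below assumes each simulation run yields a p-value, a point estimate, and an estimated variance for that estimate; the function names and toy numbers are illustrative.

```python
from statistics import mean, variance


def rejection_rate(p_values, alpha=0.05):
    """Proportion of runs with p-value below alpha (size if r = 0, power if r > 0)."""
    return sum(p < alpha for p in p_values) / len(p_values)


def var_ratio(estimates, estimated_variances):
    """Empirical variance of the estimates over the mean of the estimated variances.
    Values well above 1 indicate that the variance estimator is too small."""
    return variance(estimates) / mean(estimated_variances)


# Toy inputs standing in for the 1000 simulation runs.
print(rejection_rate([0.01, 0.20, 0.04, 0.50]))
print(var_ratio([0.4, 0.5, 0.6], [0.01, 0.01, 0.01]))
```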

In the first simulation study, we select 100 PSUs by SRS. Within the family, SSRS(F) is applied to select family members with the familial relationship 2P1O, 1P2O, or 3O. Table 1 presents the results for the varying familial relationships at nominal levels of 5% and 1%. The sizes of TS1–TS5 maintain the nominal level across familial relationships. Since genotypes are independent among families, there is no first-level clustering effect from sampling families. Accordingly, in the variance estimation, only the second level of correlation, from genetic inheritance within the family, needs to be considered. As a result, TS4 maintains the nominal level. TS3, which considers only the first-level correlation, also maintains the nominal level. TS6, which ignores both levels of correlation, has inflated test sizes across familial relationships.

Table 1. Results of six HWE tests for SRS sampling of independent families with varying familial relationships under H0: r = 0 at nominal levels of 5% (1%).

| Test statistic (note 1) | mean($\hat{p}_A$) | VarRatio (note 2) | Rejection rate (note 3) at nominal level of 5% (1%) |
|---|---|---|---|
| 2 parents and 1 offspring (2P1O) | | | |
| TS1 | 0.50 | 0.98 | 0.045 (0.010) |
| TS2 | 0.50 | 0.98 | 0.045 (0.010) |
| TS3 | 0.50 | 0.98 | 0.045 (0.011) |
| TS4 (note 4) | 0.50 | 0.99 | 0.044 (0.009) |
| TS5 | 0.50 | 0.99 | 0.044 (0.009) |
| TS6 | 0.50 | 0.56 | 0.085 (0.021) |
| 1 parent and 2 offspring (1P2O) | | | |
| TS1 | 0.50 | 0.94 | 0.048 (0.008) |
| TS2 | 0.50 | 0.94 | 0.045 (0.008) |
| TS3 | 0.50 | 0.94 | 0.045 (0.009) |
| TS4 (note 4) | 0.50 | 0.93 | 0.049 (0.007) |
| TS5 | 0.50 | 0.94 | 0.046 (0.006) |
| TS6 | 0.50 | 0.53 | 0.081 (0.021) |
| 3 offspring (3O) | | | |
| TS1 | 0.50 | 0.99 | 0.049 (0.012) |
| TS2 | 0.50 | 0.99 | 0.051 (0.012) |
| TS3 | 0.50 | 0.99 | 0.049 (0.012) |
| TS4 (note 4) | 0.50 | 0.90 | 0.056 (0.011) |
| TS5 | 0.50 | 0.99 | 0.049 (0.009) |
| TS6 | 0.50 | 0.57 | 0.100 (0.028) |

  1. TS1-TS3 consider the correlation induced by the hierarchical cluster sampling of the families (first-level correlation), and correspond to full, partial, or no consideration of the correlation from the genetic inheritance within the family (second-level correlation), whereas TS4-TS6 do not consider the first-level correlation, but correspond to full, partial, or no consideration of the second-level correlation.

  2. VarRatio = var1(inline image)/var2(inline image), where var1(inline image) is estimated by the variance of the 1000 estimates inline image and var2(inline image) is the mean of the 1000 estimated variances of inline image.

  3. The rejection rate under the hypothesis of fixation coefficient r = 0 is the test size.

  4. TS4 is equivalent to the test proposed by Bourgain et al. (2004).

In the second simulation study, we generate a clustered finite population in order to introduce the first-level clustering effect. Specifically, we sort all of the 100,000 families by the number of individuals with genotype aa within the family. The 2500 PSUs are then formed by grouping every 40 families sequentially. One hundred PSUs are selected by two sampling schemes: (1) SRS and (2) PPS sampling, where the sizes for Cluster1–500, Cluster501–1000, Cluster1001–1500, Cluster1501–2000, and Cluster2001–2500 are specified to be 1, 2, 3, 4, and 5, respectively. As a result, in the PPS sampling design, the sample weights depend on the genotypes, and families with a larger number of aa genotypes are oversampled. Table 2 presents the size and the power of TS1–TS6 under SRS and PPS sampling of clustered families with 1P2O at a nominal level of 5%. Results for the 2P1O and 3O familial relationships and at a nominal level of 1% show a similar pattern and are therefore not shown. Under both designs, TS1–TS3, which consider the first level of clustering, maintain the nominal level. Among these, TS1, which considers both levels of correlation, is the most powerful when the fixation coefficient r is 0.2 or 0.3. For TS4–TS6, the variance of inline image is underestimated, especially under SRS, with variance ratios much larger than one.
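The construction of the clustered population can be sketched as follows. The ascending sort order, the function names, and the toy families are assumptions for illustration; the text states only that families are sorted by their number of aa genotypes and grouped 40 at a time, with size measures 1 to 5 assigned to consecutive blocks of 500 clusters.

```python
def form_psus(families, per_psu=40):
    """Sort families by their count of aa genotypes, then cut into consecutive PSUs."""
    ordered = sorted(families, key=lambda fam: sum(g == "aa" for g in fam))
    return [ordered[i:i + per_psu] for i in range(0, len(ordered), per_psu)]


def size_measure(psu_index, block=500):
    """PPS size measure for a 0-based PSU index: 1 for the first 500 clusters,
    2 for the next 500, and so on."""
    return psu_index // block + 1


# Four toy families grouped two per PSU instead of 40.
families = [
    ["aa", "aa", "Aa"],
    ["AA", "Aa", "Aa"],
    ["aa", "Aa", "AA"],
    ["AA", "AA", "AA"],
]
psus = form_psus(families, per_psu=2)
print(psus)
print([size_measure(i) for i in (0, 499, 500, 2499)])
```

Because families rich in aa genotypes land in the later clusters, which carry the larger size measures, selection under PPS becomes informative about genotype.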

Table 2. Results of six HWE tests for SRS and PPS sampling of clustered families with 1P2O at 5% nominal level.

| Test statistic (note 1) | SRS: mean($\hat{p}_A$) | SRS: VarRatio (note 2) | SRS: Rejection rate (note 3) | PPS: mean($\hat{p}_A$) | PPS: VarRatio (note 2) | PPS: Rejection rate (note 3) |
|---|---|---|---|---|---|---|
| Fixation coefficient r = 0 | | | | | | |
| TS1 | 0.50 | 1.02 | 0.064 | 0.50 | 0.98 | 0.058 |
| TS2 | 0.50 | 1.01 | 0.066 | 0.50 | 0.98 | 0.057 |
| TS3 | 0.50 | 1.02 | 0.067 | 0.50 | 0.97 | 0.055 |
| TS4 (note 4) | 0.50 | 12.21 | 0.571 | 0.50 | 6.28 | 0.448 |
| TS5 | 0.50 | 12.71 | 0.577 | 0.50 | 6.47 | 0.466 |
| TS6 | 0.50 | 7.11 | 0.619 | 0.50 | 4.55 | 0.456 |
| Fixation coefficient r = 0.2 | | | | | | |
| TS1 | 0.50 | 0.98 | 0.501 | 0.50 | 0.93 | 0.581 |
| TS2 | 0.50 | 0.97 | 0.373 | 0.50 | 0.94 | 0.408 |
| TS3 | 0.52 | 0.83 | 0.423 | 0.52 | 0.92 | 0.413 |
| TS4 (note 4) | 0.50 | 13.6 | 0.935 | 0.50 | 7.17 | 0.936 |
| TS5 | 0.50 | 14.47 | 0.891 | 0.50 | 7.72 | 0.886 |
| TS6 | 0.52 | 5.72 | 0.902 | 0.52 | 4.99 | 0.880 |
| Fixation coefficient r = 0.3 | | | | | | |
| TS1 | 0.50 | 0.97 | 0.817 | 0.50 | 0.97 | 0.874 |
| TS2 | 0.50 | 0.97 | 0.667 | 0.50 | 0.98 | 0.714 |
| TS3 | 0.54 | 0.80 | 0.705 | 0.53 | 1.02 | 0.708 |
| TS4 (note 4) | 0.50 | 14.55 | 0.989 | 0.50 | 8.66 | 0.994 |
| TS5 | 0.50 | 15.85 | 0.969 | 0.50 | 9.4 | 0.975 |
| TS6 | 0.54 | 5.50 | 0.972 | 0.53 | 6.36 | 0.972 |

  1. TS1-TS3 consider the correlation induced by the hierarchical cluster sampling of the families (first-level correlation), and correspond to full, partial, or no consideration of the correlation from the genetic inheritance within the family (second-level correlation), whereas TS4-TS6 do not consider the first-level correlation, but correspond to full, partial, or no consideration of the second-level correlation.

  2. VarRatio = var1(inline image)/var2(inline image), where var1(inline image) is estimated by the variance of the 1000 estimates inline image and var2(inline image) is the mean of the 1000 estimated variances of inline image.

  3. The rejection rate under the hypothesis of fixation coefficient r = 0 (> 0) is the test size (power).

  4. TS4 under SRS is equivalent to the test proposed by Bourgain et al. (2004); under PPS, the test of Bourgain et al. (2004) produced a biased estimate of pA and an inflated type I error under the null hypothesis r = 0, with mean inline image, VarRatio = 66.8, and rejection rate = 0.65.

The results in Tables 1 and 2 are produced under the design in which family members are sampled by SSRS(F) within the family. We implement the third simulation study under the design in which family members are sampled by SSRS(GF), i.e., the within-family sampling depends on both genotypes and family relationships. As observed in Table 3, TS1 produced biased estimates of the allele frequency pA, and its type I error rate is highly inflated, due to the numerical problems induced by the weighted inverse of the covariance matrix, i.e., inline image, in estimating equation (2). With SSRS(GF), the sample weights depend on both genotypes and family relationship, whereas the inverse of the covariance matrix inline image for family (hij) only considers the correlation from familial relationship. Because we weight the inverse of the covariance matrix using weights that reflect the genotypes as well as the familial relationship, the resulting weighted inverse cannot approximate the inline image among all the family members within the family.

Table 3. Results of six HWE tests for 2-stage clustered samples with PPS sampling of clustered families and SSRS(GF) sampling of family members at 5% nominal level.

| Test statistic (note 1) | mean($\hat{p}_A$) | VarRatio (note 2) | Rejection rate (note 3) |
|---|---|---|---|
| Fixation coefficient r = 0 | | | |
| TS1 | 0.39 | > 1000 | 0.988 |
| TS1_epnd | 0.50 | 1.00 | 0.050 |
| TS2 | 0.50 | 0.99 | 0.051 |
| TS3 | 0.50 | 0.99 | 0.051 |
| TS4 | 0.50 | 5.07 | 0.454 |
| TS5 | 0.50 | 5.22 | 0.448 |
| TS6 | 0.50 | 4.10 | 0.402 |
| Fixation coefficient r = 0.1 | | | |
| TS1_epnd | 0.50 | 1.02 | 0.344 |
| TS2 | 0.50 | 1.02 | 0.228 |
| TS3 | 0.51 | 1.01 | 0.240 |
| TS4 | 0.50 | 5.28 | 0.803 |
| TS5 | 0.50 | 5.86 | 0.724 |
| TS6 | 0.51 | 4.19 | 0.679 |
| Fixation coefficient r = 0.2 | | | |
| TS1_epnd | 0.50 | 0.99 | 0.879 |
| TS2 | 0.50 | 0.99 | 0.648 |
| TS3 | 0.53 | 1.04 | 0.653 |
| TS4 | 0.50 | 5.48 | 0.992 |
| TS5 | 0.50 | 6.36 | 0.962 |
| TS6 | 0.53 | 4.66 | 0.955 |

  1. TS1-TS3 consider the correlation induced by the hierarchical cluster sampling of the families (first-level correlation), and correspond to full, partial, or no consideration of the correlation from the genetic inheritance within the family (second-level correlation), whereas TS4-TS6 do not consider the first-level correlation, but correspond to full, partial, or no consideration of the second-level correlation.

  2. VarRatio = var1(inline image)/var2(inline image), where var1(inline image) is estimated by the variance of the 1000 estimates inline image and var2(inline image) is the mean of the 1000 estimated variances of inline image.

  3. The rejection rate under the hypothesis of fixation coefficient r = 0 (> 0) is the test size (power).

To examine the problem with TS1 further, we design an SSRS(GF) scheme with specified selection probabilities so that the within-family sample weights are integers. We then expand the data by the within-family weights to avoid using within-family weights in TS1. In particular, if both parents are in the same genotypic stratum, we select one parent; otherwise, we select both parents. If all three offspring are in the same genotypic stratum, we select one offspring; if one offspring is in the aa stratum, we select one offspring from each stratum; if two offspring are in the aa stratum, we select all three offspring. As a result, the sample consists of families with differing family sizes and family relationships, and offspring with genotype aa are oversampled. The test statistic that considers both levels of correlation but works with the expanded data is referred to as TS1_epnd. Table 3 shows that TS1_epnd, TS2, and TS3 maintain the nominal level and that, among these three tests, TS1_epnd achieves the highest power when r = 0.1 or 0.2.
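The data-expansion idea can be sketched in Python as below: each sampled family member is replicated as many times as its integer within-family weight, after which every record carries a within-family weight of one. The record layout and function name are hypothetical.

```python
def expand_by_weight(members):
    """Replicate each family member according to its integer within-family weight."""
    expanded = []
    for m in members:
        w = m["within_weight"]
        if w != int(w) or w < 1:
            raise ValueError("expansion requires positive integer within-family weights")
        # Each copy now represents exactly one family member.
        expanded.extend(dict(m, within_weight=1) for _ in range(int(w)))
    return expanded


# One parent selected to stand for both parents, plus one offspring.
family = [
    {"role": "parent", "genotype": "Aa", "within_weight": 2},
    {"role": "offspring", "genotype": "aa", "within_weight": 1},
]
for rec in expand_by_weight(family):
    print(rec)
```

The expanded family can then be paired with the working covariance matrix for its full set of members, so that no within-family weight enters the weighted inverse of that matrix.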

Table 4. Results from the Hispanic Health and Nutrition Examination Survey data analysis where the genotype data are generated with allele frequency pA = 0.5 and varying fixation coefficient r.

| Test statistic (note 1) | $\hat{p}_A$, r = 0 | $\hat{p}_A$, r = 0.05 | $\hat{p}_A$, r = 0.1 | $\hat{p}_A$, r = 0.2 | p-value, r = 0 | p-value, r = 0.05 | p-value, r = 0.1 | p-value, r = 0.2 |
|---|---|---|---|---|---|---|---|---|
| TS1 | 0.50 | 0.51 | 0.51 | 0.50 | 0.64 | 0.19 | 0 | 0 |
| TS2 | 0.50 | 0.50 | 0.51 | 0.50 | 0.29 | 0.04 | 0.01 | 0 |
| TS3 | 0.50 | 0.51 | 0.52 | 0.54 | 0.28 | 0.03 | 0.01 | 0 |

  1. TS1-TS3 consider the correlation induced by the selection of clusters (PSUs) of the families, and correspond to full, partial, or no consideration of the correlation from the genetic inheritance within the family.

Example from the Hispanic Health and Nutrition Examination Survey with Generated Genotype Data

We use data from the Mexican-American part of the HHANES (NCHS, 1985) that consisted of household interviews and physical examinations conducted between 1982 and 1984 for a random sample of non-institutionalized Mexican Americans aged 6 months to 74 years residing in selected areas in the southwestern part of the US.

The HHANES had a multistage stratified cluster sample design; see Gonzalez et al. (1985) and NCHS (1985) for further details about the sample design. At the last stage of sampling, individuals were sampled from participating households with rates based on age: 75% for 2–19 years, 50% for 20–44 years, and 100% for 45–74 years. This within-household sampling was carried out systematically from a household roster obtained by the interviewer. Within a sampled household, all household members related by blood, marriage, or adoption were considered to be a family. To illustrate our methods, we restricted our analyses to families with a single adult or with married adults, with or without children. The number of sampled members from these sampled families ranged from 1 to 10, with a median of 2. Since our programs were set up for nuclear families with no more than five sampled members, we randomly selected five individuals from each of the 67 families with more than five members sampled. We adjusted their final weights by multiplying by the number of sampled family members divided by five. There were 2666 sampled families, of which 683, 761, 545, 416, and 261 had 1, 2, 3, 4, and 5 person(s) sampled, respectively. Thus, a total of 6809 sampled individuals are involved in the data analysis.

We applied the final sample weight for each sampled individual, whijk, which includes the inclusion probability at each stage of sampling and non-response and post-stratification adjustments. Since the HHANES did not genotype their sampled individuals, we generated genotype data using a two-step procedure. In step 1, the allele A frequency pA was generated from the Beta distribution pA∼ Beta((1−r)inline image/r, (1−r) × (1−inline image)/r) when r ≠ 0. When r= 0, pA takes inline image, where r here is the fixation coefficient. In step 2, for each parent, two alleles were drawn at random from the binomial distribution Bin (2, pA). Given parental genotypes, the genotype of a child was randomly generated according to Mendelian law.
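The two-step genotype generation can be sketched as follows. The name `p_bar` for the target mean allele frequency, the function names, and the seed are assumptions; the text does not say at which level (for example, per PSU or per family) a new pA is drawn, so the sketch simply draws one value per call.

```python
import random


def draw_allele_frequency(p_bar, r, rng):
    """Step 1: pA ~ Beta((1-r)*p_bar/r, (1-r)*(1-p_bar)/r) when r != 0;
    pA is fixed at p_bar when r = 0."""
    if r == 0:
        return p_bar
    scale = (1.0 - r) / r
    return rng.betavariate(scale * p_bar, scale * (1.0 - p_bar))


def draw_parent_genotype(p_A, rng):
    """Step 2: the number of A alleles is Bin(2, p_A); map the count to a genotype."""
    n_A = sum(rng.random() < p_A for _ in range(2))
    return ("aa", "Aa", "AA")[n_A]


rng = random.Random(7)  # arbitrary seed
p_A = draw_allele_frequency(0.5, 0.1, rng)
print(p_A, [draw_parent_genotype(p_A, rng) for _ in range(2)])
```

With this parameterization the Beta distribution has mean `p_bar` and variance r × `p_bar` × (1 − `p_bar`), so a larger r gives more variable allele frequencies across draws and hence a larger departure from HWE in the pooled sample.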

We varied the fixation coefficient r over the values 0, 0.05, 0.1, and 0.2. Recall from the simulation studies that results for allele A frequencies of 0.3 and 0.5 showed similar patterns; we therefore set the mean allele A frequency to 0.5. Table 4 presents the estimates of allele frequency inline image and p-values for testing the null hypothesis of Hardy-Weinberg Equilibrium (HWE). The three TSs (TS1-TS3) that take account of the correlation induced by the selection of families are applied to test HWE; recall that TS1, TS2, and TS3 correspond to full, partial, and no consideration, respectively, of the correlation from genetic inheritance within the family. Consistent with results from the simulation studies, inline image are approximately unbiased across varying fixation coefficients r when we fully or partially consider the within-family correlation, while inline image becomes biased as r increases if the within-family correlation is not accounted for. All the TSs fail to reject HWE when r = 0 and reject HWE when r ≥ 0.05, except for TS1 when r = 0.05 (p-value = 0.19), which could be due to random chance. To check this, we ran an extra five sets of independently generated genotype data; TS1 rejected HWE in each of them, with p-values of 0.04, 0.01, 0.00, 0.00, and 0.04 when r = 0.05.
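
One way to see why HWE is rejected once r > 0: if genotypes are drawn as Bin(2, pA) while pA itself varies according to the Beta distribution described above, then the variance of pA equals r times the mean frequency times one minus the mean frequency, and averaging over pA yields an excess of homozygotes relative to HWE at the mean frequency. The sketch below computes these marginal frequencies. The function name is hypothetical, and the calculation relies on a standard property of the Beta distribution rather than on a formula quoted from the paper.

```python
def marginal_genotype_freqs(mean_freq, r):
    """Genotype frequencies (AA, Aa, aa) averaged over pA.

    Assumes pA ~ Beta((1-r)*m/r, (1-r)*(1-m)/r) with m = mean_freq, so that
    Var(pA) = r * m * (1 - m); r = 0 reduces to exact HWE proportions.
    """
    var = r * mean_freq * (1 - mean_freq)
    p_AA = mean_freq ** 2 + var                        # E[pA^2]
    p_Aa = 2 * mean_freq * (1 - mean_freq) * (1 - r)   # E[2 pA (1 - pA)]
    p_aa = (1 - mean_freq) ** 2 + var                  # E[(1 - pA)^2]
    return p_AA, p_Aa, p_aa

# Values of r used in the example, at a mean allele A frequency of 0.5.
for r in (0, 0.05, 0.1, 0.2):
    print(r, marginal_genotype_freqs(0.5, r))
```

At a mean frequency of 0.5, the heterozygote frequency falls from 0.5 at r = 0 to 0.4 at r = 0.2, which is the kind of departure the tests are designed to detect.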

The putative within-family inclusion probabilities in HHANES are 0.5, 0.75, or 1.0 when the kth individual in family (hij) is aged <20 years, 20–44 years, or ≥45 years, respectively. We therefore set the within-family weight to 2 if the individual is <20 years old; 2 if the individual is 20–44 years old and only one parent is selected in the family; 1 if the individual is 20–44 years old and both parents are selected in the family; and 1 if the individual is ≥45 years old. The within-family weights for ages 20–44 years only approximate the actual sample weights; they are used here to illustrate the expansion method, which requires integer within-family weights. Moreover, whenever both parents are selected in the same family, each parent is assigned a within-family weight of 1. For construction of family-level weights, we follow Korn & Graubard (2003). As expected, TS1_epnd and TS1 produce similar results in this example (results not shown).
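
The integer weighting rules above lend themselves to a short sketch. The code assumes that the expansion method amounts to replicating each sampled member's record as many times as the within-family weight; that reading is inferred from the requirement of integer weights, and the function names and record layout are hypothetical.

```python
def within_family_weight(age, parent_with_spouse_selected=False):
    """Integer within-family weight following the rules stated in the text."""
    if parent_with_spouse_selected:
        return 1   # both parents selected in the same family
    if age < 45:
        return 2   # <20 years, or 20-44 years with only one parent selected
    return 1       # 45 years or older

def expand_family(members):
    """Replicate each record `weight` times (assumed meaning of 'expansion').

    members: list of dicts with keys 'id', 'age', 'genotype', and
    'parent_with_spouse_selected' (hypothetical layout).
    """
    expanded = []
    for m in members:
        w = within_family_weight(m["age"], m["parent_with_spouse_selected"])
        expanded.extend(dict(m) for _ in range(w))
    return expanded
```

For a family with both parents selected and one sampled child, the parents each appear once in the expanded data and the child appears twice.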

Discussion and Conclusion

Most conventional genetic studies can be classified into two categories: (1) population-based studies, in which samples of unrelated individuals are selected, and (2) family-based studies, in which samples of families are selected and individuals within each family are genetically related. In most genetic studies, the sampling design for either category is specified and considered in the analysis, e.g., in testing HWE. Apart from these two categories, this paper is motivated by population-based family data collected from National Household Genetic Surveys, in which families are selected with a well-specified complex sampling design, usually involving stratification and hierarchical geographic cluster sampling. As a result, there are two levels of clustering in the genetic data collected in National Household Genetic Surveys. The first level of correlation is induced by the hierarchical geographic cluster sampling of households; the second level is induced by biological inheritance among individuals sampled in the same household. Genetic frequencies can be correlated within geographic clusters, which can inflate the variances of the estimates of these frequencies. This paper also shows that proper consideration of the genetic correlation within families can improve efficiency for testing HWE. Most national household surveys apply differential probabilities of selecting households, and individuals within the sampled households, across strata that are usually defined by demographic or phenotypic characteristics, which can be correlated with the genetic markers of interest. Ignoring these selection probabilities in testing HWE can lead to biased estimates of genetic frequencies (Li & Graubard, 2009). Therefore, these key sampling features need to be considered in our test methods in order to draw valid and efficient inferences for genetic data collected in National Household Genetic Surveys.

In this paper, we develop efficient statistical methods that consider two levels of clustering (correlation) effects, with the first level induced by the hierarchical cluster sampling of families and the second level induced by biological inheritance within the family. By examining the magnitude of each of the two levels of clustering effects under different scenarios, we make the following observations. First, when genotypes among families are independent, i.e., there is no first-level clustering effect from sampling families, the test statistics considering the first-level correlation, the second-level correlation, or both all maintain the nominal level. Second, when genotypes among families are correlated, only test statistics considering the first-level correlation maintain the nominal level, and the highest power is achieved if the second-level correlation is also taken into account. Third, when within-family sampling is related to the genotypes, the test statistic that considers both levels of correlation but works with expanded data maintains the nominal level and achieves the highest power.

The proposed test is developed for testing HWE in the general population from which the sample is randomly selected. Therefore, the proposed test would be applied to all the data regardless of the trait. In conventional case-control studies, where the sampling fractions for the cases and controls are not used to weight the observations, the cases are overrepresented in the combined unweighted sample of cases and controls, which would affect testing for HWE if the genes are related to case status. Thus, using only the controls to test HWE, as is often done in conventional case-control association studies, more closely reflects testing HWE in the population at risk. In population-based surveys, where our approach weights the observations to reflect the underlying population, using the entire weighted sample of cases and controls to test HWE is appropriate.

In typical surveys, individuals are first sampled and then genotyped. Therefore, the scenario in which individuals are sampled according to their genotypes does not arise in practice. Instead, individuals could be selected by phenotypic characteristics, e.g., diseased or disease-free, which are often correlated with certain genetic variations that confer susceptibility. The magnitude of this correlation will differ depending on the genetic variations of interest. In the simulation, we assumed a complete correlation of ρ = 1 to study how our proposed test performs in this extreme case. Table 4 shows that TS1 produced biased estimates of allele frequency pA and an inflated type I error rate when within-family selection is highly related to the genotypes. In more moderate cases, in which ρ < 1, the inflation of the type I error rate by TS1 would be smaller than when ρ = 1. As a remedy, the proposed TS1_epnd maintains the nominal level and achieves the highest power.

Testing HWE has been widely recommended as a crucial step in genetic association studies. Two-stage testing procedures, in which testing HWE is a preliminary step before testing for association between alleles and disease, are frequently proposed (e.g., Jahnes et al., 2002), with the caution that the type I error rate can be distorted (Salanti et al., 2005; Zou, 2006; Zou & Donner, 2006).

In conclusion, based on our small-sample empirical results, the test statistics considering both levels of correlation should be used in population-based household surveys with multiple blood-related individuals sampled from the same household. However, if within-family sampling is related to the genotypes, then the test statistic considering only the first level of correlation should be employed.

Acknowledgements

The authors are grateful to the anonymous referees for their constructive comments. The research was partially supported by NIH grant 5R01ES016626.

References
