• Open Access

Latent variable modeling paradigms for genotype-trait association studies



Characterizing associations among multiple single-nucleotide polymorphisms (SNPs) within and across genes, and measures of disease progression or disease status will potentially offer new insight into disease etiology and disease progression. However, this presents a significant analytic challenge due to the existence of multiple potentially informative genetic loci, as well as environmental and demographic factors, and the generally uncharacterized and complex relationships among them. Latent variable modeling approaches offer a natural framework for analysis of data arising from these population-based genetic association investigations of complex diseases as they are well-suited to uncover simultaneous effects of multiple markers. In this manuscript we describe application and performance of two such latent variable methods, namely structural equation models (SEMs) and mixed effects models (MEMs), and highlight their theoretical overlap. The relative advantages of each paradigm are investigated through simulation studies and, finally, an application to data arising from a study of anti-retroviral-associated dyslipidemia in HIV-infected individuals is provided for illustration.

1 Introduction

The increased availability of data on single-nucleotide polymorphisms (SNPs) has led to heighten interest in understanding how this genetic information correlates with measures of disease progression. One analytic challenge plaguing these genotype-trait association studies is the potential for multiple SNPs to be implicated in complex diseases. In this manuscript, we describe applications and performance of two latent variable paradigms, namely structural equation models (SEMs) and mixed effects models (MEMs), for addressing this challenge.

SEMs constitute a broad range of multivariate regression models that allow complex dependencies among multiple predictors and outcome variables and are widely used in economics, sociology and psychology (Pugesek et al., 2003; Rabe-Hesketh et al., 2004; Skrondal and Rabe-Hesketh, 2004). Several recent manuscripts extend the conventional measurement component of an SEM, conditional on latent variables, to the generalized linear model setting, rendering these models naturally conducive to continuous as well as categorical outcomes (Muthén, 1984; Muthén and Muthén, 2007; Skrondal and Rabe-Hesketh, 2005, 2004; Lee and Shi, 2001; Reboussin and Liang, 1998). Recent applications of SEMs to genetic data include those that aim to reconstruct the linkage disequilibrium structure among genes (Lee et al., 2007) as well as one study to characterize associations between multiple SNPs, smoking, gender and rheumatoid arthritis (Nock et al., 2007). MEMs, widely used to address correlations in repeated-measures and multi-level data (Laird and Ware, 1982), are an alternative latent variable modeling strategy that has been described for characterizing association between multiple SNPs, within and across genes, and a measured trait (Foulkes et al., 2005; Goeman et al., 2004; Foulkes and De Gruttola, 2002).

A growing body of literature exists on the methods for analyzing data arising from candidate gene association studies, including approaches targeted specifically at characterizing combinations of SNPs and their association with a measure of disease status or disease progression. Among these are most notably machine learning applications, including classification and regression trees (CART) (Zhang and Singer, 1999; Breiman et al., 1993), random forests (Bureau et al., 2005; Segal et al., 2004; Breiman, 2001), logic regression (Schwender and Ickstadt, 2008; Kooperberg and Ruczinski, 2005; Ruczinski et al., 2004, 2003; Kooperberg et al., 2001), lasso (Kooperberg et al., 2010; Wu et al., 2009; Tibshirani, 1996), elastic net (Kooperberg et al., 2010; Zou and Hastie, 2005) and Bayesian network (BN) analysis (Rodin and Boerwinkle, 2005; Pearl, 1988). The gains attributable to first-stage creation of meta-variables within these frameworks are also described, for example in Foulkes et al. (2004); Bastone et al. (2004) and Malovini et al. (2009). The former involves a first-stage, unsupervised clustering of individuals based solely on genotype data, followed by application of CART to characterize association, while the later involves a first-stage application of CART to identify clusters, followed by application of BN analysis to characterize association. The latent class approaches described herein similarly involve defining group indicators based on a collection of SNPs and, in turn, relating these to a measured trait for characterizing association; however, both the SEM and MEM approaches detailed below are distinct in that they involve fully parametric modeling of association and corresponding parameter estimation and testing. The present manuscript focuses on the overlap of these two specific latent class paradigms while additional details on several of the alternative approaches listed above, including discussion of their relative merits, can be found in Hastie et al. (2001); Gentleman et al. (2005); Schwender et al. (2008) and Foulkes (2009).

We begin by formalizing the SEM approach for genetic association studies and extend the research of Nock et al. (2007), to characterize broadly the performance under a range of underlying models of association (Section 2.1). Second, we present an extension of the MEM approach of Foulkes et al. (2005), for this setting that offers additional flexibility in defining the model of association through inclusion of cross-classified clusters (Section 2.2). We then highlight the theoretical overlap between SEMs and MEMs (Section 2.3) and explore, through simulation studies, the relative advantages of each approach (Section 3.1). Specifically, we focus on the flexibility and performance under model misspecification. Finally, we apply both approaches, as we as an alternative two-stage BN analysis, to data arising from a study of anti-retroviral therapy (ART)-associated dyslipidemia in HIV (Section 3.2).

2 Methods

2.1 Structural equation model for genetic association studies

We begin by describing how the SEM framework can be applied for analysis of data derived from genetic association studies, where the goal is to characterize associations between genotypes, within and across multiple genetic loci, and a single measure of disease progression or disease status. An extensive literature exists on SEMs, and correspondingly a variety of approaches to specifying the model have been described (Jöreskog, 1975; Bentler and Weeks, 1980; Muthén, 1984, 2002; Sánchez et al., 2005; Skrondal and Rabe-Hesketh, 2005; Muthén and Muthén, 2007). Here, we use the formulation given by Sánchez et al. (2005) and apply the measurement model described by Muthén (1984); Skrondal and Rabe-Hesketh (2005); and Muthén and Muthén (2007).

Let yi denote the trait under study, where equation image represents individuals. Further suppose equation image represents genotypes across S bi-allelic SNPs for individual i. Since each SNP is bi-allelic, we have equation image, where A and a represent the two possible nucleotides at site s and equation image. For simplicity of presentation, we drop the notational dependency of A and a on s. Finally, let equation image represent P measured covariates for individual i.

Similar to the approach described by Nock et al. (2007), we assume that each candidate gene has a corresponding latent variable, representing an unobservable effect of the corresponding gene on the trait. These latent variables are given for individual i by the vector equation image where equation image corresponds to gene k, equation image. The measurement component of an SEM relates the observed data components, Xi and yi, to the latent variables, Ui, while the structural component defines the relationship among the latent variables. These are formulated as follows:

equation image(1)
equation image(2)

where equation image and Γy are unknown parameters, and equation image. Here equation image is used to represent an appropriately defined link function, such as the probit or logit link for categorical and binary outcomes, respectively

equation image(3)

where α, B are unknown parameters, the diagonal elements of B are identically equal to 0 and (IB) is invertible (Sánchez et al., 2005). Here we assume equation image and equation image is independent of εi. In the genetic association study setting, interest is in characterizing the association between the latent variables, Ui, and the measured trait, given by yi. Formally, a test of association is given by a Wald test of the null hypothesis, equation image.

Notably, in the genetic association setting, where Xi represents SNPs as described above, many of the covariates represented in Zi will influence the trait yi but will not be directly predictive of Xi as described in Eq. (1). Covariates that are potentially relevant in this component of the measure model include surrogates for population substructure, such as race and ethnicity, as well as measures of exposure to environmental toxins, such as smoking status, that may result in oncogenic mutations within specific organ tissue. In this sense, Zi can be thought of as a partitioned matrix, given by equation image, where only the covariates represented in equation image are potentially predictive of Xi while the covariates given in both equation image and equation image are potentially predictive of yi. In turn, the element of Γx corresponding to equation image is identically equal to 0.

Visual path diagram representations of this model with one and two latent variables are given in Fig. 1A and B, respectively. Here, observed variables are represented by rectangles while latent variables are given by ovals. Dashed lines represent fixed, independent variables, whereas solid lines indicate dependent variables with corresponding distributional assumptions. Single-direction arrows represent causal relationships among variables while double-headed arrows represent non-zero correlations.

Figure 1.

Sample SEM path diagrams for genetic association studies. (A) Single latent variable SEM and (B) SEM with two latent variables.

2.2 Mixed effect model for genetic association studies

Distinct from the SEM setting, application of an MEM to SNP data is a staged approach that involves first assigning individuals to groups based on observed genotypes across multiple SNPs. These genotype group assignments are then treated as cluster indicators in the usual mixed modeling framework. While several approaches to the first-stage dimension reduction are tenable, such as hierarchical or K-means clustering (Hartigan, 1975), here we apply the simple deterministic approach of assigning individuals with identical multi-locus genotypes to the same genotype group, as described by Foulkes et al. (2004).

Again, we begin by letting equation image represent the multilocus genotype for individual i across S bi-allelic SNPs. Now suppose equation image represents the J groups resulting from assigning individuals with identical genotypes to the same group, that is equation image implies equation image. The MEM as described previously for this setting (Foulkes et al., 2005) can be formulated as follows:

equation image(4)

where ν and Γ are again unknown parameters, equation image, equation image is an indicator for individual i belonging to genotype group gj, equation image is a vector of corresponding random effects of genotype groups on the trait under study, equation image for equation image, equation image and bj and εi are independent. A likelihood ratio test of the null hypothesis H0: σb=0 is applied to assess the presence of a genotype-trait association.

More generally, a grouping variable can be defined for each of multiple genes. To see this, suppose now that Xi represents a vector of S SNPs across K genes. We assume nk SNPs are measured within gene k, such that equation image. Now equation image represents the groups corresponding to gene k where Jk is the number of such groups. Notably, in the setting of three-level SNPs, we have equation image, while for binary SNPs, equation image. The MEM for such cross-classified data is then given by

equation image(5)

where equation image is a vector of group membership indicators, equation image is defined as the genotype group random effects on yi for gene k, equation image and equation image. In this setting, a likelihood ratio test can again be applied to test the null hypothesis equation image for each gene k.

Visual representations of the MEMs for single- and multi-level clustering are given in Fig. 2A and B, using the same notation as described above for Fig. 1. Here, the broken lines indicate a deterministic relationship between SNPs – represented by equation image – and cluster assignments – represented by Ci. A few notable distinctions can be discerned by comparing Figs. 1 and 2. Most notably, in the SEM framework, we see that the SNP variables are treated as random, and modeled as a function of the latent variables, ui. In the MEM setting, on the other hand, these are treated as fixed and inform cluster assignments deterministically. Additionally, in the SEM setting, the latent variables – given by equation image and equation image – are person-specific and potentially correlated, while in the MEM framework the latent effects – given by equation image and equation image – correspond to genotype groups and are independent. Further discussion of theoretical overlap between the two modeling approaches is given in Section 2.3.

Figure 2.

Sample MEM path diagrams for genetic association studies. (A) Single level clustering MEM and (B) MEM with multi-level clustering.

2.3 A comparison of the SEM and MEM approaches

Both the SEM and MEM approaches, as formulated in Sections 2.1 and 2.2, involve modeling underlying latent variables that represent unobservable effects of genes on the trait under study. Indeed, it is straightforward to show that the SEM can be reduced to an MEM for this setting. To begin, consider the simple case of a single gene, and thus a single latent variable. We first omit the regression of Xi on the latent variable – Eq. (1) of the SEM – as the MEM treats the Xi as fixed. For the single gene setting, the models of Eqs. (2) and (3) reduce to

equation image(6)
equation image(7)

where equation image is now a scalar and, because there is only one assumed latent variable, B is identically equal to 0 and equation image. In order for this SEM to reduce to the MEM, we let λy equal the vector equation image and replace the individual level latent variable ui with the vector of random cluster effects equation image. Importantly, this is equivalent to making the assumption that the latent effects on the trait are the same for those individuals with the same observed genetic profile. Finally, we set α=0 and, together, these restrictions yield Eq. (4).

In the case of K=2 genes, we note that Eqs. (2) and (3) can be written as:

equation image(8)
equation image(9)

Now we replace equation image and equation image with the vectors equation image and equation image and replace equation image and equation image with equation image and equation image, respectively. Notably, as the lengths of equation image and equation image (as well as b1 and b2), given by J1 and J2, are not necessarily equal, these vectors need to be concatenated with vectors of 0 of appropriate length. That is, Eq. (8) is replaced with:

equation image(10)

where, without loss of generality, we assume J1<J2 and equation image is defined as a vector of length (J2J1) with all 0 elements. In order for this SEM to simplify to the MEM with two levels of clustering, we additionally need to assume α=0 and equation image. In other words, we must assume that the latent variables are uncorrelated and centered at 0.

In summary, and more generally for K>2 genes, we make the following three assumptions for the SEM to reduce to the MEM: (i) SNPs, represented by Xi, are fixed, so that the model of Eq. (1) is omitted; (ii) ΛyUi is given by equation image, that is individual level latent variables are the same for individuals within the same defined genotype group and (iii) α=B=0, that is latent variables across genes are mutually independent and centered at 0. In the MEM setting, the cluster random effects are assumed independent, although a correlation structure between bj and equation image could be imposed.

3 Applications

In the following sections we report the results of a simulation study and an application to a study of anti-retroviral-associated dyslipidemia in HIV. Restricted maximum likelihood is used to derive point estimates of parameters in the MEMs. A likelihood ratio test of equation image, comparing the full (mixed effects) and reduced (fixed effects only) model, is used to investigate the association between genotype groups and a measured trait. As this involves testing a parameter at a boundary, a mixture of a χ2 with 1 and 0 degrees of freedom is assumed for the resultant test statistic. All MEMs are fitted with the lme() function within the nlme package in R, Version 2.9.1. In the context of fitting SEMs, weighted least squares is applied to derive parameter estimates and a Wald test of the null hypothesis equation image, is reported. SEMs are fitted using Mplus Version 5.21.

3.1 Simulation studies

In this section we explore, through simulation studies, the relative practical performance of SEMs and MEMs under a range of underlying models of association. In each simulation, we generate 500 sets of data, each of size n=1000, for each combination of true parameter values. SEM data are simulated using the MONTECARLO Command in Mplus Version 5.21 (Muthén, 1984; Muthén and Muthén, 2007).

We begin by simulating data under an SEM with a single latent variable, according to Eqs. (1) and (6)–(7), where for simplicity of presentation, we let Γx=0. In scalar notation, this model can be rewritten as:

equation image

where we assume equation image, xis is a binary SNP indicator, equation image, equation image, equation image and equation image. For identifiability, α is set to 0 and equation image is restricted to 1. Furthermore, we define a threshold model in Mplus such that equation image. It is straightforward to show that the covariance between any two SNPs is then given by ψ. Finally, we set equation image, equation image and equation image, while the values of the remaining parameters, λy, equation image, equation image, and ψ, vary as described in Table 1.

Table 1. Simulation results under SEM with one latent variable.
DataTrue valueSEMsMEMs
  1. a

    (a) Median estimates are reported from 500 sets of data. (b) Absolute difference between Est. and true value. (c) Coverage rate, the percentage of confidence intervals (CIs) that cover true value among 500 CIs. For the CR of variance, the CIs that contain negative values are excluded for consideration. (d) Median length among the 500 length of CIs. (e) Wald test statistics of equation image is used to test the association between latent variable and measured trait. (f) Likelihood ratio test (LRT) of equation image is applied to investigate the association between genotype groups and measured trait. Here the approximate distribution of LRT under null hypothesis is 50:50 mixture of equation image and equation image.

1equation image0.000.950.
 equation image0.010.960.21
 equation image0.000.950.140.000.960.13
 equation image−0.020.940.731.001.00
2equation image0.000.950.
 equation image0.010.960.20
 equation image0.000.950.150.000.960.14
 equation image−0.010.950.371.001.00
3equation image0.000.940.
 equation image0.010.960.17
 equation image0.000.950.160.000.950.14
 equation image0.000.970.251.001.00
4equation image0.000.940.190.040.850.18
 equation image0.010.950.22
 equation image0.000.960.130.000.960.13
 equation image−0.010.960.301.001.00
5equation image0.010.950.390.690.000.30
 equation image0.000.960.18
 equation image0.000.950.200.000.950.16
 equation image−0.020.950.561.001.00

The results of fitting both SEMs and MEMs to these data are provided in Table 1. In general, both approaches have high power for detecting association, but the SEM approach performs better in terms of bias and coverage for the variance component, equation image. As ψ and λy increase while equation image remains fixed, the absolute bias associated with equation image for the MEM increases and the corresponding coverage rate (CR) is low. Notably, the type-1 error rate (under the model in which λy=0) is 0.03 for both the SEM and MEM approaches.

Second, we simulate data according to an SEM with two uncorrelated latent variables. That is, we let the data arise from Eqs. (1) and (8)–(9) with B=0. In this case, our model can be written as:

equation image

where equation image and equation image, that is we have 8 SNPs with 4 in each of two genes. We further assume equation image, equation image, equation image, all mutually independent and equation image. For identification, α1 and α2 are set to 0 and we restrict equation image. For the simulation, we set equation image, equation image, equation image, and define a threshold in Mplus such that equation image. Finally, the values of equation image, equation image, γy, equation image, ψ1, and ψ2 are assumed to vary as described in Table 2.

Table 2. Simulation results under SEM with two uncorrelated latent variables.
DataTrue valueSEMsMEMs
1equation image−0.010.940.300.250.000.22
 equation image0.000.950.21
 equation image0.000.940.21
 equation image0.000.940.150.000.950.14
 equation image0.000.960.811.001.00
 equation image0.000.960.821.001.00
2equation image0.000.940.280.340.000.24
 equation image0.000.960.20
 equation image0.000.940.20
 equation image0.000.950.170.000.930.15
 equation image0.000.970.411.001.00
 equation image0.000.940.411.001.00
3equation image0.000.930.270.370.000.24
 equation image0.000.940.18
 equation image0.000.940.17
 equation image0.000.950.180.000.940.15
 equation image0.000.970.271.001.00
 equation image0.000.930.271.001.00
4equation image0.000.940.611.370.000.42
 equation image0.000.950.19
 equation image0.000.930.19
 equation image0.000.960.250.000.940.19
 equation image−0.010.960.671.001.00
 equation image−0.010.940.661.001.00
5equation image0.000.950.410.720.000.31
 equation image0.000.950.22
 equation image0.000.940.18
 equation image0.000.950.200.000.950.16
 equation image0.000.970.411.001.00
 equation image−0.010.930.571.001.00

The results of fitting the SEM and MEM to these data are given in Table 2. The results are similar to those we saw with a single latent variable, with more extreme biases associated with equation image using the MEM approach. The type-1 error rates in the SEM setting are 0.05 and 0.06 for equation image and equation image, respectively, while the type-1 error rates in the MEM setting are 0.04 and 0.05 for equation image and equation image, respectively.

Finally, we simulate data according to an SEM with two correlated latent variables, where the model is the same as described above for two uncorrelated latent variables, with the exception that it is assumed equation image. Data are simulated under two models specified by ψ12=0.1 and ψ12=0.2, where in both cases ψ12=0.4. These models correspond to correlations between latent variables of 0.25 and 0.5, respectively. The results are reported in Table 3. Again power is high under both the SEM and MEM, and similar bias is observed under the MEM for equation image. The corresponding type-1 error rates are 0.05 and 0.06 for both SEM parameters, equation image and equation image, and 0.04 and 0.05 for the two MEM variance parameters.

Table 3. Simulation results under SEM with two correlated latent variables.
DataTrue valueSEMsMEMs
1equation image0.000.930.280.370.000.24
 equation image0.000.950.20
 equation image0.000.940.20
 equation image0.000.950.180.000.930.15
 equation image0.000.940.421.001.00
 equation image−0.010.930.421.001.00
 equation image0.000.960.09
2equation image0.000.930.280.400.000.25
 equation image0.000.940.20
 equation image0.000.940.20
 equation image0.000.950.180.000.940.15
 equation image0.000.950.491.001.00
 equation image0.000.950.491.001.00
 equation image0.000.950.10

Next we simulate data under an MEM with a single clustering variable, as described by Eq. (4). We again assume S=4 SNPs, each coded as binary indicators with equation image and minimal correlation induced by the assumption equation image and equation image for s=1,2,3. A single continuous covariate equation image is generated and we set ν=0. The remaining parameters, equation image, equation image and γ are varied across the simulations as given in Table 4. The results of fitting SEMs and MEMs to these data are reported in Table 4. In this setting, a test of λy=0 has reduced power for detecting association. The type-1 error rates are 0.05 and 0.03 for the SEM and MEM approaches, respectively.

Table 4. Simulation results under MEM with single-level clustering.
DataTrue valueSEMsMEMs
1equation image0.120.420.210.000.960.18
 equation image0.65−0.010.940.331.00
 equation image0.000.930.130.000.940.13
2equation image0.
 equation image0.760.000.950.651.00
 equation image0.000.940.150.000.950.13
3equation image0.330.
 equation image0.77−0.020.950.941.00
 equation image0.000.950.150.000.940.12

Finally, we generate data according to a two-level clustering MEM, as described by Eq. (5). Here we again assume that we observe 4 SNPs in each of 2 genes. The corresponding results of fitting SEMs and MEMs to these data are given in Table 5. Again the power for the SEM approach to detect association based on a test of λy is relatively small in all cases while the power for the MEM approach is reasonable (>90%) for equation image. The estimated type-1 error rates are 0.03 and 0.05 for the SEM parameters and 0.01 for both of the MEM parameters.

Table 5. Simulation results under MEM with two-level clustering.
DataTrue valueSEMsMEMs
1equation image0.310.
 equation image0.65−0.010.980.450.97
 equation image0.68−0.010.960.440.97
 equation image0.000.940.15−0.010.940.25
2equation image0.590.
 equation image0.78−0.010.950.790.99
 equation image0.79−0.020.960.771.00
 equation image0.000.940.160.000.940.25
3equation image0.950.000.360.000.950.18
 equation image0.77−0.030.951.101.00
 equation image0.770.000.951.110.99
 equation image0.000.950.180.000.940.25
4equation image0.470.
 equation image0.73−0.010.970.480.97
 equation image0.74−0.020.960.731.00
 equation image0.000.960.150.000.960.25

3.2 Genetics of therapy-associated lipid abnormalities in HIV

In this section we apply the SEM and MEM frameworks to data arising from the New Works Concept Sheet (NWCS) 224 study, an investigation of genetic factors that contribute to anti-retroviral-associated dyslipidemia in HIV-1-infected individuals. This cross-sectional study is comprised of n=626 HIV-infected participants enrolled in 5 AIDS Clinical Trials Group (ACTG) trials who agreed to genetic testing. A complete discussion of the study design and patient demographics is given in Foulkes et al. (2006). Here we focus on 7 SNPs – rs1045642, rs2032582, rs2235035, rs11772987, rs10256836, rs9282564 and rs2157926 – within the ATP-binding cassette, sub-family B (MDR/TAP), member 1 (ABCB1) gene, a gene involved in transporting substrates, including Protease Inhibitors (PIs) across the cell membrane, and an additional 3 SNPs – rs2854117, rs4520 and rs2070669 – in apolipoprotein C-III (APOC3), a gene involved in inhibiting hepatic uptake of triglyceride-rich particles. All SNPs are treated as binary indicators for the presence of at least one variant allele. Interest is in characterizing association between these SNPs and high-density lipoprotein cholesterol (HDL-C). White/non-Hispanic, Hispanic and Black/non-Hispanic subjects (n=532) with complete data, including known drug exposure histories, are included in analysis.

The results of fitting unadjusted models are reported in Table 6. Here we consider three models: two single-gene models (that include either ABCB1 or APOC3) and one two-gene model (that includes both ABCB1 and APOC3). The SEM and MEM results are consistent with one another, suggesting that ABCB1 is associated with HDL-C, as measured by equation image (p=0.002 and p=0.003) in the SEM and equation image (p=0.02) in the MEM for both the single-gene and two-gene models. These effects are attenuated (and no longer statistically significant) in adjusted models and may represent spurious associations resulting from population-admixture, i.e. confounding by race/ethnicity and study site. Adjusted models also included PI exposure as a three-level factor – no current PI exposure; currently exposed to a non-RTV-containing PI regimen; and currently exposed to an RTV-containing PI regimen – gender, race/ethnicity and study.

Table 6. Application to study of therapy-associated lipid abnormalities in HIV.
 ABCB1 – modelAPOC3 – model(ABCB1, APOC3) – model
 Est (p-value)Est (p-value)Est (p-value)Est (p-value)Est (p-value)Est (p-value)
equation image−0.06 (0.002)0.02 (0.53)−0.06 (0.003); 0.01 (0.84)
equation image0.780.180.78, 0.28
equation image0.004 (0.02)0.00 (0.50)0.004 (0.02); 0.00 (0.50)
equation image0.0970.0960.0990.0990.0970.096

To further explore the results of this model fitting procedure, we consider equation image from the SEM and the empirical Bayes estimates of the random effects from the MEM. For illustration, we focus on the unadjusted model with the single-gene ABCB1. Based on the SEM, the relationship between the SNPs and the latent gene variable, ui, is estimated by equation image corresponding to rs1045642, rs2032582, rs2235035, rs11772987, rs10256836, rs9282564 and rs2157926, respectively. Associated p-values based on a Wald test are given by equation image where p-values of “0” are less than 1×10−8. Recall the first element of Λ is fixed at 1 for identifiability. These results suggest variant alleles at rs2032582 (equation image) and rs9282564 (equation image) are significantly positively associated with ui, while variant alleles at rs11772987 (equation image) and rs2157926 (equation image) are significantly negatively associated with ui. Further, equation image, suggesting an inverse relationship between ui and HDL-C.

In total there are 46 observed genotype groups and the corresponding empirical Bayes estimates based on the MEM range from −0.097 to 0.074, as illustrated in Figure 3. All corresponding 95% prediction intervals for the genotype groups cover zero, with the exception of the group with at least one variant allele at each of the three SNPs, rs1045642, rs2032582 and rs11772987 and homozygous for the common allele at all other SNPs. Membership to this group is inversely associated with HDL-C with a corresponding empirical Bayes estimate of −0.097 and a 95% prediction interval of (−0.185,−0.009).

Figure 3.

Empirical Bayes (EB) estimates of latent genotype group effects.

3.3 An alternative Bayesian network analysis framework

BN analysis is an alternative analysis framework that similarly aims to identify and characterize association among combinations of potential predictor variables and an observed trait. In this section, we briefly describe the application of one such approach, proposed by Malovini et al. (2009). This strategy is comprised of two stages: First, meta-variables are created based on fitting a classification or regression tree to the trait under study where SNPs and other potentially relevant variables are treated as potential predictor variables; and second, a BN is learned based on these meta-variables and the trait under study. Analysis is based on application of the rpart() and network() functions within the R rpart and deal packages, respectively.

The results of applying the approach of Malovini et al. (2009), to the SNPs within ABCB1 and APOC3 separately, and in combination, are provided in Fig. 4. To begin, we fitted a regression tree to the log transformed quantitative trait, HDL-C separately for each gene. For ABCB1, a single split is observed, as illustrated in Fig. 4A, where ABCB1.S7 corresponds to rs2157926, and a cost complexity parameter (cp) of 0.01 is applied for first-stage pruning. This constitutes the metavariable used for the second-stage BN analysis of ABCB1. For APOC3, no splits result in an increase of more than cp=0.01 in the overall R-squared value, and thus individual SNP variables for this gene are used in the second-stage BN analysis. The resulting directed acyclic graphs (DAGs) illustrated in Fig. 4B–D are consistent with the results presented in Table 6 based on the SEM and MEM analyses. Again, an association between ABCB1 and HDL-C is observed (Fig. 4B), while associations between SNPs within APOC3 and HDL-C are not detectable (Fig. 4C). The DAG based on both genes (Fig. 4D) additionally identifies an association between ABCB1 and the first SNP within APOC3.

Figure 4.

Bayesian networks of ABCB1, APOC3 and HDL-C. (A) Fitted regression tree using all SNPs within ABCB1 (coded as binary indicators for the presence of at least one variant allele) as potential predictor variables and log-transformed HDL-C as the outcome. (B) DAG based on ABCB1 metavariable and log-transformed HDLC. (C) DAGs with highest scores based on individual SNPs within APOC3 and log-transformed HDL-C. (D) DAG based on ABCB1 metavariable, individuals SNPs within APOC3 and log-transformed HDL-C.

Finally, as a case study, we applied the two-stage BN approach to a single randomly selected simulated data set from each of the scenarios (i.e. combinations of parameters) described in Tables 1–5. The results are presented in Fig. 5. In-line with the finding presented in Table 1 of consistently high power of the SEM and MEM under the SEM model with a single latent variable, Fig. 5A illustrates that all five corresponding DAGs identify association between the gene metavariable and the trait. Under the SEMs with two uncorrelated latent variables, the BN analysis consistently identifies at least one of the two gene metavariables; however, in 3 of the 5 cases only one gene is identified, as shown in Fig. 5B. The association between the two genes under the SEM with correlated latent variables is detected with higher correlation, as illustrated in Fig. 5C. Finally, the BN analysis appropriately identifies the gene metavariable for the MEM with single-level clustering, and, in 3 of the 4 cases, was able to identify both gene metavariables under the MEM with two-level clustering.

Figure 5.

Bayesian networks of simulated data. (A) DAGs based on data simulated according to a SEM with one latent variable where gene represents a metavariable based on a fitted regression tree. The 5 DAGs correspond to each combination of parameters given in Table 1. (B) DAGs based on data simulated according to a SEM with two uncorrelated latent variables where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 5 DAGs correspond to each combination of parameters given in Table 2. (C) DAGs based on data simulated according to a SEM with two correlated latent variables where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 2 DAGs correspond to each combination of parameters given in Table 3. (D) DAGs based on data simulated according to an MEM with single-level clustering where gene represents a metavariable based on a fitted regression tree. The 3 DAGs correspond to each combination of parameters given in Table 4. (E) DAGs based on data simulated according to an MEM with two-level clustering where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 4 DAGs correspond to each combination of parameters given in Table 5.

4 Discussion

In this manuscript we describe the application of two related latent variable modeling approaches, MEMs and SEMs, for identifying and characterizing genetic contributors to complex diseases. While these two frameworks have some important commonalities, several notable differences emerged during our investigation. These are highlighted by the assumptions listed in Section 2.3, under which the SEM reduces to the MEM as described for this setting. Importantly, in the context of an SEM, the test of association is based on a fixed effect coefficient (λy) relating the latent gene variable to the trait. In the MEM context, on the other hand, the test of association is based on a variance parameter (equation image) of the latent gene effects.

Interestingly, our simulation studies reveal that the performance of the two modeling approaches is comparable under the SEMs in terms of power and type-1 error rates; however, when the data arise from an MEM, power for the SEM approach is lower than the corresponding power for detecting association using the MEM approach. Also of note, when data are simulated under the SEM model, the estimates of the nuisance parameter, σe, under the MEM exhibit substantial bias. In these cases, the estimate of σb is also estimated poorly (results not shown). Notably, for Tables 1 and 2, varequation image and an estimate of this variance under the MEM is equation image. For example, for the first scenario in Table 1, we have equation image and the estimate under the MEM is equation image (results not shown). In general, these are not as closely aligned; however, there appears to be a tradeoff between the two parameters. In turn, estimation of γy depends on equation image. Further research on sensitivity to alternative underlying model specifications may further elucidate the relative merits of each approach. Specifically, SEMs may be more conducive to testing specific hypotheses involving multiple genes and their relationships to one another in a pathway to disease.

A notable limitation of both the MEM and SEM approaches is their potential inability to handle a large number of SNPs. In the context of the MEM, the number of genotype groups can become unwieldy as the number of SNPs increases, as described by Foulkes et al. (2008). Interestingly, inclusion of highly correlated SNPs in the MEM approach results in fewer genotype groups but does not otherwise effect model performance. Furthermore, our preliminary research suggests that iteratively sampling a subset of SNPs and fitting the MEM, and then combining results over the multiple samples (Efron, 1979, 1981; Good, 2005), performs reasonably well (results not shown) in terms of the overall power for detecting association in the setting of a large number of SNPs. The extension of the MEM approach to more than one clustering variable, as described in Section 2.2, also offers the advantage of reducing the total number of genotype groups. For example, if we observe r SNPs in one gene and s SNPs in a second gene, the original formulation of the MEM approach involves equation image genetic groups while the proposed extension involves only equation image genetic groups (across two clustering variables). Additional research is needed to evaluate the performance of this extended MEM approach with multiple SNPs across a larger number of genes.

We also presented the results of applying an alternative two-stage BN analysis approach to the NWCS224 data, as well as to randomly selected simulated data sets. In the real data example, we were unable to fit a regression tree with splits beyond the root node subject to the specified pruning criterion for one of the genes under study. In this case, the analysis reduced to fitting a BN to single SNP variables rather than metavariables as described by Malovini et al. (2009). On the other hand, for the ABCB1 gene, a metavariable did emerge, albeit based on a single SNP, and an association was detected. Importantly, the structure of association that CART is designed to detect, namely conditional associations (Foulkes, 2009), may explain why only a single SNP emerged while for the SEM four SNPs within this gene were identified as statistically relevant. Although our case studies of simulated data suggest that the BN analysis generally (though not always consistently) identified relevant genes, further investigation is needed to characterize the overall performance and type-1 error rates.

Finally, further research is needed to characterize the performance of both the SEM and MEM frameworks in the context of more complex structures of association and underlying genetic models. In a recent manuscript, we describe application of a mixture modeling approach that may be more appropriate than the single normal prior assumption on the random effects under a dominant or recessive genetic model (Au et al., 2010). While the inclusion of cross-classified clusters, as proposed in Section 2.2, allows for consideration of more SNPs through reduction in the number of genotype groups, this approach assumes an additive model of association. More generally, it is of interest to characterize interactions among genes in their effect on the trait under study, an area of on-going research.


The support for this research was provided by National Institutes of Health (NIH) individual research awards, NIH/NHLBI R01HL107196 and NIH/NIDDK R01DK021224.

Conflict of interest The authors have declared no conflict of interest.