Structural Equation Modeling of Gene–Environment Interactions in Coronary Heart Disease


Corresponding author: Kent M. Eskridge, Department of Statistics, University of Nebraska, Lincoln, Nebraska 68583-0963. Tel: (402) 472-7213; Fax: (402) 472-5179; E-mail:


Coronary heart disease (CHD) is a complex disease, which is influenced not only by genetic and environmental factors but also by gene–environment (GE) interactions in interconnected biological pathways or networks. The classical methods are inadequate for identifying GE interactions due to the complex relationships among risk factors, mediating risk factors (e.g., hypertension, blood lipids, and glucose), and CHD. Our aim was to develop a two-level structural equation model (SEM) to identify genes and GE interactions in the progress of CHD to take into account the causal structure among mediating risk factors and CHD (Level 1), and hierarchical family structure (Level 2). The method was applied to the Framingham Heart Study (FHS) Offspring Cohort data. Our approach has several advantages over classical methods: (1) it provides important insight into how genes and contributing factors affect CHD by investigating the direct, indirect, and total effects; and (2) it aids the development of biological models that more realistically reflect the complex biological pathways or networks. Using our method, we are able to detect GE interaction of SERPINE1 and body mass index (BMI) on CHD, which has not been reported. We conclude that SEM modeling of GE interaction can be applied in the analysis of complex epidemiological data sets.


Coronary heart disease (CHD) is the most common form of heart disease in most developed countries. Despite encouraging declines over the past decades, it remains the leading cause of death in the United States. Like other complex diseases, CHD is influenced by multiple risk factors that interconnect with one another in complicated ways (Talmud, 2004; 2006). For example, cigarette smoking may affect CHD directly by damaging the vascular endothelium, and indirectly by perturbing lipoprotein metabolism and increasing insulin resistance and lipid intolerance. Interactions between genetic and environmental risk factors can sometimes be more important than the values of the individual factors in determining the probability of developing disease (Yang & Khoury, 1997). Understanding this complex interplay of genes and environment will lead to new methods of disease detection and prevention. Studies have found GE interactions that contribute to the risk of CHD, but results have not been consistent across studies. For example, previous studies reported that the risk of CHD is influenced by an interaction between smoking and APOE polymorphism (Humphries et al., 2001; Djoussé et al., 2002; Stengard et al., 1995; Talmud et al., 2005), but two large well-designed studies have provided conflicting results (Keavney et al., 2003; Liu et al., 2003). Such inconsistencies across studies may be related to population differences, differences in the approaches used to assess GE interaction, confounding, low power, or sampling variation (see, e.g., Talmud, 2006).

The FHS, a population-based prospective study, began in 1948. The aim of the FHS is to identify the common factors or characteristics that contribute to cardiovascular disease (CVD) by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke. In 1971, a second-generation group, the Offspring Cohort, was enrolled and has been followed every 4 years since enrollment. In 2002, a third-generation cohort was also added. Over the years, the FHS and other epidemiological studies have produced many basic findings that remain important for understanding the causes of CVD, including genetic factors. The major risk factors, which are direct causes of CVD, include cigarette smoking, hypertension, cholesterol disorders, elevated glucose, and advancing age. The underlying risk factors (overweight/obesity, physical inactivity, diet, family history of premature CVD, and various genetic factors) affect CVD risk by acting through the major risk factors, and they also appear to influence risk in ways unrelated to the major risk factors. For example, familial influences on risk of CVD are mediated in part through blood pressure and blood lipoprotein levels (Grundy et al., 1998; Smith et al., 2004). Patients with diabetes are at high risk of hypertension and cholesterol disorders such as atherogenic dyslipidemia and high levels of low-density lipoprotein (LDL) cholesterol (Grundy et al., 2002). A clustering of risk factors including obesity, atherogenic dyslipidemia, and elevated blood pressure and glucose (metabolic syndrome) is under the pleiotropic control of several loci (Stein et al., 2003). Blood glucose (BG) is strongly related to blood pressure, positively related to triglyceride, and negatively related to high-density lipoprotein (HDL) (Cambien et al., 1987). We now have well-established risk factors for CVD that are interconnected in a complex biological system.

Despite the complexity of development of CHD, GE interactions are frequently evaluated using (1) contingency table analysis, considering one or two exposures or genes at a time, singly or in pairwise combinations, thereby neglecting the potential confounding effects of other factors or more complex interactions (Botto & Khoury, 2001); and (2) multiple regression analysis (univariate approaches), including all variables in a single model (Djoussé et al., 2002; Humphries et al., 2001; Talmud, 2004), ignoring indirect effects through other mediating risk factors and pleiotropic effects, and as a result, neglecting an important source of information that could provide additional insight into GE interactions. Several joint analyses on multiple traits have been developed to take into account the correlation or causal relationship among multiple traits in genetic association studies. These methods have been shown to improve the power of association tests and precision of parameter estimates (Jiang & Zeng, 1995; Zhu & Zhang, 2009). Therefore, a joint analysis on mediating risk factors and CHD is needed to reflect the underlying roles of risk factors for the disease.

Structural equation model (SEM) is a generalization of simultaneous equation procedures originating from path analysis (Wright, 1921) and initially popularized in econometrics and genetics. Recently, it has been used to examine GE interactions for twin data (Rijsdijk & Sham, 2002; McCaffery et al., 2009), and applied to functionally related traits in genetic research with the goal of characterizing genetic architecture precisely and intuitively (Nadeau et al., 2003; Gianola & Sorensen, 2004; Li et al., 2006; Neto et al., 2008; Zhu & Zhang, 2009). Nock et al. (2009) studied the genetic determinants of Metabolic Syndrome in the Framingham Heart Study (FHS) with conventional SEM, where parameters were estimated by minimizing the difference between the observed covariance matrix and the model-predicted covariance matrix assuming multinormally distributed continuous variables. Their method, however, did not focus on the study of GE interaction and categorical outcomes, and was different from our approach. SEM is a useful method for estimating and evaluating simultaneous causal relationships among variables, which allows variables to be both dependents and predictors. In particular, SEM allows researchers to decompose the effects of one variable on another into direct, indirect, and total effects. Direct effects are the influence of one variable on another that are not causally explained by any other intermediary variable. Indirect effects are relationships that can be explained by at least one other intervening variable, and the total effect is the sum of direct and indirect effects. By explicitly accounting for the underlying roles of the risk factors for CHD, SEM can provide more insight and a richer understanding of how the risk factors influence CHD.

The FHS data consisted of family structure information, genotypic information, and long-term phenotypic information. Full sibs share genes and often live under common environmental conditions for a period of their lives, which makes them a good source for a study of GE interaction. The objectives of this study were (1) to use the FHS Offspring Cohort data to identify GE interactions in the progression of CHD using a generalized two-level SEM that accounts for the causal structure among mediating risk factors and CHD, and (2) to compare the proposed method with a classical univariate approach (logistic regression) in terms of identification of GE interactions.


Generalized Two-Level SEM

In this section, we formulate the statistical model for the analyses of GE interactions in complex diseases using a generalized two-level SEM, which is a synthesis of a two-level regression model and a SEM. The two-level regression model is used to account for the hierarchical data structure with individuals (Level 1), which are nested in families (Level 2). The random intercepts represent heterogeneity between families in the overall response. SEM is a multivariate statistical technique that often includes estimation of the effects of unobserved or latent constructs (the measurement model) and of the structural relationships among these latent constructs (the structural model). In this paper, estimating relations among the mediating risk factors and CHD (the structural part of the model) is our primary interest, and every variable in the model is directly measured or observed. When the dependent variables are categorical, the conventional SEM needs to be modified by creating “latent responses” or “underlying variables” that are linked to observed categorical responses via threshold models, then a conventional SEM is specified for multivariate normal latent responses. Note that models with categorical dependent variables require model estimation procedures that are different and considerably more complex than for conventional SEM (Skrondal & Rabe-Hesketh, 2004; Rabe-Hesketh et al., 2004).

Suppose that we have a sample of individuals clustered in N families with observations on p traits, which may consist of both dichotomous (e.g., presence/absence of disease) and continuous (e.g., lipid level or BG) responses, along with a number of environmental and genetic risk factors. Let y*ij be a (p× 1) vector of continuous, possibly latent response variables for the jth subject in the ith family, i= 1…N, j= 1…ni, and let xij be a (q× 1) vector of independent variables for which no distributional specification is imposed. The statistical model for a subject in a family is formulated with individuals at the first level and families at the second level in terms of the between-family and within-family covariance matrices, ΣB and ΣW, respectively (Longford & Muthén, 1992; Poon & Lee, 1992).

In the context of SEM, an interaction between observed variables is handled by creating an interaction variable via multiplication of two variables. It is tested by adding the interaction variable to the model. The presence of a significant interaction indicates that the effect of one predictor variable on the response variable is different at different values of the other predictor variable.
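As a minimal sketch of this construction, the interaction covariate is the element-wise product of a genotype code and an environmental measure. The numeric genotype coding and all values below are hypothetical and are not taken from the FHS data.

```python
# Hypothetical genotype codes (for example, 0/1/2) and BMI values for four subjects.
genotype = [0, 1, 2, 1]
bmi = [24.0, 31.5, 27.2, 22.8]

def interaction_term(g, e):
    """Return the element-wise product g*e used as the GE interaction covariate."""
    if len(g) != len(e):
        raise ValueError("genotype and environment vectors must have equal length")
    return [gi * ei for gi, ei in zip(g, e)]

ge = interaction_term(genotype, bmi)

# Each subject's predictor vector then holds the genotype, the environmental
# covariate, and their product.
x = [[gi, ei, gei] for gi, ei, gei in zip(genotype, bmi, ge)]
print(x)
```

Testing the interaction then amounts to testing whether the coefficient on the third column is zero.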

Threshold Model

A dichotomous observed response (such as CHD or presence/absence of hypertension) yijk (k = 1, 2, …, p) is related to an unobserved latent continuous response y*ijk via a threshold model (Muthen, 1984) described as follows:

$$ y_{ijk} = \begin{cases} 1 & \text{if } y^{*}_{ijk} > \tau_k \\ 0 & \text{otherwise} \end{cases} \tag{1} $$

where τk is a threshold parameter and pr (yijk = 1|xij) = pr (y*ijk > τk |xij).

For a continuous observed response, y*ijk is directly observed,

$$ y^{*}_{ijk} = y_{ijk}. $$
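The threshold idea can be illustrated with a short simulation. This is only a sketch: the threshold, the conditional mean, and the unit-variance normal residual are assumed values chosen for illustration.

```python
import math
import random

def observe(y_star, tau):
    """Threshold model: return 1 when the latent response exceeds tau, else 0."""
    return 1 if y_star > tau else 0

random.seed(0)
tau = 0.5    # hypothetical threshold
mean = 0.2   # hypothetical mean of the latent response given x

# Simulate latent responses with a standard normal residual, then dichotomize.
latent = [random.gauss(mean, 1.0) for _ in range(100_000)]
share = sum(observe(v, tau) for v in latent) / len(latent)

# Closed form for comparison: pr(y* > tau | x) = 1 - Phi(tau - mean).
exact = 0.5 * math.erfc((tau - mean) / math.sqrt(2.0))
print(round(share, 3), round(exact, 3))
```

The simulated share of ones should sit close to the closed-form probability, which is the sense in which the observed dichotomy carries information about the latent mean.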


Structural Model

Within-family model—Level 1

The linear structural model can be specified as

$$ y^{*}_{ij} = v_i + B y^{*}_{ij} + \Gamma x_{ij} + \zeta_{ij} \tag{2} $$

where the intercept vi is a (p× 1) vector of means of underlying responses over all individuals in the ith family; B is a (p×p) matrix of structural parameters that describes the causal relationships among the p latent responses, where (I − B)−1 exists; Γ is a (p×q) coefficient matrix that describes the causal relationships between the latent responses and the predictor variables; xij is a (q× 1) vector of observed predictor variables (categorical or continuous), which includes the genotypic covariates, environmental covariates, and the GE interactions created as the cross-products of genotypic and environmental covariates; and ζij is a (p× 1) vector of residuals, which is assumed to be multivariate normally distributed with mean vector zero and covariance matrix Ψ. Note that this model is conditional on the observed predictor variables xij. Unlike conventional SEMs, where all observed variables are treated as responses, there are no distributional assumptions regarding xij (Muthen, 1984; Skrondal & Rabe-Hesketh, 2004). To account for the correlation of individuals within a family, the intercept was allowed to vary across families and was defined at the second level.

Family random-intercept model—Level 2

Define the family-level model as

$$ v_i = \gamma + \xi_i \tag{3} $$

where vi is a (p× 1) vector of means of underlying responses in the ith family, which represents heterogeneity between families in the overall response; γ is a (p× 1) vector of overall means of underlying responses; and ξi is a (p× 1) vector of residuals, assumed to be multivariate normally distributed with mean vector zero and covariance matrix Θ. The random effects at different levels of the model are assumed to be independent.

The model for the Level-1 units (jth individual) and Level-2 units (ith family) can be written by substituting equation (3) into (2) and solving for the reduced form, which gives a generalized two-level SEM

$$ y^{*}_{ij} = (I - B)^{-1}(\gamma + \Gamma x_{ij} + \xi_i + \zeta_{ij}) \tag{4} $$

The between-family and within-family covariance matrices are derived as ΣB = (I − B)−1Θ[(I − B)−1]′ and ΣW = (I − B)−1Ψ[(I − B)−1]′. The expected value and covariance matrix of y*ij are derived as

$$ \mu_{ij} = E(y^{*}_{ij} \mid x_{ij}) = (I - B)^{-1}(\gamma + \Gamma x_{ij}) \tag{5} $$

$$ \Sigma = \Sigma_B + \Sigma_W = (I - B)^{-1}(\Theta + \Psi)[(I - B)^{-1}]' \tag{6} $$

where μij is the implied mean vector of the endogenous variables and Σ is the implied variance–covariance matrix among the endogenous variables under the SEM. Both μij and Σ contain parameters to be estimated.
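The reduced-form quantities described above can be sketched numerically. The matrices below are hypothetical (three responses, two predictors, a recursive B) and are not estimates from this study; NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical recursive model with p = 3 responses and q = 2 predictors.
B = np.array([[0.0, 0.0, 0.0],
              [0.4, 0.0, 0.0],
              [0.3, 0.5, 0.0]])      # B[r, c]: effect of response c on response r
Gamma = np.array([[0.2, 0.0],
                  [0.0, 0.1],
                  [0.3, 0.0]])       # effects of predictors on responses
gamma = np.array([1.0, 0.5, -2.0])   # overall intercepts
Theta = np.diag([0.10, 0.05, 0.20])  # between-family residual covariance
Psi = np.diag([1.0, 1.0, 1.0])       # within-family residual covariance

A = np.linalg.inv(np.eye(3) - B)     # (I - B)^{-1}
Sigma_B = A @ Theta @ A.T            # between-family covariance
Sigma_W = A @ Psi @ A.T              # within-family covariance
Sigma = Sigma_B + Sigma_W            # implied covariance of the latent responses

x = np.array([1.0, 27.5])            # one subject's predictor vector
mu = A @ (gamma + Gamma @ x)         # implied mean vector for that subject
print(mu)
print(Sigma)
```

Because the family and individual residuals are assumed independent, the implied covariance is simply the sum of the between-family and within-family parts.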


The model described in (5) and (6) is unlike the conventional SEM, where parameters are estimated by minimizing the difference between the observed covariance matrix and the model-predicted covariance matrix assuming multinormally distributed continuous variables. Instead, the likelihood of the observed data must be obtained by “integrating out” the latent responses y*ij. Let θ be the vector of all parameters, including the regression coefficients, the threshold parameters τk for dichotomous variables, and the nonduplicated elements of the covariance matrix. Let yi be the observed response vector and xi the vector of predictor variables for all subjects in the ith family. Let y and X be the response vector and matrix of predictor variables for all subjects. Given equations (5) and (6), the latent responses y*ij follow a multivariate normal distribution with mean μij and covariance matrix Σ. We denote the multivariate normal density of the latent responses at the family level as hi(y*i; θ). The marginal likelihood is constructed recursively. The conditional density of the observed variables at the family level, yi, conditional on the latent variables y*i, is (Rabe-Hesketh et al., 2004)

$$ f_i(y_i \mid y^{*}_i; \theta) = \prod_{j=1}^{n_i} f(y_{ij} \mid y^{*}_{ij}; \theta) $$

where the product is over all subjects within the family. Since the N families are assumed to be sampled independently, the total marginal likelihood is the product of the contributions from all families,

$$ L(\theta; y, X) = \prod_{i=1}^{N} \int f_i(y_i \mid y^{*}_i; \theta)\, h_i(y^{*}_i; \theta)\, dy^{*}_i $$

Skrondal and Rabe-Hesketh (2004) provide an extensive overview of estimation methods for SEMs with noncontinuous variables and related models. Muthen & Satorra (1996) developed a general three-stage procedure to obtain estimates, standard errors, and a χ2 measure of fit for a given structural model with a mixture of dichotomous, ordered categorical, and continuous measures of latent variables. This estimation approach was computationally efficient and was implemented in Mplus (Muthen & Muthen, 2004).
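To make the idea of integrating out latent quantities concrete, the following sketch evaluates the marginal likelihood for a much simpler case than the model above: one dichotomous response per subject, a probit-type threshold link, and a single normally distributed family random intercept, integrated numerically by Gauss–Hermite quadrature. It is an illustration under these stated assumptions, not the estimation procedure implemented in Mplus, and all input values are hypothetical.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def family_likelihood(y, eta, sigma, n_nodes=20):
    """Marginal likelihood of one family's 0/1 responses.

    y     : list of 0/1 responses for the family members
    eta   : list of fixed-part linear predictors (hypothetical values)
    sigma : standard deviation of the family random intercept
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for t, w in zip(nodes, weights):
        xi = math.sqrt(2.0) * sigma * t      # rescaled quadrature node
        cond = 1.0
        for yj, ej in zip(y, eta):           # product over family members
            p = norm_cdf(ej + xi)
            cond *= p if yj == 1 else 1.0 - p
        total += w * cond
    return total / math.sqrt(math.pi)

def log_likelihood(families, sigma):
    """Sum of log family contributions (families are independent)."""
    return sum(math.log(family_likelihood(y, eta, sigma)) for y, eta in families)

families = [([1, 0], [0.3, -0.2]), ([0, 0, 1], [-1.0, -0.5, 0.1])]
print(log_likelihood(families, sigma=0.8))
```

The inner loop is the within-family product and the outer sum is the numerical integral over the family-level random effect; the full model replaces the scalar intercept with a vector and adds the structural relations among responses.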

The likelihood ratio (LR) statistic

$$ LR = -2\left[\log L(\hat{\theta}_r) - \log L(\hat{\theta}_f)\right] $$

is used for hypothesis tests and model selection, where $\hat{\theta}_r$ is the maximum likelihood (ML) estimator under the reduced model and $\hat{\theta}_f$ is the ML estimator under the full model. The LR statistic has an asymptotic χ2 distribution when the reduced model is true. Its degrees of freedom equal the difference in the degrees of freedom of the two models (Bollen, 1989).
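For the common case in which the reduced model constrains a single coefficient to zero, the test can be sketched with the standard library alone. The log-likelihood values below are hypothetical, and the closed-form tail probability used here holds only for one degree of freedom.

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio statistic and its chi-square(1) p-value."""
    lr = -2.0 * (loglik_reduced - loglik_full)
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value

# Hypothetical log-likelihoods of a reduced and a full model.
lr, p = lr_test_1df(loglik_reduced=-1502.4, loglik_full=-1500.1)
print(round(lr, 2), round(p, 4))
```

A small p-value indicates that constraining the coefficient to zero worsens the fit more than chance would explain, so the term would be retained.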

Study Population

For this study, a cross-sectional analysis was conducted using the Offspring Cohort data of the FHS. Subjects who participated in the fifth examination cycle were selected. The parents of the Offspring were excluded. The analyses were based on the 966 participants with complete data on family structure and the variables used. The data consisted of 444 full sibships from 279 families. Of these families, only nine contained half siblings, so we considered only the correlations among full siblings; the correlations among individuals in the same family but with one or two different parents were ignored in this analysis. Individuals were used as Level-1 units and full sibships were used as Level-2 units of the hierarchical family structure. The endpoint was CHD, which was defined as present if subjects had a diagnosis of angina pectoris, myocardial infarction, coronary insufficiency, or sudden or nonsudden death. Other phenotypes of interest were hypertension, the ratio of total cholesterol to high-density lipoprotein cholesterol (TC/HDL-c), and fasting BG. Hypertension was defined as systolic BP ≥ 140 mmHg or diastolic BP ≥ 90 mmHg, or current use of medication to lower high BP. Two readings obtained by the physician and one reading obtained by the nurse were averaged to calculate systolic and diastolic BP values. Genes of interest included apolipoprotein E (APOE) and the single-nucleotide polymorphism (SNP) rs1799768 in the human plasminogen activator inhibitor-1 gene (SERPINE1, previously PAI-1), because previous studies have shown that these genes were associated with CHD. As in the majority of previous studies, we grouped carriers of the ɛ2 allele as those who had genotype E2/E2 or E2/E3, the ɛ3 group as those who had genotype E3/E3, and carriers of the ɛ4 allele as those who had genotype E3/E4 or E4/E4. SERPINE1 was classified into three possible genotypes: 4G/4G, 4G/5G, and 5G/5G.
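The phenotype and genotype definitions above can be expressed as a small sketch. The function and variable names are hypothetical, and genotypes that the text does not assign to a group (for example E2/E4) are deliberately left unclassified.

```python
def mean_bp(readings):
    """Average the blood pressure readings (two by the physician, one by the nurse)."""
    return sum(readings) / len(readings)

def is_hypertensive(systolic_readings, diastolic_readings, on_bp_medication):
    """Systolic BP >= 140 mmHg, or diastolic BP >= 90 mmHg, or on BP-lowering medication."""
    sbp = mean_bp(systolic_readings)
    dbp = mean_bp(diastolic_readings)
    return sbp >= 140 or dbp >= 90 or on_bp_medication

# Grouping of APOE genotypes as described in the text.
APOE_GROUPS = {
    "E2/E2": "e2", "E2/E3": "e2",
    "E3/E3": "e3",
    "E3/E4": "e4", "E4/E4": "e4",
}

def apoe_group(genotype):
    """Return the allele group, or None for genotypes the text does not assign."""
    return APOE_GROUPS.get(genotype)

print(is_hypertensive([138, 142, 141], [80, 82, 84], False), apoe_group("E3/E4"))
```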
Following the theory of the development of CHD and scientific research (Grundy, 1999; Grundy et al., 2002; Smith et al., 2004) discussed in the previous section, we developed the conceptual model presented in Figure 1. The model illustrates the relationships among the mediating risk factors hypertension, BG, TC/HDL-c, and endpoint CHD. The covariates of interest were age, gender, body mass index (BMI), number of cigarettes per day, and alcohol consumption (oz/week).

Figure 1.

Theoretical relationships among these four phenotypes.

Statistical Analyses

Our proposed two-level SEM was implemented using Mplus Version 5.1 software (Muthen & Muthen, 2004) (TYPE = TWOLEVEL; ESTIMATOR = ML; ALGORITHM = INTEGRATION) to account for the relationships among the mediating risk factors hypertension, BG, TC/HDL-c, and the endpoint CHD, and for the hierarchical family structure with individuals at the first level and full sibships at the second level. The backward elimination procedure was performed in two steps. First, we started with the saturated model (full model), which was built by fitting the model with age, sex, BMI, cigarettes per day, alcohol, APOE, and SERPINE1 as explanatory variables to predict the four related phenotypes (Fig. 1). A reduced model was then fit in which the regression coefficient of the least important variable (the one with the largest P-value) was constrained to zero. The LR test statistic was obtained as LR = −2(loglikelihood_reduced − loglikelihood_full), which is approximately χ2 distributed with one degree of freedom. This procedure was repeated until all the remaining variables were significant (P < 0.05). Second, to determine which GE interaction terms should be included in the final model, the backward elimination procedure was performed starting with all possible interactions between the explanatory variables identified in the first step. The overall goodness-of-fit of a model can be evaluated by the χ2 test, comparative fit index (CFI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). However, these model fit indices are not available for models with categorical responses in Mplus. Instead, we used the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) fit statistics and significance tests of path coefficients to assess the overall goodness-of-fit. Smaller AIC and BIC values and significant path coefficients indicated that the model was acceptable and a good fit.
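The information criteria mentioned above follow standard definitions, sketched here. The log-likelihoods and parameter counts are hypothetical, and using the number of participants (966) as the sample size in BIC is an assumption; in two-level models the appropriate sample size is a matter of convention.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical log-likelihoods and parameter counts for two candidate models.
candidates = {"full": (-1500.1, 30), "reduced": (-1502.4, 29)}
n_obs = 966
for name, (ll, k) in candidates.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n_obs), 1))
```

Between candidate models fitted to the same data, the one with the smaller criterion value is preferred.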


Characteristics of the study population are shown in Table 1. A total of 496 men and 470 women were assessed, of whom 62 (12.5%) men and 23 (4.9%) women had CHD as previously defined. Their clinical characteristics according to SERPINE1 genotypes and APOE genotypes are summarized in Tables 2 and 3. The ɛ4 carriers had higher levels of total cholesterol and a higher ratio of total cholesterol to HDL cholesterol; the ɛ2 carriers had lower levels of total cholesterol and a lower ratio of total cholesterol to HDL cholesterol (Table 3).

Table 1.  Characteristics of the study population.

| Characteristic | Male (N = 496): with CHD (N = 62) | Male: without CHD (N = 434) | Male: P-value* | Female (N = 470): with CHD (N = 23) | Female: without CHD (N = 447) | Female: P-value* |
|---|---|---|---|---|---|---|
| Age | 58.7 ± 9.3 | 51.1 ± 10.0 | <.0001 | 58.4 ± 9.5 | 51.9 ± 9.9 | 0.002 |
| Body mass index (kg/m2) | 29.1 ± 3.8 | 28.1 ± 4.0 | 0.082 | 28.2 ± 5.5 | 26.8 ± 5.84 | 0.2651 |
| Cigarettes per day | 6.1 ± 12.5 | 4.0 ± 10.3 | 0.1358 | 6.8 ± 12.5 | 3.9 ± 9.3 | 0.1572 |
| Total alcohol per week (oz) | 3.0 ± 5.5 | 3.7 ± 4.54 | 0.254 | 1.1 ± 2.2 | 1.7 ± 2.5 | 0.3169 |
| Systolic blood pressure (mmHg) | 129.7 ± 18.6 | 126.5 ± 15.4 | 0.1386 | 132.8 ± 19.7 | 120.4 ± 18.0 | 0.0015 |
| Diastolic blood pressure (mmHg) | 76.9 ± 8.7 | 77.4 ± 8.9 | 0.641 | 75.9 ± 9.84 | 71.9 ± 9.3 | 0.0448 |
| Hypertension (%) | 43.6 | 26.3 | 0.0048 | 47.8 | 22.6 | 0.0056 |
| Antihypertensive treatment (%) | | | <.0001 | | | |
| Blood glucose (mg/dL) | 112.1 ± 37.9 | 101.9 ± 26.3 | 0.0077 | 106.1 ± 34.1 | 97.8 ± 29.1 | 0.1863 |
| Total cholesterol (mg/dL) | 202.1 ± 35.1 | 202.0 ± 35.2 | 0.9865 | 225.7 ± 36.9 | 202.8 ± 37.9 | 0.0048 |
| HDL cholesterol (mg/dL) | 38.5 ± 9.7 | 43.5 ± 11.3 | 0.001 | 48.7 ± 14.5 | 56.2 ± 14.1 | 0.0134 |
| TC/HDL-c | 5.7 ± 2.4 | 4.9 ± 1.4 | 0.0004 | 4.9 ± 1.4 | 3.8 ± 1.3 | <.0001 |

*P-values were obtained by use of ANOVA for continuous variables and χ2 for categorical variables.

Table 2.  Characteristics of the study population by SERPINE1 genotypes.

| Characteristic | 4G/4G (N = 267) | 4G/5G (N = 491) | 5G/5G (N = 208) | P-value* |
|---|---|---|---|---|
| Age | 51.4 ± 9.5 | 52.4 ± 10.4 | 52.3 ± 10.02 | 0.4372 |
| Body mass index (kg/m2) | 27.8 ± 4.8 | 27.5 ± 5.1 | 27.5 ± 4.9 | 0.6821 |
| Cigarettes per day | 4.6 ± 10.8 | 3.6 ± 9.4 | 4.9 ± 10.6 | 0.1861 |
| Total alcohol per week (oz) | 2.6 ± 3.9 | 2.7 ± 3.9 | 2.8 ± 3.8 | 0.8177 |
| Systolic blood pressure (mmHg) | 123.3 ± 16.64 | 124.6 ± 17.8 | 123.8 ± 17.0 | 0.6299 |
| Diastolic blood pressure (mmHg) | 74.9 ± 9.3 | 74.8 ± 9.4 | 74.5 ± 9.9 | 0.9060 |
| Hypertension (%) | 25.8 | 27.9 | 22.6 | 0.3411 |
| Antihypertensive treatment (%) | 13.1 | 14.3 | 12.0 | 0.7166 |
| CHD (%) | | | | |
| Blood glucose (mg/dL) | 102.89 ± 32.9 | 100.3 ± 27.3 | 99.29 ± 26.9 | 0.3466 |
| Total cholesterol (mg/dL) | 201.6 ± 36.2 | 202.4 ± 36.9 | 206.0 ± 36.4 | 0.3767 |
| HDL cholesterol (mg/dL) | 47.7 ± 13.1 | 49.3 ± 14.5 | 50.6 ± 15.3 | 0.0872 |
| TC/HDL-c | 4.61 ± 1.5 | 4.5 ± 1.6 | 4.4 ± 1.4 | 0.5938 |

*P-values were obtained by use of ANOVA for continuous variables and χ2 for categorical variables.

Table 3.  Characteristics of the study population by APOE genotypes.

| Characteristic | E2 (N = 123) | E3 (N = 638) | E4 (N = 205) | P-value* |
|---|---|---|---|---|
| Age | 52.6 ± 9.9 | 52.2 ± 10.4 | 51.4 ± 9.2 | 0.5115 |
| Body mass index (kg/m2) | 27.6 ± 4.6 | 27.7 ± 5.0 | 27.2 ± 5.0 | 0.4812 |
| Cigarettes per day | 5.2 ± 10.2 | 4.3 ± 10.4 | 2.9 ± 8.7 | 0.0971 |
| Total alcohol per week (oz) | 2.6 ± 4.1 | 2.6 ± 3.86 | 3.0 ± 3.9 | 0.4369 |
| Systolic blood pressure (mmHg) | 123.0 ± 16.2 | 125.0 ± 17.4 | 121.7 ± 17.4 | 0.0476 |
| Diastolic blood pressure (mmHg) | 74.7 ± 9.3 | 74.9 ± 9.4 | 74.4 ± 10.1 | 0.7457 |
| Hypertension (%) | 25.2 | 27.9 | 21.5 | 0.1831 |
| Antihypertensive treatment (%) | 13.8 | 13.8 | 12.2 | 0.8369 |
| CHD (%) | | | | |
| Blood glucose (mg/dL) | 101.3 ± 30.8 | 100.7 ± 28.0 | 100.7 ± 30.4 | 0.9782 |
| Total cholesterol (mg/dL) | 186.2 ± 36.8 | 204.4 ± 35.7 | 208.6 ± 36.5 | <.0001 |
| HDL cholesterol (mg/dL) | 50.1 ± 14.0 | 49.5 ± 14.7 | 47.6 ± 13.3 | 0.212 |
| TC/HDL-c | 4.0 ± 1.4 | 4.5 ± 1.6 | 4.7 ± 1.4 | 0.0006 |

*P-values were obtained by use of ANOVA for continuous variables and χ2 for categorical variables. Genotypes E2/E2 and E2/E3 were grouped as E2, genotypes E3/E4 and E4/E4 were grouped as E4, and genotype E3/E3 as E3.

Figure 2 shows the estimates of path coefficients of the final two-level SEM, including gene–environment (GE) interaction, in the development of CHD. A path coefficient is a standardized regression coefficient (beta) showing the direct effect of an independent variable on a dependent variable in the path model. Increasing age, cigarette smoking, being male, hypertension, and high TC/HDL-c are major risk factors that influence CHD risk directly. There were significant differences in TC/HDL-c levels among the APOE genotype groups (P < 0.0001). The ɛ4 carriers had a higher and the ɛ2 carriers a lower TC/HDL-c ratio compared with ɛ3 carriers. We observed a significant effect of the interaction between number of cigarettes per day and the APOE genotype (β=−0.02, P= 0.0064), implying that the APOE genotypes influenced the response of lipid level to the number of cigarettes per day. Additionally, the ɛ2 carriers showed a greater (slope) response to cigarettes than the other two genotypes. APOE and APOE× cigarettes/day influenced the risk of CHD through the mediating risk factor TC/HDL-c. Although no significant effect of SERPINE1 was found, we did find a significant interaction between SERPINE1 and BMI for CHD through BG level (β=−0.502, P= 0.0244). The 4G/4G carriers had the greatest response of BG level to BMI. We also observed a significant effect of alcohol consumption on TC/HDL-c, which ultimately reduced the CHD risk.

Figure 2.

Path estimates of SEM of gene by environment interaction in the development of CHD. The model fitted the data well, and all the path coefficients were significant. Single arrows indicate causal relationships. Numbers by the arrow lines represent the estimated coefficients with significance level: ***P < 0.001; **P < 0.01; *P < 0.05.

Table 4 presents the estimates of direct, indirect, and total effects on the development of CHD, together with CHD odds ratios for the major risk factors that influence CHD directly, obtained using the two-level SEM. The path coefficients were used to calculate the indirect and total effects. For example, the indirect effects of SERPINE1× BMI on CHD were calculated by multiplying the path coefficients along each path from SERPINE1× BMI to CHD and summing the products: SERPINE1× BMI -> BG -> Hypertension -> CHD is −0.525 × 0.005 × 0.393 =−0.00103, and SERPINE1× BMI -> BG -> TC/HDL-c -> CHD is −0.525 × 0.009 × 0.28 =−0.00137. Hence, the total indirect effect of SERPINE1× BMI on CHD is the sum of these indirect effects ((−0.00103) + (−0.00137) =−0.0024). The total effect of SERPINE1× BMI on CHD is the sum of its direct and indirect effects (0 + (−0.0024) =−0.0024). Hypertension and high TC/HDL-c are major risk factors that influence CHD directly. The estimated odds of developing CHD for hypertensive subjects are 1.481 (95% confidence limits (CL): 1.021, 2.150) times those for nonhypertensive subjects. The estimated odds of CHD are multiplied by 1.323 (95% CL: 1.131, 1.548) for each 1-unit increase in TC/HDL-c. Being male, increasing age, and number of cigarettes per day are also major risk factors that influence CHD both directly and indirectly through mediating risk factors such as TC/HDL-c, hypertension, and BG. Alcohol consumption, BMI, APOE, APOE× cigarettes/day, SERPINE1, and SERPINE1× BMI affect CHD only through mediating risk factors. These results show that our two-level SEM analysis method provides additional information on how the risk factors affect CHD both directly and indirectly.
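The path-tracing arithmetic in this example can be sketched as follows, using the rounded path coefficients quoted in the text. Because those coefficients are rounded, an individual product may differ from the reported value in the later decimals; the node labels are shorthand chosen for this sketch.

```python
# Path coefficients as quoted in the text (rounded values).
paths = {
    ("SERPINE1xBMI", "BG"): -0.525,
    ("BG", "Hypertension"): 0.005,
    ("BG", "TC/HDL-c"): 0.009,
    ("Hypertension", "CHD"): 0.393,
    ("TC/HDL-c", "CHD"): 0.28,
}

def path_product(route):
    """Multiply the coefficients along one route, given as a list of node names."""
    product = 1.0
    for src, dst in zip(route, route[1:]):
        product *= paths[(src, dst)]
    return product

routes = [
    ["SERPINE1xBMI", "BG", "Hypertension", "CHD"],
    ["SERPINE1xBMI", "BG", "TC/HDL-c", "CHD"],
]
indirect = sum(path_product(r) for r in routes)  # sum over all indirect routes
direct = 0.0                                     # no direct path to CHD in the model
total = direct + indirect
print(round(indirect, 4), round(total, 4))
```

The same routine applies to any other covariate once its routes to CHD are listed.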

Table 4.  Estimated risk effects on CHD and CHD odds ratios for direct effects using a generalized two-level SEM, the Framingham Offspring Study.

| Risk factor | Direct effect | Odds ratio for direct effect (95% CL) | Indirect effect | Total effect |
|---|---|---|---|---|
| Hypertension | 0.3930 | 1.481 (1.021, 2.150) | | 0.3930 |
| TC/HDL-c | 0.2800 | 1.323 (1.131, 1.548) | | 0.2800 |
| Blood glucose (mg/dL) | | | 0.0045 | 0.0045 |
| Sex | −0.9110 | 0.402 (0.224, 0.721) | −0.3002 | −1.2112 |
| Age | 0.0780 | 1.081 (1.048, 1.116) | 0.0472 | 0.1252 |
| Cigarettes per day | 0.0280 | 1.028 (1.004, 1.053) | 0.0185 | 0.0465 |
| Total alcohol per week (oz) | | | −0.0106 | −0.0106 |
| Body mass index (BMI) (kg/m2) | | | 0.0963 | 0.0963 |
| APOE | | | 0.1221 | 0.1221 |
| APOE× cigarettes/day | | | −0.0064 | −0.0064 |
| SERPINE1 | | | 0.0573 | 0.0573 |
| SERPINE1× BMI | | | −0.0024 | −0.0024 |

As a comparison with a standard approach, Table 5 shows the summary results of a univariate logistic regression, which has only one dependent variable, CHD. We observed only three significant effects on CHD: TC/HDL-c, age, and sex. No significant effects of genes or gene by environment interactions were found, most likely because these effects were small. The estimated CHD odds ratio is 1.286 (95% CL: 1.091, 1.515) for a 1-unit increase in TC/HDL-c. The estimated odds of developing CHD for females are 0.374 (95% CL: 0.205, 0.686) times those for males. The estimated CHD odds ratio is 1.082 (95% CL: 1.047, 1.119) for a 1-year increase in age. The major differences between our proposed two-level SEM method and the univariate logistic approach were that the SEM identified more significant risk factors (APOE, APOE× cigarettes/day, and SERPINE1× BMI) and that it allowed a more complex and biologically realistic model to be fit, which in turn allowed estimation of direct and indirect risk effects on CHD.
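The odds ratios and confidence limits quoted here can be related to the logistic regression coefficients with a short sketch. It assumes Wald-type limits (estimate ± 1.96 standard errors on the log-odds scale), which is an assumption about how the limits were computed; applied to the TC/HDL-c estimate and standard error in Table 5, it agrees with the tabulated values to within rounding.

```python
import math

def odds_ratio_with_ci(estimate, std_error, z=1.96):
    """Odds ratio and Wald 95% confidence limits from a logit coefficient."""
    return (math.exp(estimate),
            math.exp(estimate - z * std_error),
            math.exp(estimate + z * std_error))

# TC/HDL-c row of Table 5: estimate 0.2512, standard error 0.0835.
or_, lower, upper = odds_ratio_with_ci(0.2512, 0.0835)
print(round(or_, 3), round(lower, 3), round(upper, 3))
```

A confidence interval that excludes 1 corresponds to a coefficient that differs significantly from zero at the 5% level.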

Table 5.  Estimated risk effects on CHD and CHD odds ratios using a univariate logistic regression, the Framingham Offspring Study.

| Effect | Estimate | Standard error | P-value (t-test) | Odds ratio (95% CL) |
|---|---|---|---|---|
| Hypertension | 0.3389 | 0.2953 | 0.2516 | 1.403 (0.786, 2.507) |
| TC/HDL-c | 0.2512 | 0.0835 | 0.0028 | 1.286 (1.091, 1.515) |
| Blood glucose (mg/dL) | 0.0006 | 0.0040 | 0.8881 | 1.001 (0.993, 1.008) |
| Sex | −0.9823 | 0.3079 | 0.0015 | 0.374 (0.205, 0.686) |
| Age | 0.0792 | 0.0167 | <.0001 | 1.082 (1.047, 1.119) |
| Cigarettes per day | 0.0655 | 0.0485 | 0.1771 | 1.068 (0.971, 1.175) |
| Total alcohol per week (oz) | −0.0306 | 0.0354 | 0.3877 | 0.970 (0.905, 1.040) |
| Body mass index (BMI) (kg/m2) | −0.0586 | 0.0909 | 0.5198 | 0.943 (0.789, 1.128) |
| APOE | 0.1973 | 0.2684 | 0.4625 | 1.218 (0.719, 2.064) |
| APOE× cigarettes/day | −0.0168 | 0.0225 | 0.4545 | |
| SERPINE1 | −1.0702 | 1.2188 | 0.3803 | 0.343 (0.031, 3.760) |

Table 6 shows the estimated correlations between exogenous and endogenous variables based on our proposed model. For the two latent endogenous variables (CHD and hypertension), correlations were not calculated.

Table 6.  Estimated correlations between endogenous and exogenous variables using a generalized two-level SEM, the Framingham Offspring Study.

| | TC/HDL-c | Blood glucose (mg/dL) | Sex | Age | Cigarettes/day | Alcohol (oz)/wk | BMI | APOE | APOE× Cigarettes/day | SERPINE1 | SERPINE1× BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blood glucose | 0.258 | 1.000 | | | | | | | | | |
| Cigarettes per day | 0.090 | 0.039 | −0.009 | −0.060 | 1.000 | | | | | | |
| SERPINE1× BMI | −0.032 | −0.045 | 0.008 | 0.033 | 0.003 | 0.020 | −0.023 | 0.028 | 0.012 | 0.881 | 1.000 |


In this work, we presented a generalized two-level SEM to model the development of CHD, which included genotype and GE interactions, using the FHS Offspring Cohort data. Compared with a classical univariate method (logistic regression), our approach had several advantages: (1) it provided important insights into how genes and contributing factors affect CHD by investigating the direct, indirect, and total effects, (2) it aided with the development of biological models that more realistically reflect the complex biological pathways and networks, and (3) more significant risk factors (APOE, APOE× smoking and SERPINE1× BMI) were found when compared to a traditional univariate logistic regression approach. These many advantages should encourage researchers to use this method more frequently in the analysis of complex epidemiological data.

APOE is a major component of LDL and HDL. It plays a key role in the metabolism of cholesterol and triglycerides by serving as a receptor-binding ligand that removes excess cholesterol from the plasma and carries it to the liver for processing (Dallongeville et al., 1992; Eichner et al., 2002). The structural gene locus of this apolipoprotein is polymorphic (Utermann et al., 1977). Three major APOE isoforms encoded by three common alleles (ɛ2, ɛ3, and ɛ4) at the APOE locus have been studied extensively, and results from several studies (Song et al., 2004; Wilson et al., 1996; Hixson, 1991; Davignon et al., 1988) showed that, compared to ɛ3 homozygotes, ɛ4 carriers have the highest CHD risk and ɛ2 carriers the lowest. The APOE polymorphism likely affects CHD risk through lipid metabolism (Song et al., 2004; Davignon et al., 1999). However, none of the current statistical approaches takes this underlying mechanism into account. The results from this study showed a significant interaction between smoking and APOE polymorphism on CHD, which is consistent with previous findings (Humphries et al., 2001; Djoussé et al., 2002; Stengard et al., 1995; Talmud et al., 2005) and supports a possible pathogenesis of CHD in which lipid levels bridge APOE, APOE× smoking, and CHD.

PAI-1 is an inhibitor of fibrinolysis, serving in the control of atherothrombosis and insulin resistance (Alessi & Juhan-Vague, 2006). Several SNPs in the human SERPINE1 gene have been identified (Dawson et al., 1993), among which the 4G/5G insertion/deletion polymorphism located at position −675 of the promoter region has been studied extensively. The 4G/4G genotype of the SERPINE1 gene has been associated with higher PAI-1 levels than the 4G/5G and 5G/5G genotypes (Humphries et al., 1992; Dawson et al., 1993; Eriksson et al., 1995). Elevated plasma PAI-1 levels are associated with reduced fibrinolytic activity, which in turn plays an essential role in the pathogenesis of cardiovascular disease and other diseases associated with thrombosis. Studies have demonstrated that PAI-1 levels are a risk factor for CHD (Hamsten et al., 1987; Eriksson et al., 1995; Juhan-Vague et al., 1996) and diabetes (Mansfield et al., 1995; Festa et al., 2002; 2006; Meigs et al., 2006; Kanaya et al., 2006). Consistent with previous reports that PAI-1 and coronary events are related principally to the insulin resistance syndrome (obesity, glucose intolerance, hypertension, and dyslipidemia) (Juhan-Vague et al., 1996; Anand et al., 2003), we found a significant SERPINE1× BMI interaction that influences the risk of CHD through its effect on BG, a finding that has not been reported previously. In contrast, other studies showed a significant interaction between smoking and the SERPINE1 gene on the risk of CHD (Morange et al., 2007; Su et al., 2006). Whether the SERPINE1× BMI interaction affects CHD independently of traditional risk factors remains to be confirmed in longitudinal studies.

Finally, the present study has limitations. First, the study population consists predominantly of Caucasian residents of Framingham, Massachusetts, which restricts the applicability of the findings to other ethnic groups in which CVD may be more prevalent. Second, including multiple clinical measurements per subject over time might have improved the assessment of GE interactions. However, the added complexity of multiple measurements caused computational convergence problems, so this extension will be pursued in future research.


The FHS is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. We are grateful for the contributions of Boston University and NHLBI Framingham staff.