Journal of the Royal Statistical Society: Series C (Applied Statistics)
  • Open Access

Mixture modelling as an exploratory framework for genotype–trait associations

Authors


Andrea S. Foulkes, Division of Biostatistics and Epidemiology, University of Massachusetts, Arnold House, 715 North Pleasant Street, Amherst, MA 01003, USA. E-mail: foulkes@schoolph.umass.edu

Abstract

Summary.  We propose a mixture modelling framework for both identifying and exploring the nature of genotype–trait associations. This framework extends the classical mixed effects modelling approach for this setting by incorporating a Gaussian mixture distribution for random genotype effects. The primary advantages of this paradigm over existing approaches are that the mixture modelling framework addresses the degrees-of-freedom challenge that is inherent in application of the usual fixed effects analysis of covariance, relaxes the restrictive single normal distribution assumption of the classical mixed effects models and offers an exploratory framework for discovery of underlying structure across multiple genetic loci. An application to data arising from a study of antiretroviral-associated dyslipidaemia in human immunodeficiency virus infection is presented. Extensive simulation studies are also implemented to investigate the performance of this approach.

1. Introduction

Population-based genetic association studies of unrelated individuals provide us with a rich source of data for investigating the genetic basis of complex diseases. Joint analysis of multiple genetic markers within and across genes is increasingly popular as these analyses may lend additional insight into associations. At the same time, model-based methods play an important role in the analysis of data derived from these studies because they serve as a flexible framework for incorporating covariate information, including environmental, demographic and clinical factors. Analysis of covariance is one commonly used testing framework for characterizing genotype–trait associations based on multilocus genotypes or haplotypes (Tzeng et al., 2006; Schaid et al., 2002). The majority of these analyses apply either Wald-type or score-type statistics for testing association. In contrast with studies based on a single single-nucleotide polymorphism (SNP), these analyses are hindered by potentially large degrees of freedom. For example, given n biallelic SNPs, there are 3^n possible multilocus genotypes. Although in practice the actual number of such patterns is limited by sample size and linkage between loci, the resulting number of groups in an analysis-of-covariance model can increase rapidly. Thus, as we include more SNPs in the analysis, the degrees of freedom for the corresponding test statistic can become unwieldy, leading to a reduction in statistical power (Tzeng et al., 2006). In many instances, the degrees of freedom are spent on genetic variants that are relatively rare and for which no true association is detectable owing to insufficient power. Such a reduction of power is a main cause of the high false negative rate and the inconsistently reproduced findings of association studies (Tzeng et al., 2006).
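The growth in the number of possible multilocus genotypes can be illustrated with a short Python sketch. The function name is hypothetical and the calculation simply evaluates 3^n for n biallelic SNPs, each with three genotype levels; it says nothing about how many patterns are actually observed in a given sample.

```python
def n_possible_genotypes(n_snps: int, levels_per_snp: int = 3) -> int:
    """Count the possible multilocus genotypes for n_snps loci.

    Assumes every locus has the same number of genotype levels
    (three for a biallelic SNP: AA, Aa and aa).
    """
    return levels_per_snp ** n_snps


# Growth of the genotype space with the number of biallelic SNPs.
for n_snps in (2, 5, 8):
    print(n_snps, n_possible_genotypes(n_snps))
```

For eight SNPs the count is 6561, whereas the data example in Section 3 contains 59 observed genotype groups, which shows how strongly sample size and linkage limit the patterns seen in practice.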

As a means of addressing this degrees-of-freedom challenge, a mixed effects modelling approach has been proposed recently for the analysis of data arising from genetic association studies, as described in Foulkes et al. (2005, 2007). A similar global testing approach has also been described for the analysis of gene expression data (Goeman et al., 2004). Mixed effects models provide a flexible statistical framework for controlling for potential confounders and identifying interactions between multiple genes and environmental factors that explain the variability in the measured trait. This is achieved simply through inclusion of these quantities as covariates in fixed and random-effects design matrices respectively. Within this framework, the associations between genetic variants and a trait are detected with the application of a single degree of freedom test. Notably, the degrees of freedom of this omnibus test are unaffected by the number of genetic variants and the approach is easily implemented by using existing software tools, including SAS PROC MIXED and the nlme package (Pinheiro et al., 2009) in R. Mixed effects modelling hence provides a complementary approach to study multilocus genotypes involving a large number of potentially informative genetic patterns.

In this paper we propose a more general mixture modelling framework. This framework aims to explore association between a single trait, such as a quantitative measure of disease progression, and multilocus genotypes by testing for the existence of association, and then characterizing this association as a latent class structure. This is a natural extension of the mixed effects model approach for association studies, in which a Gaussian mixture distribution is assumed for random genotype effects. Indeed, in the case that a single Gaussian distribution is appropriate for the random effects, the mixture model that we describe herein reduces to the usual mixed effects modelling framework that was presented in Foulkes et al. (2005). The primary advantages of this paradigm over classical analysis-of-covariance and mixed effects models include that the mixture modelling framework

  • (a) addresses the degrees-of-freedom challenge that is inherent in application of the usual fixed effects analysis of covariance for multilocus genotype,
  • (b) relaxes the restrictive single-Gaussian assumption of the mixed effects model as described previously and
  • (c) offers an exploratory framework for discovery of latent class structure.

Several approaches to relax the distribution assumption of random effects in the context of a mixed effects model have been proposed in the general biostatistics literature. See, for example, Magder and Zeger (1996) and Zhang and Davidian (2004). In this paper, we consider modelling the random effects as a mixture of Gaussian distributions as described in Verbeke and Lesaffre (1996), which can accommodate a broad class of distributions, including multimodal and highly skewed distributions. To our knowledge, application of the mixture modelling framework for exploration of multilocus genotype–trait associations has not been described. In a recent paper, Schumacher and Kraft (2007) proposed an application of a Bayesian latent class model with a mixture prior to select signal-bearing SNPs from a large number of loci in the context of a genomewide association study. This approach is notably different from the mixture modelling approach that is described herein. Specifically, Schumacher and Kraft (2007) reported posterior odds for each SNP based on an assumption about the prior distribution of the log-odds ratios. A fully Bayesian approach is applied with estimation achieved via Gibbs sampling. The goal of Schumacher and Kraft (2007) is to provide shrinkage estimates of single SNP effects by drawing strength from the totality of the data (Hoggart et al., 2008; Lunn et al., 2006). In the present study, an alternative model fitting paradigm is applied and, importantly, the aim is to group individuals (rather than SNPs) on the basis of multiple (rather than single) SNPs within and across genes. This provides a framework for discovery of combinations of markers that together explain the variability in the trait under study. In addition, a key feature of the approach proposed is that the number of components in the mixture distribution is data driven rather than assumed. This offers additional flexibility for characterizing complex multilocus genotype–trait associations.

This study is organized as follows. In Sections 2.1 and 2.2, we begin by introducing our notation and the mixture modelling framework. Here, we focus on the application of this modelling framework to data derived from genotype–trait association studies. In Sections 2.3 and 2.4 we describe a testing and model selection framework for determining the appropriateness of the mixture model and discovering latent structure. An application to real data arising from a study of antiretroviral-associated dyslipidaemia in human immunodeficiency virus infection is described in Section 3 and a simulation study is presented and discussed in Section 4. Finally, we offer a discussion of our findings in Section 5.

2. Methods

2.1. Notation and background

In the context of population-based association studies, inferential interest often rests in characterizing the association between multiple SNPs within and across genes, and a measure of disease status or disease progression. The former is referred to as the multilocus genotype, where locus refers to the site on the genome, whereas the latter is commonly referred to as a phenotype or trait. Here we focus on a single trait and aim to describe the relationships between combinations of SNPs and this trait.

Let $\mathcal{G}=\{g_1,\dots,g_m\}$ be the set of all possible multilocus genotypes across the observed sites. For example, suppose that we observe two biallelic SNPs, with levels (AA, Aa and aa) and (BB, Bb and bb). Here A, a, B and b represent the nucleotides (A, C, T or G) at the corresponding sites. In this case, there are nine possible multilocus genotypes, given by $\mathcal{G}=\{(AA,BB),(AA,Bb),(AA,bb),(Aa,BB),(Aa,Bb),(Aa,bb),(aa,BB),(aa,Bb),(aa,bb)\}$. We begin by grouping individuals with identical observed multilocus genotypes, as described in Foulkes et al. (2004, 2005). Specifically, we let an individual belong to genotype group $G_i$ if the observed multilocus genotype is equal to $g_i$. For example, in Table 1 we report the observed multilocus genotypes for one data example (see Section 3). Here eight subjects have the same multilocus genotype (CC GG AA AA GG CC TT AA; pattern 59) and are thus assigned to the same genotype group. Alternative formulations of genotype groups are tenable, as described in Foulkes et al. (2004). This approach to grouping individuals is similar to defining clusters on the basis of known familial relationships, though a select set of observed genetic markers is used to define the groups. On the basis of this concept, we define Z as an n×m matrix of indicators for genotype group membership, similar to the usual design matrix in an analysis-of-variance model, where n is the number of observations in our sample and m is again the number of genotype groups.
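The grouping step and the indicator matrix Z can be sketched in a few lines of Python. The subject genotypes below are hypothetical and the function name is an assumption introduced for illustration; the sketch only shows how identical multilocus genotypes map to a common column of Z.

```python
def genotype_design(genotypes):
    """Group subjects by multilocus genotype and build the indicator matrix Z.

    genotypes: one tuple of single-locus genotypes per subject.
    Returns (patterns, Z), where Z[r][i] = 1 if subject r carries pattern i.
    """
    patterns = sorted(set(genotypes))                 # observed genotype groups
    index = {g: i for i, g in enumerate(patterns)}    # pattern -> column of Z
    Z = [[1 if index[g] == i else 0 for i in range(len(patterns))]
         for g in genotypes]
    return patterns, Z


# Hypothetical two-SNP genotypes for six subjects (illustration only).
subjects = [("AA", "BB"), ("Aa", "BB"), ("AA", "BB"),
            ("aa", "Bb"), ("Aa", "BB"), ("AA", "BB")]
patterns, Z = genotype_design(subjects)
group_sizes = [sum(column) for column in zip(*Z)]     # n_i for each group
```

Each row of Z contains exactly one 1, and the column sums give the group sizes that appear in the "Group size" column of Table 1.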

Table 1.   Multilocus genotypes by latent class (NWCS 224 study)

Pattern  rs3814055  rs2276706  rs2461825  rs6785049  rs2472682  rs2276707  rs1054191  rs3814057  Group size  Prob. class 1/class 2  Uncertainty

Latent class 1
1        AA         AA         AA         AA         AA         CC         TT         GG         1           0.931/0.069            0.139
2        AA         AA         AA         AG         GG         CC         CT         GG         1           0.903/0.097            0.194
3        AA         AG         AA         AA         AA         CC         TT         GG         1           0.977/0.023            0.047
4        AA         AG         AA         AA         AG         CC         TT         GG         1           0.848/0.152            0.304
5        AA         AG         AA         GG         AG         CC         CC         GG         1           0.892/0.108            0.215
6        AA         AG         AC         GG         GG         CT         CC         GG         1           0.941/0.059            0.117
7        AA         GG         AA         AA         AG         CT         TT         GG         1           0.875/0.125            0.250
8        AA         GG         AA         AG         AA         CC         CT         GG         1           0.781/0.219            0.437
9        AA         GG         AC         AG         AA         CT         CT         GG         1           0.972/0.028            0.056
10       AC         AG         AA         AG         AA         CC         CT         AG         1           0.822/0.178            0.356
11       AC         AG         AC         AG         AA         CT         CT         GG         1           0.904/0.096            0.192
12       AC         AG         AC         GG         AG         CT         CC         GG         1           0.576/0.424            0.849
13       AC         AG         AC         GG         GG         CT         CC         GG         1           0.865/0.135            0.270
14       AC         GG         AC         AA         AG         CC         TT         AG         1           0.936/0.064            0.128
15       AC         GG         AC         AG         GG         CT         CT         AG         1           0.985/0.015            0.030
16       CC         AG         AA         AA         AG         CC         TT         AG         1           0.841/0.159            0.318
17       CC         GG         AA         AG         AG         CC         CT         AG         1           0.578/0.422            0.843
18       CC         GG         AA         GG         AG         CC         CC         AA         1           0.709/0.291            0.581
19       CC         GG         AA         GG         AG         CC         CC         AG         1           0.936/0.064            0.127
20       CC         GG         AC         AA         AG         CT         TT         AG         1           0.697/0.303            0.606
21       CC         GG         AC         AG         AG         CT         CT         AG         1           0.959/0.041            0.083
22       CC         GG         AC         GG         GG         CT         CC         AG         1           0.808/0.192            0.385
23       AA         AA         AA         AG         AG         CC         CT         GG         2           0.528/0.472            0.945
24       AA         AG         AA         AG         AA         CC         CT         GG         2           0.907/0.093            0.185
25       AA         AG         AA         GG         AA         CC         CC         GG         2           0.993/0.007            0.015
26       AA         AG         AC         AA         AA         CT         TT         GG         2           0.968/0.032            0.063
27       AA         AG         AC         AG         AA         CT         CT         GG         2           0.987/0.013            0.025
28       AA         GG         AC         GG         AA         CT         CC         GG         2           0.948/0.052            0.104
29       AA         GG         CC         AA         AA         TT         TT         GG         2           0.981/0.019            0.038
30       AA         GG         CC         AG         AG         TT         CT         GG         2           0.986/0.014            0.028
31       AA         GG         CC         GG         AA         TT         CC         GG         2           0.594/0.406            0.812
32       AC         AG         AA         AA         AG         CC         TT         AG         2           0.989/0.011            0.023
33       AC         AG         AA         AA         GG         CC         TT         AG         2           0.949/0.051            0.102
34       CC         AG         AA         GG         GG         CC         CC         AG         2           0.891/0.109            0.219
35       CC         GG         AA         AG         AG         CC         CT         AA         2           0.952/0.048            0.096
36       AA         GG         CC         AG         AA         TT         CT         GG         4           0.752/0.248            0.495
37       AC         GG         AA         AG         AG         CC         CT         AG         4           0.999/0.001            0.001
38       AC         AG         AA         GG         AG         CC         CC         AG         5           0.998/0.002            0.005
39       AA         AG         AC         AG         AG         CT         CT         GG         7           0.999/0.001            0.001
40       CC         AG         AA         AG         GG         CC         CT         AG         8           0.993/0.007            0.013
41       AC         AG         AA         GG         GG         CC         CC         AG         10          0.989/0.011            0.022
42       AC         GG         AA         GG         AG         CC         CC         AG         10          1.000/0.000            0.000
43       AC         GG         AC         GG         AG         CT         CC         AG         11          1.000/0.000            0.000
44       AC         GG         AC         AA         AG         CT         TT         AG         13          1.000/0.000            0.000
45       AC         AG         AA         AG         AG         CC         CT         AG         18          1.000/0.000            0.000
46       AC         GG         AC         AG         AG         CT         CT         AG         33          1.000/0.000            0.000
47       CC         GG         AA         AG         GG         CC         CT         AA         43          1.000/0.000            0.000
48       CC         GG         AA         GG         GG         CC         CC         AA         59          1.000/0.000            0.000

Latent class 2
49       AA         AA         AA         AG         AA         CC         CT         GG         1           0.412/0.588            0.823
50       AA         AG         AA         AA         AA         CC         TT         AG         1           0.130/0.870            0.260
51       AC         AA         AA         AG         GG         CC         CT         GG         1           0.336/0.664            0.672
52       AC         AG         AC         GG         AA         CT         CC         GG         1           0.165/0.835            0.331
53       AC         AG         AC         GG         GG         CC         CC         AG         1           0.482/0.518            0.965
54       AC         GG         AA         AA         AG         CC         TT         AG         2           0.269/0.731            0.537
55       AC         GG         AC         AA         AA         CT         TT         AG         2           0.457/0.543            0.914
56       CC         GG         AA         AA         AG         CC         TT         AA         3           0.334/0.666            0.669
57       AC         AG         AA         AG         GG         CC         CT         AG         5           0.010/0.990            0.021†
58       AC         GG         AC         GG         GG         CT         CC         AG         8           0.264/0.736            0.528†
59       CC         GG         AA         AA         GG         CC         TT         AA         8           0.135/0.865            0.270

†Significant on the basis of analysis of covariance.

First, we consider application of a fixed effects model for this setting. Let y be an n×1 vector denoting the trait for the observations and X represent the n×p matrix for the covariates, including clinical, demographic and environmental factors. Here p is the number of covariates including the intercept. Let genotype groups be indexed by i, where i=1,…,m. Further suppose that ni is the number of individuals in the ith group and Σi ni=n. A generalized linear fixed effects model for this setting can be written as

$$g(E[y_i]) = X_i\beta + Z_i\gamma_i, \qquad i=1,\dots,m, \tag{2.1}$$

where $y_i$ and $X_i$ are the respective $n_i$ rows of $y$ and $X$ for the $i$th genotype group, $Z_i=J_{n_i}$ is an $n_i\times 1$ vector of 1s, $\beta=(\beta_1,\dots,\beta_p)^T$ is a vector of parameters corresponding to the effects of covariates, $\gamma_i$ is a scalar representing the $i$th genotype group effect constrained by $\sum_i \gamma_i=0$, $E[\cdot]$ denotes expectation and $g(\cdot)$ is a link function. In this paper, for clarity of presentation, we focus on a quantitative trait and the identity link, so model (2.1) reduces to

$$y_i = X_i\beta + Z_i\gamma_i + \varepsilon_i. \tag{2.2}$$

Here we further assume $\varepsilon_i \sim \varphi_{n_i}(0, \sigma^2 I_{n_i})$, where $\varphi_{n_i}$ is an $n_i$-dimensional Gaussian probability density function. In this setting, the null hypothesis of no association between the trait and the genotypes is given by $H_0: \gamma_i=0$ for all $i$. A test of this null hypothesis is commonly performed on the basis of an F-test with degrees of freedom increasing with the number of genotype groups. Detailed discussions of this modelling framework and associated testing procedures can be found, for example, in McCulloch and Searle (2001).

Under a classical mixed effects model (McCulloch and Searle, 2001), we assume that the multilocus genotype effects are random, arising from a single Gaussian distribution. To distinguish between the fixed effects model of equation (2.2) and a mixed effects model with random group effects, we use the following model notation to represent the latter:

$$y_i = X_i\beta + Z_i b_i + \varepsilon_i, \tag{2.3}$$

where

$$b_i \sim \varphi(0, \sigma_b^2) \tag{2.4}$$

is the random effect for the $i$th multilocus genotype, again $i=1,\dots,m$, the $b_i$ are independent of $\varepsilon_i$ and $\varphi(0,\sigma_b^2)$ is a univariate Gaussian probability density function with mean 0 and variance $\sigma_b^2$.

2.2. Mixture model

In the mixture modelling setting, we instead assume that each $b_i$ is independently drawn from a mixture of $K$ Gaussian distributions with mean $\mu_k$ and variance $\sigma_b^2$ for $k=1,\dots,K$. Formally, we have

$$b_i \sim \sum_{k=1}^{K} \pi_k\,\varphi(\mu_k, \sigma_b^2) \tag{2.5}$$

in place of equation (2.4), where $K$ is the number of Gaussian components in the mixture and $\pi_k$ is referred to as the mixing parameter, which is subject to the constraints $\sum_k \pi_k=1$ and $\pi_k \geq 0$. For identifiability, we require that

[display equation not recovered from source]

In the single-component mixed effects models, the expectation of random effects is set to 0 for identifiability of the model's intercept term. It is desirable to keep this property for the mixture model and therefore the following constraint is imposed:

$$\sum_{k=1}^{K} \pi_k\mu_k = 0. \tag{2.6}$$

In the case that K=1, this reduces to a single Gaussian component mixed effects model, i.e. equation (2.4), as discussed in Foulkes et al. (2005).
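The mixture distribution for the random effects and the zero-mean constraint can be illustrated with a short Python sketch. The mixing proportions, component means and seed below are hypothetical values chosen for illustration, and the function names are not taken from any existing software.

```python
import random


def centre_means(pis, mus):
    """Shift the component means so that sum_k pi_k * mu_k = 0."""
    overall = sum(p * m for p, m in zip(pis, mus))
    return [m - overall for m in mus]


def draw_random_effects(m, pis, mus, sigma_b, rng):
    """Draw m genotype effects b_i from a K-component Gaussian mixture
    with a common within-component standard deviation sigma_b."""
    effects = []
    for _ in range(m):
        k = rng.choices(range(len(pis)), weights=pis)[0]   # latent class
        effects.append(rng.gauss(mus[k], sigma_b))
    return effects


pis = [0.9, 0.1]                      # hypothetical mixing proportions
mus = centre_means(pis, [0.0, 1.0])   # centred means satisfy the constraint
rng = random.Random(1)                # fixed seed for reproducibility
b = draw_random_effects(59, pis, mus, sigma_b=0.1, rng=rng)
```

With K=1 the same function draws from a single zero-mean Gaussian distribution, which corresponds to the classical mixed effects model.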

Assuming an equal variance for all Gaussian components provides stronger numerical stability in the estimating procedure, as noted by Verbeke and Molenberghs (2000), and guarantees the existence of maximum likelihood under the mixture random-effects setting (Magder and Zeger, 1996). The adequacy of this assumption is further discussed in Section 4.2.

The distribution of yi is then given by

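$$y_i \sim \sum_{k=1}^{K} \pi_k\,\varphi_{n_i}(X_i\beta + Z_i\mu_k,\; V_i), \tag{2.7}$$

where $V_i = \sigma_b^2 Z_i Z_i^T + \sigma^2 I_{n_i}$. The likelihood of the data is represented as

$$L(\theta,\mu,\pi \mid y) = \prod_{i=1}^{m} \sum_{k=1}^{K} \pi_k\,\varphi_{n_i}(y_i;\, X_i\beta + Z_i\mu_k,\, V_i), \tag{2.8}$$

where $\theta=(\beta,\sigma,\sigma_b)$, $\mu=(\mu_1,\dots,\mu_K)$, $\pi=(\pi_1,\dots,\pi_K)$ and $\varphi_{n_i}(y_i;\cdot,\cdot)$ is a Gaussian likelihood function of $n_i$ dimensions. In this paper, we adopt the expectation–maximization algorithm for model fitting described in Komarek (2001), which is the basis for the existing optimization routines for mixed effects models that are provided in SAS (PROC MIXED) and the R nlme package (Pinheiro et al., 2009).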
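As a minimal sketch of how the mixture likelihood can be evaluated (not the fitting routine itself), the following Python code computes the log-likelihood for given parameter values. It assumes that the fixed effects part has already been subtracted, so each group is represented by its residual vector y_i − X_iβ, that all mixing proportions are strictly positive, and that the covariance has the random-intercept form σ_b² J Jᵀ + σ² I, for which the determinant and inverse are available in closed form. All function names are hypothetical.

```python
import math


def log_component_density(resid, mu_k, sigma, sigma_b):
    """Log Gaussian density of one group's residuals under component k.

    Covariance V = sigma_b^2 * J J^T + sigma^2 * I, so that
    |V| = sigma^(2(n-1)) * (sigma^2 + n * sigma_b^2) and the quadratic form
    follows from the Sherman-Morrison formula.
    """
    n = len(resid)
    e = [r - mu_k for r in resid]          # residuals after removing mu_k
    s2, sb2 = sigma ** 2, sigma_b ** 2
    total = s2 + n * sb2
    log_det = (n - 1) * math.log(s2) + math.log(total)
    quad = (sum(x * x for x in e) - sb2 / total * sum(e) ** 2) / s2
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)


def mixture_log_likelihood(resids_by_group, pis, mus, sigma, sigma_b):
    """Sum over genotype groups of log sum_k pi_k * density_k (pi_k > 0)."""
    ll = 0.0
    for resid in resids_by_group:
        terms = [math.log(p) + log_component_density(resid, m, sigma, sigma_b)
                 for p, m in zip(pis, mus)]
        top = max(terms)                    # log-sum-exp for numerical stability
        ll += top + math.log(sum(math.exp(t - top) for t in terms))
    return ll
```

An expectation–maximization routine would call such a function to monitor convergence; the closed-form inverse avoids building an n_i×n_i matrix for large genotype groups.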

2.3. Detecting association

Two stages of hypothesis testing and exploration are of interest within the mixture modelling framework proposed. First, we aim to determine whether a single-component model (i.e. K=1) provides an adequate fit to the data. If not, we conclude that there is variability in the trait that can be explained by the gene(s) under consideration. Further exploration of the data, as described in Section 2.4, will then allow us to characterize the latent class structure. If a single-component model (K=1) is indeed adequate, then we aim to test whether the genotype effects in such a model have significant variability. If the data suggest that this is so, we again conclude that there is variability in the trait that can be explained by the gene(s) under study.

Given a fitted single-component model, we first test whether the model fits the data adequately. In this paper, we apply the goodness-of-fit test that was suggested by Verbeke and Lesaffre (1996) and Verbeke and Molenberghs (2000) to determine the adequacy of the fitted model. First we define a stochastic variable Ui for each genotype group i as follows:

$$U_i = \frac{a_i^T (y_i - X_i\hat\beta)}{(a_i^T \hat V_i a_i)^{1/2}}, \qquad i=1,\dots,m, \tag{2.9}$$

where $m$ again denotes the number of observed genotype groups, $a_i$ is a prespecified vector of constants and $\hat\beta$ and $\hat V_i$ are the estimates of $\beta$ and $V_i$ respectively. If the model assumed is correct, then the $U_i$, $i=1,\dots,m$, are independently and identically distributed according to a univariate normal distribution. This is tested formally by using a Shapiro–Wilk test, such that a model is rejected if the corresponding Shapiro–Wilk statistic SW({$U_i$}) is smaller than the threshold value corresponding to significance level α. Although the goodness-of-fit test can be performed for any $a_i$, a good choice of $a_i$ can improve the power of the test. Verbeke and Lesaffre (1996) suggested choosing $a_i$ to maximize $a_i^T Z_i \sigma_b^2 Z_i^T a_i / (a_i^T a_i)$ so that the variability in $a_i^T y_i$ that is due to the random effects is large compared with the variability from the error term. This is achieved by letting $a_i$ equal the eigenvector corresponding to the largest eigenvalue of $Z_i Z_i^T$. In our setting of univariate random effects, this simplifies to choosing $a_i$ to satisfy $a_i \propto J_{n_i}$.
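The goodness-of-fit statistics can be sketched as follows, under two assumptions that are stated here rather than taken from the source: that U_i is the standardized projection of the group residuals onto a_i, and that a_i is proportional to the vector of 1s, so that the variance of the projected sum under a single-component model is n_i σ² + n_i² σ_b². The function name is hypothetical.

```python
import math


def gof_statistics(resids_by_group, sigma, sigma_b):
    """Standardized group-level statistics U_i for the goodness-of-fit test.

    resids_by_group: residuals y_i - X_i * beta_hat, one list per group.
    Assumes a_i is the vector of 1s, so a_i' V_i a_i = n*sigma^2 + n^2*sigma_b^2.
    """
    stats = []
    for resid in resids_by_group:
        n = len(resid)
        var_of_sum = n * sigma ** 2 + n ** 2 * sigma_b ** 2
        stats.append(sum(resid) / math.sqrt(var_of_sum))
    return stats
```

The resulting values could then be passed to a Shapiro–Wilk routine, for example `scipy.stats.shapiro` if SciPy is available.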

Rejection of the single-Gaussian assumption ($H_0: K=1$) both confirms the existence of association and implies the need for a more complex model of association. If we fail to reject this null hypothesis, association may still exist by way of a random effect that arises from a single Gaussian distribution. Under this simpler model, a test of the null hypothesis of no association between multilocus genotypes and the trait is given by an omnibus test of $H_0: \sigma_b^2=0$, as described in Foulkes et al. (2005). Since this null hypothesis is testing a parameter at a boundary, a likelihood ratio test statistic is approximately distributed as $\tfrac{1}{2}\chi_0^2 + \tfrac{1}{2}\chi_1^2$, where $\chi_0^2$ represents a distribution with a point mass at zero (Stram and Lee, 1994).

The detection of association therefore potentially involves two tests and a Bonferroni adjustment is applied to control the familywise error rate. Further characterization of association in this framework is achieved through examining the best linear unbiased predictors of the random effects and the corresponding prediction intervals, as described in Verbeke and Molenberghs (2000) and Foulkes et al. (2005). Sensitivity of the performance of these tests to the Gaussian mixture assumption is studied in Section 4.2.
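The p-value for a variance component tested at the boundary, together with the Bonferroni adjustment for two tests, can be sketched in Python. The sketch assumes the equal mixture of a point mass at zero and a chi-squared distribution with 1 degree of freedom; the function names are hypothetical.

```python
import math


def chi2_1_sf(x):
    """P(chi-squared with 1 df > x), written via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lrt_pvalue(lrt):
    """p-value under the 50:50 mixture of a point mass at 0 and chi-squared(1)."""
    if lrt <= 0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt)


def bonferroni(p_value, n_tests=2):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, n_tests * p_value)
```

Under this reference distribution the 5% critical value of the likelihood ratio statistic is about 2.71 rather than the 3.84 of an ordinary chi-squared test with 1 degree of freedom.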

2.4. Defining latent classes

Once the single Gaussian distribution assumption is rejected by the goodness-of-fit test, the multilocus genotype effects are then modelled by a mixture of Gaussians. An evaluation of the approach is presented in Section 4.2. The number of Gaussian components (K) is determined by using Akaike's information criterion (Akaike, 1973). In the application below, we limit consideration to 2 ≤ K ≤ 5 for ease of interpretation of the final results. The model with the lowest corresponding Akaike information criterion value is selected. Alternative approaches for determining K can be found, for example, in McLachlan and Peel (2000). After an appropriate model has been identified, we aim to characterize further the structure of association. We begin by defining some additional notation by recalling the mixture model that was described in Section 2.2. The mixture model can be interpreted as a latent class model (McLachlan and Peel, 2000) in which a multilocus genotype arises from one of the K latent classes that have differential effects on the trait, each specified by a separate Gaussian distribution. We define the latent class membership of the ith genotype group by a vector of indicator variables, di=(di1,…,diK), where

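$$d_{ik} = \begin{cases} 1 & \text{if genotype group } i \text{ belongs to latent class } k, \\ 0 & \text{otherwise.} \end{cases}$$

We assume that $d_i$, $i=1,\dots,m$, are independently and identically distributed according to a multinomial distribution consisting of one draw on $K$ categories with probabilities $\pi=(\pi_1,\dots,\pi_K)$, i.e.

$$d_i \sim \text{Multinomial}(1;\, \pi_1,\dots,\pi_K).$$

It leads to

$$\Pr(d_{ik}=1) = \pi_k, \qquad k=1,\dots,K.$$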

Although di is unobservable, we can estimate the posterior probability that dik=1. This is given formally by

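$$\hat p_{ik} = \Pr(d_{ik}=1 \mid y_i, \hat\theta, \hat\mu, \hat\pi) = \frac{\hat\pi_k\, f_{ik}(y_i \mid \hat\theta, \hat\mu_k)}{\sum_{l=1}^{K} \hat\pi_l\, f_{il}(y_i \mid \hat\theta, \hat\mu_l)}, \tag{2.10}$$

where $\hat\pi_k$ is the estimated mixing proportion for Gaussian component $k$, $\hat\mu$ and $\hat\theta$ are restricted maximum likelihood estimates of $\mu$ and $\theta=(\beta,\sigma,\sigma_b)$ respectively and $f_{ik}$ is the $k$th Gaussian component of the density of $y_i$, given by

$$f_{ik}(y_i \mid \theta, \mu_k) = (2\pi)^{-n_i/2}\,|V_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(y_i - X_i\beta - Z_i\mu_k)^T V_i^{-1} (y_i - X_i\beta - Z_i\mu_k)\right\}. \tag{2.11}$$

We classify the $i$th genotype group as belonging to the latent class with the highest corresponding posterior probability, i.e., if $c(G_i)$ denotes the latent class to which the $i$th genotype group is assigned, we let $c(G_i) = \arg\max_k \hat p_{ik}$. A scaled version of the classification uncertainty, as suggested by Fraley and Raftery (2002), is defined as

$$\text{uncertainty}(G_i) = \frac{K}{K-1}\left(1 - \max_{k}\, \hat p_{ik}\right). \tag{2.12}$$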

This uncertainty measure ranges from 0 to 1 and is defined such that, if a genotype is classified into a latent class with a high posterior probability, then the classification has a low uncertainty.
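The classification rule and the scaled uncertainty can be sketched in Python as follows. The scaling factor K/(K−1) is inferred from the requirement that the measure ranges from 0 to 1 and agrees with the values reported in Table 1; the function names are hypothetical and K ≥ 2 is assumed.

```python
import math


def posterior_probabilities(log_weighted_densities):
    """Normalise log(pi_k * f_ik) over k into posterior class probabilities."""
    top = max(log_weighted_densities)
    weights = [math.exp(v - top) for v in log_weighted_densities]
    total = sum(weights)
    return [w / total for w in weights]


def classify(posterior):
    """Return (1-based latent class label, scaled uncertainty in [0, 1]).

    Requires at least two classes, otherwise the scaling is undefined.
    """
    K = len(posterior)
    best = max(range(K), key=lambda k: posterior[k])
    uncertainty = (1.0 - posterior[best]) * K / (K - 1)
    return best + 1, uncertainty


# Pattern 59 of Table 1: posterior probabilities 0.135/0.865.
label, uncertainty = classify([0.135, 0.865])
```

For pattern 59 this gives latent class 2 with uncertainty 0.27, matching the tabulated value; equal posterior probabilities give the maximum uncertainty of 1.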

3. Example

In this section we apply the mixture modelling approach to data derived from the AIDS Clinical Trials Group ‘New works concept sheet’ (NWCS) 224 study, which was an investigation aimed at identifying genetic factors that predict lipid abnormalities in antiretroviral-treated individuals infected with human immunodeficiency virus, type 1. A complete description of the study population can be found in Foulkes et al. (2006, 2007). In this paper, we investigate the relationship between the pregnane X receptor gene, which is also known as NR1I2, and fasting non-high-density lipoprotein. Analyses are adjusted for potential confounding effects of age, gender, use of lipid lowering therapy, CD4 cell count and antiretroviral therapy drug exposures. A total of n=306 Caucasians are included in the analysis. Individuals with missing genotypes, unknown drug exposures, short durations of exposure to a specific class of drugs or a short washout period are excluded. Eight SNPs within pregnane X receptor are considered: rs1054191, rs2276706, rs2276707, rs2461825, rs2472682, rs3814055, rs3814057 and rs6785049. Each SNP is coded as a three-level factor variable, resulting in 59 observed genotype groups. A complete listing of the multilocus genotypes is given in Table 1.

Diagnostic quantile–quantile plots (see Section 4.2) in the initial analysis show that the random effects deviate from the Gaussian mixture assumption with a long-tailed distribution. A Box–Cox transformation, Box–Cox(y)=(y^λ−1)/λ with λ=0.8, is therefore applied to the trait (non-high-density lipoprotein) and a mixture model is refitted. A further discussion of Box–Cox transformation for Gaussian mixture models is given in Yeung et al. (2001). Visual inspection of the distribution of $\hat b_i$, as well as the residuals from the fitted model, using quantile–quantile plots against a Gaussian distribution for each latent class, suggests that the Gaussian mixture assumption is reasonable. Application of the goodness-of-fit testing leads us to reject the null hypothesis of a single Gaussian distribution on the random effects (Bonferroni adjusted p-value, 0.040) and a two-component model is selected on the basis of the Akaike information criterion value. On the basis of the posterior probability estimates, we assign about 90% (n=273) of the subjects across 48 genotype groups to latent class 1 and the remaining 10% (n=33) of the subjects across 11 genotype groups are assigned to latent class 2. The corresponding uncertainty for each latent class assignment is given in Table 1. The average uncertainties for assignments to latent classes 1 and 2 and their (10th, 90th) percentiles are 0.20 (0.00,0.59) and 0.54 (0.26,0.91) respectively. By comparing the odds ratios for each SNP between these two classes, SNPs rs2461825 and rs3814057 appear to be the most influential.
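The Box–Cox transformation applied to the trait can be written as a short Python function. This is a generic sketch of the standard transformation for positive values, with the logarithm as the limiting case λ=0; the function name is hypothetical and the input values are illustrative rather than study data.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform (y**lam - 1) / lam for y > 0; log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("Box-Cox transformation requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Illustrative trait values transformed with lambda = 0.8, as in the example.
transformed = [box_cox(y, 0.8) for y in (100.0, 150.0, 220.0)]
```

The transformation is monotone increasing for positive values, so the ordering of subjects by trait value is preserved.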

A visual representation of the effects of the genotype groups with classification uncertainty less than 0.5 is provided in Fig. 1. For genotype groups within latent class 1, the posterior means of the random effects of the lipid outcome cluster around zero (marked by a vertical line). In contrast, for genotype groups in latent class 2, the estimates of the random effects shift upwards. This suggests that individuals with genotypes belonging to the second latent class are likely to have higher non-high-density lipoprotein cholesterol levels than individuals with genotypes belonging to the first latent class. Finally, for comparison, a fixed effects analysis of covariance, also adjusted for patient level covariates, is performed using the same data. This approach similarly provides evidence for association (overall p-value, 0.007) and the effects of genotype patterns 57 and 58, as given in Table 1, are significantly different from 0 (p-values, 0.013 and 0.008). The remaining nine groups in latent class 2 were not identified.

Figure 1.

 Classification plot for the NWCS study (•, latent class 1; □, latent class 2): 13 genotype groups with uncertainty higher than 0.5 are removed, including seven, four, one and one groups with sizes 1, 2, 3 and 8 respectively

4. Simulation study

The results of a simulation study are presented in four parts. Part one studies the power of the mixture models under various conditions. Part two presents a sensitivity study for the Gaussian mixture assumption for the random effects. Part three compares the fitted distribution of $b_i$ and the true distribution for mixture models. The connection between the misclassification and uncertainty is also studied. Finally, part four compares the performance of the mixture models with the lasso, which is a popular high dimensional data analysis approach (Tibshirani, 1996). Each simulation study is based on N=1000 data sets, and each data set is of size n=1000. Moreover, for the comparison of power under various displacement and scale values, and number of mixture components, all simulated data sets concerning power estimation are rescaled such that the total standard deviation of bi is equal to 0.5. The noise level σ is set to 1 in all cases.
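One way to carry out such a rescaling is sketched below in Python. It assumes that the component means and the common within-component standard deviation are multiplied by a single factor so that the total standard deviation of the mixture equals the target; this is an illustrative assumption, since the exact rescaling procedure is not spelled out, and the equal mixing proportions are hypothetical.

```python
import math


def mixture_sd(pis, mus, sigma_b):
    """Total SD of a Gaussian mixture with common within-component SD sigma_b."""
    mean = sum(p * m for p, m in zip(pis, mus))
    between = sum(p * (m - mean) ** 2 for p, m in zip(pis, mus))
    return math.sqrt(sigma_b ** 2 + between)


def rescale(pis, mus, sigma_b, target_sd=0.5):
    """Scale means and sigma_b by a common factor to reach the target total SD."""
    factor = target_sd / mixture_sd(pis, mus, sigma_b)
    return [factor * m for m in mus], factor * sigma_b


# Displacement 4 and sigma_b = 0.1 before rescaling.
pis = [0.5, 0.5]
scaled_mus, scaled_sigma_b = rescale(pis, [-2.0, 2.0], 0.1)
```

Because a common factor is applied, the ratio of displacement to within-component standard deviation is unchanged by the rescaling.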

Each simulation begins by generating a set of multilocus genotypes and then simulating the trait according to model (2.7). For our simulations, we define binary genotype variables (e.g. an indicator for the presence of at least one variant allele at a given SNP locus), with probabilities P(AA)=0.6 and P(Aa/aa)=0.4. The conditional probability distribution of genotypes across two neighbouring loci is P(BB|AA)=0.7 and P(BB|Aa/aa)=0.4. Genotypes are simulated sequentially one locus at a time such that, given the genotype at locus i, the genotype for locus i+1 is generated according to the conditional probabilities.
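The sequential generation of genotypes can be sketched in Python as a two-state Markov chain along the loci. The probabilities are those quoted above; the coding (0 for no variant allele, 1 for at least one variant allele), the seed and the function name are assumptions made for illustration.

```python
import random

P_WILD_FIRST = 0.6                  # P(AA) at the first locus
P_WILD_NEXT = {0: 0.7, 1: 0.4}      # P(BB | AA) and P(BB | Aa/aa)


def simulate_genotype(n_loci, rng):
    """One multilocus genotype as a tuple of 0/1 indicators
    (1 = at least one variant allele at that locus)."""
    genotype = [0 if rng.random() < P_WILD_FIRST else 1]
    while len(genotype) < n_loci:
        p_wild = P_WILD_NEXT[genotype[-1]]   # depends on the previous locus only
        genotype.append(0 if rng.random() < p_wild else 1)
    return tuple(genotype)


rng = random.Random(2024)            # fixed seed for reproducibility
sample = [simulate_genotype(5, rng) for _ in range(1000)]
n_groups = len(set(sample))          # at most 2**5 = 32 genotype groups
```

Identical tuples define a genotype group, so the number of groups grows with the number of loci but is bounded by 2 to the power of the number of loci under this binary coding.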

4.1. Power study

We begin by investigating how changes in the displacement parameter for the random effects, given by δ=|μ2−μ1|, and the scale parameter, given by σb, alter the power to detect overall association and latent class structure under a two-component mixture model for the random effects. In simulating the data, we assume that the random effects arise from a two-component model, i.e. $b_i \sim \pi_1\varphi(\mu_1,\sigma_b^2) + \pi_2\varphi(\mu_2,\sigma_b^2)$. For comparison, the simulated data sets are rescaled such that the total standard deviation for bi is equal to 0.5 under all simulation conditions. The power for δ=0.5,1,4 at σb=0.1 is reported for increasing number of loci. Next, the power is recorded for σb=0.1,0.2,0.5 at δ=4 and with increasing numbers of loci. The results are presented in Fig. 2. The power to detect overall association is defined as the proportion of simulations that result in rejecting either the single-Gaussian assumption for random effects through the goodness-of-fit test or the omnibus test at the Bonferroni-adjusted 0.05-level. The power to detect latent structure is defined as the proportion of simulations that result in rejecting the single-Gaussian model in favour of a model with two or more components, based on the Shapiro–Wilk goodness-of-fit test. Notably, this definition of power implies that the data-generating distribution can be captured with a mixture of components (i.e. there is latent structure) in the case that the single-Gaussian model is rejected. Further, power for detecting latent structure is defined loosely in terms of identifying more than one component, i.e. power is defined as Pr(reject single-component assumption | multicomponent model). In the simulation study, we use two-component models as realizations from the alternative space and calculate Pr(reject single-component assumption | 2-component model). Although not explicitly part of our definition of power, we additionally suggest coupling the proposed approach with close examination of quantile–quantile plots for $\hat b_i$ for each fitted component, as discussed in Section 4.2.

Figure 2.

 Power to detect overall association and latent structure under (a) various displacements (σb=0.1) and (b) various scale values (δ=4)

As expected, a larger displacement raises the power to discover latent structure. For example, from Fig. 2(a), we see that the power to detect latent structure for a five-locus genotype (x-axis) at σb=0.1 increases from approximately 60% to 100% as δ increases from 0.5 to 4. However, the differences in the power to detect overall association are small in the context of a large number of genotype groups. Smaller values of σb also lead to improvements in the power to detect latent structure. For example, from Fig. 2(b), for five-locus genotypes and δ=4, by reducing σb from 0.5 to 0.1, the power for detecting latent structure increases from approximately 90% to 100%. Such a decrease in σb leads to a slight decrease in the power to detect overall association.

4.2. Sensitivity analysis

Sensitivity analysis is performed to investigate the validity of the testing result when there is a deviation from the modelling assumption for the random effects. In this analysis, bi is simulated following four types of distribution: single Gaussian, log-normal, uniform and beta. For all cases, bi is rescaled to have standard deviation σb equal to (0.01, 0.1, 0.3, 0.5, 0.7, 1) with zero mean, and σ=1. Single-component models are fitted to these simulated data sets. The proportion of rejected null hypotheses for omnibus and goodness-of-fit tests is obtained on the basis of 1000 simulated data sets under each condition and displayed in Fig. 3. For the omnibus test, the proportions of rejected null hypotheses are very close for all cases; only minor differences are found when σb is between 0.3 and 0.7. For the goodness-of-fit test, which is testing for more than one Gaussian component, proportions of rejected null hypotheses are similar under low signal levels (i.e. σb ≤ 0.1). At medium and higher signal levels, data sets with log-normal random effects give much higher proportions of rejected null hypotheses than others. Additional simulations (which are not displayed in the figures) show that the omnibus test under Gaussian mixture random effects with three and four components has similar proportions of rejected null hypotheses to those of single Gaussian (with less than 4% differences when 0.3 ≤ σb ≤ 0.7).
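Random effects from the four families can be generated and standardized as in the following Python sketch. The parameters of each sampler, in particular the beta shape parameters, are hypothetical placeholders, and the standardization uses the empirical mean and standard deviation of each draw, which is an assumption about how the rescaling might be done.

```python
import random
import statistics

# Hypothetical samplers; the parameter values are placeholders for illustration.
SAMPLERS = {
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
    "log-normal": lambda rng: rng.lognormvariate(0.0, 1.0),
    "uniform": lambda rng: rng.uniform(0.0, 1.0),
    "beta": lambda rng: rng.betavariate(0.5, 0.5),
}


def draw_effects(kind, m, target_sd, rng):
    """Draw m random effects and rescale them to zero mean and SD target_sd."""
    raw = [SAMPLERS[kind](rng) for _ in range(m)]
    mean = statistics.mean(raw)
    sd = statistics.pstdev(raw)
    return [(value - mean) / sd * target_sd for value in raw]


rng = random.Random(0)   # fixed seed for reproducibility
effects = {kind: draw_effects(kind, 200, 0.3, rng) for kind in SAMPLERS}
```

After standardization the four sets of effects share the same first two moments and differ only in shape, which is the feature that the sensitivity analysis is designed to probe.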

Figure 3.

Proportion of rejected null hypotheses under non-Gaussian random effects for (a) the goodness-of-fit test and (b) the omnibus test; curves are shown for log-normal, beta, uniform and single-Gaussian random effects

On the basis of the above analysis, we note that rejection of the single-Gaussian assumption for random effects does not imply that the true distribution of the random effects is close to a mixture of Gaussian distributions. We therefore introduce two diagnostic tools to measure how well the mixture of Gaussian distributions approximates the actual distribution for the random effects. First we study the measure of uncertainty of the classification.

Random effects under the three non-Gaussian distributions that were described in the previous paragraph and a two-component Gaussian mixture (σb=0.5, |μ1−μ2|=1 and σ=1) are simulated; their total standard deviations are rescaled to equal 0.5 for comparison. Two-component Gaussian mixture models are fitted to these simulated data sets. The empirical cumulative distribution function of the uncertainty is obtained from fitted models for each simulation; their average is shown in Fig. 4. For the Gaussian mixture and log-normal models, most uncertainties are low; for the beta and uniform models there are more high uncertainty values. On average the 75th percentile of the uncertainty is about 0.1 for the log-normal, 0.25 for the Gaussian and 0.6 for the beta and uniform models.

Figure 4.

Average empirical cumulative distribution function of the measure of uncertainty; curves are shown for log-normal, Gaussian mixture, beta and uniform random effects

To study further the source of high uncertainty for the beta and uniform models, we study the relationship between the uncertainty and the shape of the distribution for the random effects on the basis of four simulated data sets under different random-effects distributions. Fig. 5 shows the kernel density estimation of inline image and the measure of uncertainty corresponding to each inline image. In general the uncertainty is higher at the boundary of two latent classes. For the beta and uniform distributions, there is a stronger concentration of high uncertainty values at the boundary of two latent classes and the average uncertainty is therefore inflated. For the log-normal distribution, the measure of uncertainty is low as the Gaussian mixture generally gives a good approximation to the long-tailed distribution as pointed out by McLachlan and Peel (2000).

Figure 5.

 Random-effects distribution and uncertainty (inline image, smoothed estimated density for inline image; |, uncertainty for genotype group i, located at inline image on the x-axis; ⋮, separation of inline image from different latent classes): (a) Gaussian mixture; (b) log-normal; (c) beta; (d) uniform

The second method to detect deviation from the Gaussian mixture assumption for the random effects is to inspect the quantile–quantile plots for inline image against the Gaussian distribution for each fitted component. If a mixture of Gaussian distributions is a good approximation to the true underlying distribution, we expect the distribution of inline image for each fitted component to be close to Gaussian. Thus, although not intended to make generalizations, quantile–quantile plots in this setting serve as an important diagnostic tool (similar to the usual regression setting) to assess the appropriateness of modelling assumptions. The quantile–quantile plots for the four distributions are shown in Fig. 6. For log-normal random effects, the plots give a strong indication of deviation from the assumption. For the beta and uniform models, the tails of the quantile–quantile plots raise concern. The information that is gained from using the quantile–quantile plots as a diagnostic tool depends on the quality of classification. The measure of uncertainty as presented in Section 2.4 provides a tool to evaluate the quality of the classification and should be used as a complement to the quantile–quantile plots.

Figure 6.

 Quantile–quantile plots for inline image for each fitted component: (a) Gaussian mixture; (b) log-normal; (c) beta; (d) uniform

The assumption of a common variance for all latent classes is mainly for numerical stability. To study the adequacy of the assumption, we perform a simulation by fitting common variance models to simulated data sets under various variance ratios inline image assuming that inline image, with total standard deviation for random effects rescaled to be 0.5. The results show that, given a moderate variance ratio, i.e. less than or equal to 3, the deviation from the common variance assumption does not have significant effects on the power for detecting association or the number of components selected. The bias in estimation of displacement δ is mild. However, the bias in estimation of inline image is stronger and we observe a higher rate of misclassification and uncertainty of classification. Again, we can use the diagnostic plots to detect whether there is a strong deviation from the common variance assumption.

Simulation is also performed to investigate the sensitivity to outliers and skewness in the error. The results show that strongly skewed noise and large outliers may inflate the type I error rate for the goodness-of-fit test and therefore lead to false rejection of the single-Gaussian assumption. However, they also substantially inflate the measure of uncertainty for the falsely rejected cases, to the range 0.7–0.9. High uncertainty therefore also signals the presence of outliers and skewed noise.

4.3. Model comparison and misclassification

We compare inline image with the true distribution of yi when the data arise from a three-component mixture model. We simulate 1000 data sets, each with a sample size of 1000, by using the three-component model that was described in Section 4.2. We set the separation of the components equal to the noise level, i.e. μ2−μ1=μ3−μ2=σ=1 and σb=0.1. Mixture models with 1–4 components are fitted to these simulated data sets. To compare the performance of these models quantitatively, we calculate the Kolmogorov–Smirnov distance as a measure of the dissimilarity between the cumulative distribution function of yi for the true model and the fitted models. The distributions of the Kolmogorov–Smirnov distance under 1–4-component models are displayed in Fig. 7. Models with three or four components appear to give the best fit; however, the Kolmogorov–Smirnov distance for the overfitting four-component model has larger spread owing to the higher variance of the parameter estimates. On the basis of the goodness-of-fit test and Akaike information criterion, approximately 58% of simulations are determined to have three components, about 34% are assigned two-component models and the remaining 8% or so are assigned one component. Few fitted models are chosen to have more than three components. Notably, the large separation between components as described may not be observed in practice. Additional simulations, for which the separation between components is reduced by half, result in centres of the estimated density curves for 2–4-component models that are close together, and approximately 84% and 16% of the simulated data sets are chosen to have two- and single-component models respectively.
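The Kolmogorov–Smirnov distance between two Gaussian mixture distributions can be approximated numerically, as in the following Python sketch. It is illustrative only: the function names, the grid-based approximation of the supremum, the equal mixing proportions and the moment-matched single Gaussian that stands in for a fitted one-component model are all assumptions and are not taken from the analysis above.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian with the given mean and sd."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mixture_cdf(x, weights, means, sd):
    """CDF of a Gaussian mixture whose components share a common sd."""
    return sum(w * normal_cdf(x, m, sd) for w, m in zip(weights, means))


def ks_distance(mix_a, mix_b, lo=-10.0, hi=10.0, steps=20001):
    """Approximate sup_x |F_a(x) - F_b(x)| on an evenly spaced grid."""
    best = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        gap = abs(mixture_cdf(x, *mix_a) - mixture_cdf(x, *mix_b))
        best = max(best, gap)
    return best


# Marginal trait distribution with component separation 1, sigma = 1 and
# sigma_b = 0.1; equal mixing proportions are an assumption.
component_sd = math.sqrt(1.0 + 0.1 ** 2)
true_mix = ((1 / 3, 1 / 3, 1 / 3), (-1.0, 0.0, 1.0), component_sd)

# Hypothetical one-component comparator matched to the true mean and variance.
total_sd = math.sqrt(component_sd ** 2 + 2.0 / 3.0)
one_component = ((1.0,), (0.0,), total_sd)

print(ks_distance(true_mix, one_component))
```

In a simulation study this distance would be computed once per simulated data set and per fitted model, and the resulting values summarized by model size.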

Figure 7.

Kolmogorov–Smirnov distance between the true and fitted distributions of the trait; curves are shown for the one-, two-, three- and four-component models

To illustrate the relationship between misclassification and uncertainty, a simulated data set with two latent classes is generated with δ=2 and σb=0.3. The posterior means, 95% prediction intervals, group sizes and classification uncertainties for the 32 genotype groups that are observed in this simulated data set are presented in Fig. 8. The vertical dotted line in the middle panel divides the genotype groups into two latent classes such that genotype groups with posterior means to the left of the line are assigned to latent class 1 whereas genotype groups with posterior means to the right of this line are assigned to latent class 2. Genotype groups with relatively large or small posterior means are correctly classified with low uncertainty. The two genotype groups that have the highest uncertainties are misclassified. On the basis of this finding it appears that declaring the genotype groups with high uncertainty as unclassifiable may be appropriate and would reduce the overall misclassification error.
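The idea of declaring high-uncertainty genotype groups unclassifiable can be sketched in Python as follows. This is a simplified illustration under stated assumptions: uncertainty is taken here to be one minus the largest posterior class probability, which is one common definition and may differ from the measure of Section 2.4; the equal mixing proportions, the component means 0 and 2, the threshold of 0.2 and the function names are all hypothetical, and the estimation error of the posterior means is ignored.

```python
import math


def posterior_probs(b_hat, weights, means, sd):
    """Posterior class probabilities for a value under a common-variance Gaussian mixture."""
    logs = [math.log(w) - 0.5 * ((b_hat - m) / sd) ** 2 for w, m in zip(weights, means)]
    top = max(logs)  # subtract the maximum for numerical stability
    dens = [math.exp(v - top) for v in logs]
    total = sum(dens)
    return [d / total for d in dens]


def classify_with_reject(b_hat, weights, means, sd, max_uncertainty):
    """Return (class label or None, uncertainty); None marks an unclassifiable group."""
    probs = posterior_probs(b_hat, weights, means, sd)
    uncertainty = 1.0 - max(probs)
    if uncertainty > max_uncertainty:
        return None, uncertainty
    return probs.index(max(probs)) + 1, uncertainty


weights, means, sd = (0.5, 0.5), (0.0, 2.0), 0.3  # delta = 2, sigma_b = 0.3
for b_hat in (0.0, 0.9, 1.0, 2.0):
    print(b_hat, classify_with_reject(b_hat, weights, means, sd, max_uncertainty=0.2))
```

Groups whose value lies near the midpoint of the two component means are returned as unclassifiable, which mirrors the observation that the misclassified groups are those with the highest uncertainty.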

Figure 8.

 Relationship between misclassification and uncertainty (•, from component 1; □, from component 2): genotype groups with posterior means to the left of the vertical dotted line are assigned to latent class 1; otherwise, they are assigned to latent class 2; the two genotype groups that have the highest uncertainties are misclassified

4.4. Comparison with lasso

Several alternative high dimensional data methods have been applied to SNP data, including classification and regression trees (Foulkes et al., 2004), random forests (Breau et al., 2004), multivariate adaptive regression splines (Lin et al., 2008) and logic regression (Schwender and Ickstadt, 2008). Although a comprehensive assessment of the relative performance of these approaches is beyond the scope of the present paper, we offer a brief comparison with one popular regression-based approach: the lasso (Tibshirani, 1996). The lasso is an approach that focuses on variable selection, i.e. selecting a subset of variables (SNPs in our setting) that are associated with the trait under study. We begin with seven binary SNPs and let σb=0.1 and σ=1. Two scenarios are considered. First we assume that the trait has the distribution

image

Secondly, we assume that

image

where ‘_’ can be 0 or 1. Scenario 1 can be interpreted as a model for a genetic pathway controlled by SNP 1 and SNP 2: when both SNPs are ‘1’, the mean trait value rises. Similarly, scenario 2 models a slightly more complicated pathway controlled by three SNPs, where SNP 7 determines the direction of displacement of the trait values. Mixture models and the lasso are then applied to the same data sets simulated under these two scenarios. For comparison, we define selection power as the proportion of simulations for which the lasso selects at least one of the correct SNPs associated with the trait. The R package lars is used for fitting the lasso regression, and the results are summarized in Table 2.

Table 2.   Comparison of mixture models and the lasso
Scenario | Overall power, mixture models | Power to detect latent class, mixture models | Selection power, lasso
1 | 0.99 | 0.70 | 1
2 | 0.99 | 0.72 | 0.79

In the first scenario, the lasso has a similar ‘power’ to that of the mixture models and can pinpoint important SNPs with a moderate noise level: on average 34% of selected SNPs are irrelevant. In the second scenario, the performance of the lasso is markedly lower and the SNP selection becomes noisier: on average 65% of selected SNPs are irrelevant. The difference is mainly due to inadequate modelling of multi-SNP interactions in lasso regression. In the mixture modelling approach, SNP-by-SNP interactions are in a sense embedded in each genotype pattern and thus do not require explicit modelling. In contrast, the lasso considers each SNP as a separate variable and directly provides a subset of important SNPs.
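The two lasso summaries used in this comparison, selection power and the average share of irrelevant SNPs among those selected, can be computed from the per-simulation selections as in the Python sketch below. The function names and the toy selections are hypothetical, and averaging the irrelevant share over simulations with a non-empty selection is an assumption about how that figure was obtained.

```python
def selection_power(selected_sets, true_snps):
    """Proportion of simulations in which at least one true SNP is selected."""
    hits = sum(1 for selected in selected_sets if selected & true_snps)
    return hits / len(selected_sets)


def mean_irrelevant_fraction(selected_sets, true_snps):
    """Average share of selected SNPs that are not truly associated with the trait.

    Simulations with an empty selection are skipped (an assumption).
    """
    fractions = [len(s - true_snps) / len(s) for s in selected_sets if s]
    return sum(fractions) / len(fractions)


# Toy selections from four hypothetical simulations; the true SNPs are 1 and 2.
true_snps = {1, 2}
selected_sets = [{1, 2, 5}, {3}, set(), {1}]

print(selection_power(selected_sets, true_snps))
print(mean_irrelevant_fraction(selected_sets, true_snps))
```

The two quantities are deliberately separate: a method can select at least one correct SNP in nearly every simulation while still returning many irrelevant SNPs alongside it.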

5. Discussion

In this paper, we demonstrate the application of mixture modelling as a tool for discovery of multilocus genotype–trait associations in the context of population-based genetic investigations. In this framework, the strength of association is measured by the spread of bi (as quantified by inline image) in a single-component model, and by both inline image and the separation between components (as measured by the difference between the μk) in multiple-component models. This modelling approach reduces to the usual mixed effects model in the presence of a single component, while offering a broader framework that may be more suitable for some complex disease association settings. Specifically, as described in Foulkes et al. (2008), under a founder model, in which a single allele is associated with the trait, and either a dominant or recessive genetic model is assumed, the mixed effects model with random effects arising from a single normal distribution may be inappropriate. A mixture model may better reflect the underlying association in this case since it comprises multiple components that could, for example, correspond to the presence (or absence) of a founder allele. A similar point was made in Roeder (1994).

Computationally, we adopt the algorithm that was described in Komarek (2001) to fit the mixture model, which is relatively simple to implement in R. However, it becomes slow when both the sample size and the number of SNPs are large (i.e. n>3000 and more than 10 SNPs). A future implementation based on Van Dyk (2000) may improve the computational efficiency.

Population substructure is a potential confounder in population-based association studies. In the example that was provided in Section 3, we include race or ethnicity as a covariate in our model, which addresses potential confounding by this self-declared category. Notably, if information on substructure informative loci had been collected, we could additionally apply a principal component analysis approach (Price et al., 2006) and use the resulting principal components in our regression framework. In family-based studies, an extra layer of clustering is present due to the familial relations. In this setting, a non-nested random effect can be added to the model to account for additional within-family correlations.

Although the application of a mixture model addresses the degrees-of-freedom challenge that is inherent in the fixed effects modelling setting, it remains unwieldy in higher dimensional settings. Specifically, as the number of SNPs increases, the number of genotype groups will rapidly approach the number of individuals in a given sample, rendering model fitting untenable. A limited number of genes, therefore, can be selected on the basis of an a priori scientific hypothesis when applying the mixture modelling method. In addition, a combination of mixture modelling and machine learning techniques, such as logic regression (Ruczinski et al., 2003) or random forests (Breiman, 2001), may be appropriate. This would involve developing an approach that is similar to the model-based recursive partitioning approach recently described by Zeileis et al. (2008), with the inclusion of a mixture model. These approaches may also serve as useful tools for post hoc analysis of signature differences across latent classes. Further extensions will allow for application of this approach to the study of ambiguous phase haplotype–trait associations.
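The growth in the number of observed genotype groups with the number of SNPs can be illustrated by a small simulation, sketched here in Python. The setting is hypothetical: loci are simulated independently (no linkage) under Hardy–Weinberg proportions with a common minor allele frequency of 0.3, and the function name and sample size are chosen only for illustration.

```python
import random


def count_genotype_groups(n_individuals, n_snps, rng, minor_allele_freq=0.3):
    """Count the distinct multilocus genotypes observed in a simulated sample."""
    patterns = set()
    for _ in range(n_individuals):
        # Each locus carries 0, 1 or 2 copies of the minor allele.
        genotype = tuple(
            (rng.random() < minor_allele_freq) + (rng.random() < minor_allele_freq)
            for _ in range(n_snps)
        )
        patterns.add(genotype)
    return len(patterns)


rng = random.Random(7)
for n_snps in (1, 2, 4, 6, 8, 10):
    groups = count_genotype_groups(500, n_snps, rng)
    print(n_snps, "SNPs:", groups, "observed groups; upper bound", min(3 ** n_snps, 500))
```

The count is bounded above by both the number of possible multilocus genotypes and the sample size; linkage between loci, which is ignored here, would reduce the number of groups observed in practice.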

Finally, major results of the simulation are summarized as follows:

  • (a) the total power is over 80% under moderate signal strength;
  • (b) the power of the omnibus test is insensitive to the Gaussian mixture assumption, and under low signal strength the goodness-of-fit test is also insensitive to the Gaussian mixture assumption;
  • (c) deviation from the Gaussian mixture assumption may be detected by using the uncertainty measure and quantile–quantile plots for inline image;
  • (d) deviation from the common variance assumption for all latent classes does not have a significant effect on power or number of components selected, but it causes bias in the estimation of component variances and higher misclassification rates;
  • (e) mixture modelling has a better performance than the lasso in detecting association when there is a strong interaction between loci; however, the lasso can pinpoint important loci that are associated with the trait.

Acknowledgements

Support for this research is provided by National Institutes of Health–National Institute of Allergy and Infectious Diseases grant R01AI056983, National Institutes of Diabetes and Digestive and Kidney Diseases grant R01DK021224 and National Institute of Allergy and Infectious Diseases grants P30AI042845 and AI38858. We also thank the referees for several helpful suggestions.
