## 1. Introduction

Population-based genetic association studies of unrelated individuals provide a rich source of data for investigating the genetic basis of complex diseases. Joint analysis of multiple genetic markers within and across genes is increasingly popular because such analyses may lend additional insight into associations. At the same time, model-based methods play an important role in the analysis of data derived from these studies because they serve as a flexible framework for incorporating covariate information, including environmental, demographic and clinical factors. Analysis of covariance is one commonly used testing framework for characterizing genotype–trait associations based on multilocus genotypes or haplotypes (Tzeng *et al.*, 2006; Schaid *et al.*, 2002). The majority of these analyses apply either Wald-type or score-type statistics for testing association. In contrast with studies based on a single single-nucleotide polymorphism (SNP), these analyses are hindered by potentially large degrees of freedom. For example, given $n$ biallelic SNPs, there are $3^{n}$ possible multilocus genotypes. Although in practice the actual number of such patterns is limited by sample size and linkage between loci, the resulting number of groups in an analysis-of-covariance model can increase rapidly. Thus, as we include more SNPs in the analysis, the degrees of freedom for the corresponding test statistic can become unwieldy, leading to a reduction in statistical power (Tzeng *et al.*, 2006). In many instances, the degrees of freedom are spent on genetic variants that are relatively rare and for which no true association is detectable owing to insufficient power. Such a reduction in power is a major cause of the high false negative rate and the inconsistent reproducibility of findings in association studies (Tzeng *et al.*, 2006).
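The growth in the number of groups can be illustrated with a small sketch. The function names and the toy sample below are hypothetical and serve only as an illustration; genotypes are assumed to be coded as the number of copies (0, 1 or 2) of one allele at each SNP, and the genotype factor is assumed to take one level per observed multilocus pattern.

```python
def max_multilocus_genotypes(n_snps: int) -> int:
    """Number of possible multilocus genotypes for n biallelic SNPs (3 per locus)."""
    return 3 ** n_snps


def observable_bound(n_snps: int, n_individuals: int) -> int:
    """Distinct patterns in a sample cannot exceed the number of individuals."""
    return min(max_multilocus_genotypes(n_snps), n_individuals)


def genotype_factor_df(observed_genotypes) -> int:
    """Degrees of freedom for a fixed effects genotype factor:
    one level per distinct observed pattern, minus one reference level."""
    return len(set(observed_genotypes)) - 1


# Hypothetical toy sample: 5 individuals typed at 3 SNPs (allele counts 0/1/2).
sample = [(0, 1, 2), (0, 1, 2), (1, 1, 0), (2, 0, 0), (0, 0, 0)]

for n in (1, 2, 5, 10):
    print(n, max_multilocus_genotypes(n), observable_bound(n, n_individuals=500))

print("df for toy sample:", genotype_factor_df(sample))
```

Even in this toy sample, four distinct patterns among five individuals already use three degrees of freedom, and each additional SNP can only split the existing groups further.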

As a means of addressing this degrees-of-freedom challenge, a mixed effects modelling approach has recently been proposed for the analysis of data arising from genetic association studies, as described in Foulkes *et al.* (2005, 2007). A similar global testing approach has also been described for the analysis of gene expression data (Goeman *et al.*, 2004). Mixed effects models provide a flexible statistical framework for controlling for potential confounders and identifying interactions between multiple genes and environmental factors that explain the variability in the measured trait. This is achieved simply through inclusion of these quantities as covariates in the fixed and random-effects design matrices respectively. Within this framework, the associations between genetic variants and a trait are detected with the application of a single-degree-of-freedom test. Notably, the degrees of freedom of this omnibus test are unaffected by the number of genetic variants, and the approach is easily implemented by using existing software tools, including SAS PROC MIXED and the nlme package (Pinheiro *et al.*, 2009) in R. Mixed effects modelling hence provides a complementary approach to the study of multilocus genotypes involving a large number of potentially informative genetic patterns.

In this paper we propose a more general mixture modelling framework. This framework aims to explore association between a single trait, such as a quantitative measure of disease progression, and multilocus genotypes by testing for the existence of association, and then characterizing this association as a latent class structure. This is a natural extension of the mixed effects model approach for association studies, in which a Gaussian mixture distribution is assumed for random genotype effects. Indeed, in the case that a single Gaussian distribution is appropriate for the random effects, the mixture model that we describe herein reduces to the usual mixed effects modelling framework that was presented in Foulkes *et al.* (2005). The primary advantages of this paradigm over classical analysis-of-covariance and mixed effects models include that the mixture modelling framework

- (a) addresses the degrees-of-freedom challenge that is inherent in application of the usual fixed effects analysis of covariance for multilocus genotypes,
- (b) relaxes the restrictive single-Gaussian assumption of the mixed effects model as described previously and
- (c) offers an exploratory framework for discovery of latent class structure.
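
To make the nesting in (b) concrete, one way to write such a model is sketched below. The notation is an assumption chosen for illustration and need not match the notation used in Section 2: $y_{ij}$ denotes the trait for individual $j$ with multilocus genotype $i$, $x_{ij}$ a covariate vector, $b_i$ the random genotype effect and $K$ the number of mixture components.

$$
y_{ij} = x_{ij}^{T}\beta + b_i + \varepsilon_{ij}, \qquad
\varepsilon_{ij} \sim N(0, \sigma^2), \qquad
b_i \sim \sum_{k=1}^{K} \pi_k \, N(\mu_k, \sigma_b^2), \qquad
\sum_{k=1}^{K} \pi_k = 1 .
$$

Under this sketch, setting $K = 1$ and $\mu_1 = 0$ gives a single Gaussian distribution for the random effects, which corresponds to the usual mixed effects model, whereas $K > 1$ allows the components to be interpreted as latent classes.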

Several approaches to relaxing the distributional assumption for the random effects in a mixed effects model have been proposed in the general biostatistics literature. See, for example, Magder and Zeger (1996) and Zhang and Davidian (2004). In this paper, we consider modelling the random effects as a mixture of Gaussian distributions as described in Verbeke and Lesaffre (1996), which can accommodate a broad class of distributions, including multimodal and highly skewed distributions. To our knowledge, application of the mixture modelling framework for exploration of multilocus genotype–trait associations has not been described. In a recent paper, Schumacher and Kraft (2007) proposed an application of a Bayesian latent class model with a mixture prior to select signal-bearing SNPs from a large number of loci in the context of a genomewide association study. This approach is notably different from the mixture modelling approach that is described herein. Specifically, Schumacher and Kraft (2007) reported posterior odds for each SNP based on an assumption about the prior distribution of the log-odds ratios. A fully Bayesian approach is applied, with estimation achieved via Gibbs sampling. The goal of Schumacher and Kraft (2007) is to provide shrinkage estimates of single SNP effects by drawing strength from the totality of the data (Hoggart *et al.*, 2008; Lunn *et al.*, 2006). In the present study, an alternative model fitting paradigm is applied and, importantly, the aim is to group individuals (rather than SNPs) on the basis of multiple (rather than single) SNPs within and across genes. This provides a framework for discovery of combinations of markers that together explain the variability in the trait under study. In addition, a key feature of the approach proposed is that the number of components in the mixture distribution is data driven rather than assumed. This offers additional flexibility for characterizing complex multilocus genotype–trait associations.

This paper is organized as follows. In Sections 2.1 and 2.2, we begin by introducing our notation and the mixture modelling framework. Here, we focus on the application of this modelling framework to data derived from genotype–trait association studies. In Sections 2.3 and 2.4 we describe a testing and model selection framework for determining the appropriateness of the mixture model and discovering latent structure. An application to real data arising from a study of antiretroviral-associated dyslipidaemia in human immunodeficiency virus infection is described in Section 3, and a simulation study is presented and discussed in Section 4. Finally, we offer a discussion of our findings in Section 5.