### Abstract

The detection of loci contributing effects to complex human traits, and their subsequent fine-mapping for the location of causal variants, remains a considerable challenge for the genetics research community. Meta-analyses of genomewide association studies, primarily ascertained from European-descent populations, have made considerable advances in our understanding of complex trait genetics, although much of their heritability is still unexplained. With the increasing availability of genomewide association data from diverse populations, transethnic meta-analysis may offer an exciting opportunity to increase the power to detect novel complex trait loci and to improve the resolution of fine-mapping of causal variants by leveraging differences in local linkage disequilibrium structure between ethnic groups. However, we might also expect there to be substantial genetic heterogeneity between diverse populations, both in terms of the spectrum of causal variants and their allelic effects, which cannot easily be accommodated through traditional approaches to meta-analysis. In order to address this challenge, I propose novel transethnic meta-analysis methodology that takes account of the expected similarity in allelic effects between the most closely related populations, while allowing for heterogeneity between more diverse ethnic groups. This approach yields substantial improvements in performance, compared to fixed-effects meta-analysis, both in terms of power to detect association, and localization of the causal variant, over a range of models of heterogeneity between ethnic groups. Furthermore, when the similarity in allelic effects between populations is well captured by their relatedness, this approach has increased power and mapping resolution over random-effects meta-analysis. *Genet. Epidemiol.* 35:809-822, 2011. © 2011 Wiley Periodicals, Inc.

### INTRODUCTION

Genomewide association studies (GWAS) have been extremely successful in identifying loci contributing genetic effects to a wide range of complex human traits. However, despite this success, the joint effects of these loci typically explain only a small proportion of the heritability [Manolio et al., 2009; McCarthy et al., 2008]. Furthermore, the loci identified through GWAS often extend over hundreds of kilobases, contain many genes and large numbers of variants with indistinguishable signals of association, occurring as a result of linkage disequilibrium (LD) across the region. The challenge is now to identify novel loci that contribute to the “missing” heritability of complex traits, and to refine the location of causal variants within already established loci in order to prioritize genes for followup through functional studies.

The vast majority of GWAS have been undertaken in populations of European descent [Rosenberg et al., 2010]. The availability of European-descent population cohorts, such as those made available by the Wellcome Trust Case Control Consortium [The Wellcome Trust Case Control Consortium, 2007], has expedited the use of “shared controls” between GWAS, reducing the burden of sample collection and genotyping [Zhuang et al., 2010]. Meta-analyses of European-descent GWAS have proved to be profitable in identifying additional complex trait loci by increasing sample size without the cost of additional genotyping [Barrett et al., 2009; Dupuis et al., 2010; Lango Allen et al., 2010; Voight et al., 2010]. This process has been greatly aided by the development of imputation techniques that allow the prediction of genotypes not typed on GWAS chips, but present on a higher density reference panel of phased haplotypes from the same, or a closely related population [Marchini and Howie, 2010]. Appropriate reference panels for European-descent populations have been made available through the International HapMap Project [The International HapMap Consortium, 2007, 2010] and at higher density through the 1000 Genomes Project [The 1000 Genomes Project Consortium, 2010]. These reference panels provide more complete coverage of common genetic variation throughout the genome, and thus will be more likely to explicitly include causal variants than will GWAS genotyping products. However, LD between common variants among European-descent populations will likely continue to hamper fine-mapping efforts, even with the large sample sizes accrued through GWAS meta-analysis.

Two of the key challenges in performing GWAS in other ethnic groups have been the lack of appropriate genotyping products and availability of well-matched imputation reference panels [Jallow et al., 2009]. Initial GWAS chips were designed to preferentially capture common genetic variation in Europeans [Rosenberg et al., 2010]. Underlying differences in the structure of LD between diverse populations reduced the efficiency of these genotyping products in other ethnic groups. However, more recent chips are less biased to European-descent populations, and GWAS are now increasingly undertaken, with great success, in other ethnic groups including Japanese [Kamatani et al., 2010; Kochi et al., 2010; Takata et al., 2010; Uno et al., 2010; Yamauchi et al., 2010], Chinese [Abnet et al., 2010; Chen et al., 2011; Wang et al., 2010], Koreans [Jee et al., 2010], Indian Asians [Chambers et al., 2010] and Africans [Petrovski et al., 2010; Thye et al., 2010]. Furthermore, the 1000 Genomes Project will provide comprehensive reference panels of common variants, and hence permit accurate imputation, in diverse ethnic groups from African, Asian and American, as well as European-descent populations [The 1000 Genomes Project Consortium, 2010].

With the increasing availability of GWAS data from diverse populations, transethnic meta-analysis may offer an exciting opportunity to increase the power to detect novel loci, through increased sample size, as well as to improve the resolution of fine-mapping of causal variants [Cooper et al., 2008; Zaitlen et al., 2010]. The underlying differences in the structure of LD between ethnic groups can be leveraged to amplify the signal of association at the causal variant. In particular, we would not expect that any set of indistinguishable associated variants will be the same in all populations from different ethnic groups. However, the allele frequency spectrum is also highly variable between diverse populations, with the result that a causal variant may be specific, or more relevant, to one ethnic group. For example, the risk allele for a causal variant for cardiomyopathy in *MYBPC3* has 4% frequency in populations from the Indian subcontinent, but is much rarer or not observed in other ethnic groups [Dhandapany et al., 2009]. Furthermore, causal variants may interact with environmental risk factors that differ in exposure between ethnic groups, generating variability in the marginal allelic effect between populations. It is thus not clear that the findings of GWAS will translate from one ethnic group to another, and we might therefore expect considerable heterogeneity in allelic effects between distantly related populations.

Irrespective of the source of genetic heterogeneity, traditional methodology for the meta-analysis of GWAS, as implemented in the GWAMA software [Magi and Morris, 2010], cannot appropriately take account of the resulting variability in allelic effects between ethnic groups. Fixed-effects meta-analysis assumes the allelic effect to be the same in all populations. Conversely, random effects meta-analysis assumes that each population has a different underlying allelic effect. This is also unsatisfactory since we expect populations from the same ethnic group to be more homogeneous than those that are more distantly related. In order to address this challenge, I have developed novel transethnic meta-analysis methodology that takes account of the expected similarity in allelic effects between the most closely related populations by means of a *Bayesian partition model* [Denison and Holmes, 2001; Knorr-Held and Rasser, 2000]. Briefly, for each variant, allelic effects and the corresponding standard errors are estimated within each population under the assumption of an additive model for the reference allele. Populations are then clustered according to their similarity in terms of relatedness (i.e. shared ancestry) and allelic effects at the variant. Populations within the same cluster are assumed to have the same underlying allelic effect. However, clusters are assumed to have different underlying allelic effects, thus allowing for heterogeneity. The methodology has been implemented in the MANTRA (Meta-ANalysis of Transethnic Association studies) software.

In this article, I apply MANTRA to association studies of type 2 diabetes (T2D) from five diverse ethnic groups [Waters et al., 2010], and highlight the evidence of heterogeneity in allelic effects between populations at the *CDKAL1* locus. I demonstrate, by means of simulation, substantial improvements in the performance of MANTRA, compared to traditional fixed-effects meta-analysis, both in terms of power to detect association, and localization of the causal variant, over a range of models of heterogeneity between ethnic groups. Furthermore, I also demonstrate increased power and mapping resolution for MANTRA over random-effects meta-analysis when the pattern of allelic effects between populations is well captured by the Bayesian partition model. These results highlight the potential of MANTRA to detect and fine-map novel loci for complex traits through application to transethnic GWAS.

### METHODS

Consider the results of a series of *N* transethnic GWAS of a continuous or dichotomous trait, ascertained from populations *P*_{1}, *P*_{2},…, *P*_{N}, at a given variant. We denote by *b*_{i} and *s*_{i} the estimated allelic effect (under an additive model, i.e. log-odds ratio in the context of a dichotomous trait) and corresponding standard error, respectively, of the *i*th study at the variant. In traditional meta-analysis, we typically assume that *b*_{i} is normally distributed with mean β_{i} and standard deviation *s*_{i}, i.e. *b*_{i}∼*N*(β_{i},*s*_{i}^{2}), where β_{i} denotes the *i*th population-specific allelic effect.

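Under the null model, *M*_{0}, of no association of the variant with the trait in *any* population, **β** = **0**. In a Bayesian framework, the evidence in favor of the alternative model, *M*_{1}, corresponding to **β**≠**0**, can be assessed by means of the *Bayes' factor* [Kass and Raftery, 1995], given by

$$\Lambda = \frac{f(\mathbf{b},\mathbf{s}\mid M_1)}{f(\mathbf{b},\mathbf{s}\mid M_0)}.$$

In this expression, *f*(**b**,**s**|*M*) denotes the marginal likelihood of the observed allelic effects under model *M*. This marginal likelihood is given by integration over the unknown model parameters, **θ**, which include the population-specific allelic effects, **β**, and additional hyper-parameters relating to their prior distribution, to be defined later. It thus follows that

$$f(\mathbf{b},\mathbf{s}\mid M) = \int f(\mathbf{b},\mathbf{s}\mid\boldsymbol{\theta},M)\, f(\boldsymbol{\theta}\mid M)\, d\boldsymbol{\theta}, \tag{1}$$

where the likelihood

$$f(\mathbf{b},\mathbf{s}\mid\boldsymbol{\theta},M) = \prod_{i=1}^{N} f(b_i,s_i\mid\beta_i)$$

and

$$f(b_i,s_i\mid\beta_i) \propto \exp\left[-\frac{(b_i-\beta_i)^2}{2s_i^2}\right]. \tag{2}$$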

#### BAYESIAN PARTITION MODEL

Under a Bayesian partition model [Denison and Holmes, 2001; Knorr-Held and Rasser, 2000], **β** is determined by the assignment of populations to *ethnic clusters*, referred to as a tessellation, and the corresponding cluster allelic effects, **ψ**. The tessellation is defined by specifying *K* cluster centres, **C** = {*C*_{1}, *C*_{2},…, *C*_{K}}, ordered and without replacement from the populations. Remaining populations are then assigned to the “nearest” cluster centre. Here, the distance between the *i*th population, *P*_{i}, and *k*th cluster centre, *C*_{k}, is measured by the *F*-statistic (*F*_{ST}) or some other metric of allele frequency dissimilarity [Weir and Cockerham, 1984; Weir and Hill, 2002; Wright, 1951]. If a population is equidistant from multiple nearest cluster centres, it is assigned to that with minimum *k*. The tessellation is then given by **T**, where *T*_{ik} = 1 if population *P*_{i} is assigned to the cluster with centre *C*_{k}, and 0 otherwise.
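The assignment rule described above can be sketched in a few lines of Python. This is a minimal illustration only, not the MANTRA implementation: the function name `assign_to_clusters`, the 0-based indexing of populations and the pairwise *F*_{ST} values are all hypothetical, and distinct populations are assumed to have strictly positive distance so that each centre is assigned to its own cluster.

```python
def assign_to_clusters(fst, centres):
    """Build the tessellation matrix T from pairwise distances.

    fst      -- symmetric matrix, fst[i][j] = distance between populations i and j
    centres  -- ordered list of population indices chosen as cluster centres
    Returns T as a list of rows with T[i][k] = 1 if population i is assigned
    to the cluster whose centre is centres[k], and 0 otherwise.
    """
    n_pops = len(fst)
    n_clusters = len(centres)
    T = [[0] * n_clusters for _ in range(n_pops)]
    for i in range(n_pops):
        # min() returns the first minimiser, so a population that is
        # equidistant from several centres goes to the one with minimum k.
        k_best = min(range(n_clusters), key=lambda k: fst[i][centres[k]])
        T[i][k_best] = 1
    return T


# Hypothetical pairwise F_ST values for four populations (illustrative only).
fst = [
    [0.00, 0.01, 0.10, 0.12],
    [0.01, 0.00, 0.11, 0.12],
    [0.10, 0.11, 0.00, 0.02],
    [0.12, 0.12, 0.02, 0.00],
]
T = assign_to_clusters(fst, centres=[0, 2])

# Tie-breaking example: population 0 is equidistant from both centres.
tie_fst = [
    [0.0, 0.1, 0.1],
    [0.1, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
T_tie = assign_to_clusters(tie_fst, centres=[1, 2])
```

In this toy example the first two populations fall into the cluster centred on population 0 and the last two into the cluster centred on population 2.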

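For a given tessellation, the allelic effect in the *i*th population is that of the cluster to which it is assigned, so that β_{i} = Σ_{k} *T*_{ik}ψ_{k}. We can then express the population-specific contribution to the likelihood in Equation (2) as

$$f(b_i,s_i\mid\beta_i) \propto \exp\left[-\frac{\left(b_i-\sum_{k=1}^{K}T_{ik}\psi_k\right)^2}{2s_i^2}\right]. \tag{3}$$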

The special case of a single cluster, *K* = 1, corresponds to no heterogeneity between population-specific allelic effects, and thus can be thought of as a Bayesian implementation of fixed-effects meta-analysis. Furthermore, when *K* = *N*, each population is assigned to a different cluster, and the model can thus be thought of as a Bayesian implementation of random-effects meta-analysis.
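To make the role of the tessellation concrete, the following sketch evaluates the log-likelihood implied by Equation (3), up to an additive constant, for the two special cases and for an intermediate partition. The function name `log_likelihood` and the effect estimates are hypothetical and chosen purely for illustration.

```python
def log_likelihood(b, s, T, psi):
    """Log-likelihood, up to an additive constant, for a given tessellation.

    b, s -- estimated allelic effects and standard errors, one per population
    T    -- tessellation matrix, T[i][k] = 1 if population i is in cluster k
    psi  -- cluster allelic effects, one per cluster
    """
    total = 0.0
    for b_i, s_i, row in zip(b, s, T):
        # Population-specific effect is the effect of its assigned cluster.
        beta_i = sum(t_ik * psi_k for t_ik, psi_k in zip(row, psi))
        total += -((b_i - beta_i) ** 2) / (2.0 * s_i ** 2)
    return total


# Hypothetical summary statistics for four populations.
b = [0.10, 0.12, 0.30, 0.28]
s = [0.05, 0.05, 0.05, 0.05]

# K = 1: one shared effect (fixed-effects analogue).
ll_one = log_likelihood(b, s, [[1], [1], [1], [1]], [0.20])

# K = 2: two clusters of two populations each.
ll_two = log_likelihood(b, s, [[1, 0], [1, 0], [0, 1], [0, 1]], [0.11, 0.29])

# K = N: every population in its own cluster (random-effects analogue).
identity = [[1 if i == k else 0 for k in range(4)] for i in range(4)]
ll_all = log_likelihood(b, s, identity, b)
```

With these made-up values the two-cluster partition fits far better than a single shared effect, while *K* = *N* fits perfectly by construction; the prior on *K* is what penalizes the additional clusters.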

#### PRIOR DENSITY FUNCTION

The Bayes' factor, Λ, depends on the prior density function, *f*(**θ**|*M*), of parameters under model *M*. Under the null model, *M*_{0}, the population-specific allelic effects are all zero, and hence any clustering of populations is irrelevant. Hence, *f*(**θ**|*M*_{0}) = 1 if **β** = **0**, and 0 otherwise. Conversely, under *M*_{1}, population-specific allelic effects are determined by the Bayesian partition model. Under this model, the prior density of the number of clusters of populations is given by

$$f(K) = \tfrac{1}{2} \ \text{ if } K = 1, \qquad f(K) \propto 2^{-K} \ \text{ if } K = 2,\ldots,N.$$

In other words, the prior probability of heterogeneity in allelic effects between populations is 0.5. Furthermore, when there is heterogeneity between populations, the number of clusters has a geometric distribution, such that *f*(*K*)/*f*(*K* + 1) = 2. This prior model gives greater probability to a partition with few clusters of populations. This is consistent with a prior belief that allelic effects are most likely to vary between broad ethnic groups, but are less likely to vary between more closely related populations.
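As a worked illustration of this prior, the sketch below tabulates *f*(*K*) for a given number of populations using exact fractions. It assumes that the geometric decay applies to *K* = 2,…,*N* and is normalized so that these values sum to 0.5; the function name `prior_num_clusters` and the handling of the degenerate case *N* = 1 are my own assumptions.

```python
from fractions import Fraction


def prior_num_clusters(n_pops):
    """Return a dict {K: f(K)} for K = 1..n_pops.

    f(1) = 1/2; for K >= 2, f(K) is proportional to 2**(-K), scaled so that
    the heterogeneous models (K > 1) share the remaining prior mass of 1/2.
    """
    if n_pops == 1:
        # Assumption: with a single population there is only one partition.
        return {1: Fraction(1)}
    weights = {k: Fraction(1, 2 ** k) for k in range(2, n_pops + 1)}
    total = sum(weights.values())
    prior = {1: Fraction(1, 2)}
    for k, w in weights.items():
        prior[k] = Fraction(1, 2) * w / total
    return prior


prior = prior_num_clusters(5)
for k in sorted(prior):
    print(k, prior[k])
```

For five populations this gives *f*(2) = 4/15, *f*(3) = 2/15, *f*(4) = 1/15 and *f*(5) = 1/30, each half of the previous value.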

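Given *K*, each population is equally likely, *a priori*, to be a cluster centre, and the cluster allelic effects have a prior *N*(µ,σ) distribution, independent of **C**, where µ has a prior uniform distribution and σ has a prior exponential distribution with expectation 1. The weak joint prior density *f*(**ψ**,µ,σ) is readily overwhelmed by the data, and has been selected for computational efficiency. Combining the components of the prior density function, it follows that

$$f(\boldsymbol{\theta}\mid M_1) \propto f(K)\,\frac{(N-K)!}{N!}\,e^{-\sigma}\prod_{k=1}^{K} f(\psi_k\mid\mu,\sigma),$$

where *f*(ψ_{k}|µ,σ) denotes the *N*(µ,σ) density of the *k*th cluster allelic effect, and (*N*−*K*)!/*N*! is the prior probability of any one ordered set of *K* cluster centres.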

#### MCMC ALGORITHM

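It is not possible to evaluate the marginal likelihood *f*(**b**,**s**|*M*) directly. However, consider the joint posterior density of **θ** under the model *M*, given by

$$f(\boldsymbol{\theta}\mid\mathbf{b},\mathbf{s},M) \propto f(\mathbf{b},\mathbf{s}\mid\boldsymbol{\theta},M)\, f(\boldsymbol{\theta}\mid M). \tag{4}$$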

This density appears in the integrand of Equation (1) and can be approximated by means of a Metropolis–Hastings MCMC algorithm [Hastings, 1970; Metropolis et al., 1953]. The dimensionality of **θ** depends on the number of clusters of populations and can be addressed by incorporating a birth-death process for *K* by means of a reversible-jump step in the MCMC algorithm [Green, 1995]. In each iteration of the algorithm, candidate parameter values, **θ**′, are proposed by making “small” changes to the current set, **θ**, as described in Supplementary Methods. The proposed parameter values are then accepted in place of **θ** with probability proportional to *f*(**θ**′|**b**,**s**,*M*)/*f*(**θ**|**b**,**s**,*M*); otherwise the current set is retained.
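The propose-and-accept step can be illustrated with a deliberately simplified random-walk Metropolis sampler. This sketch is not the reversible-jump algorithm used in MANTRA: it assumes a single cluster (*K* = 1), a flat prior on the shared allelic effect, and a symmetric Gaussian proposal, so that a candidate is accepted with probability min(1, posterior ratio). All names, tuning values and data are hypothetical.

```python
import math
import random


def log_posterior(psi, b, s):
    # Flat prior on psi (an assumption), so the log-posterior equals the
    # log-likelihood of Equation (2) up to an additive constant.
    return sum(-((b_i - psi) ** 2) / (2.0 * s_i ** 2) for b_i, s_i in zip(b, s))


def metropolis(b, s, n_iter=20000, burn_in=1000, thin=10, step=0.05, seed=1):
    """Random-walk Metropolis sampler for one shared allelic effect."""
    rng = random.Random(seed)
    psi = 0.0
    current = log_posterior(psi, b, s)
    draws = []
    for it in range(n_iter):
        # Propose a "small" symmetric change to the current value.
        proposal = psi + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, b, s)
        # Accept with probability min(1, posterior ratio); otherwise retain.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            psi, current = proposal, candidate
        # Discard the burn-in and record only every thin-th iteration.
        if it >= burn_in and (it - burn_in) % thin == 0:
            draws.append(psi)
    return draws


# Hypothetical summary statistics for four populations.
b = [0.10, 0.12, 0.30, 0.28]
s = [0.05, 0.05, 0.05, 0.05]
draws = metropolis(b, s)
posterior_mean = sum(draws) / len(draws)
```

Under these assumptions the posterior of the shared effect is normal and centred on the inverse-variance weighted mean of the estimates, which provides a simple check on the sampler.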

The MCMC algorithm is run for an initial burn-in period to allow convergence from randomly assigned starting values for **θ**. Convergence is assessed using standard diagnostics [Gammerman, 1997]. After convergence, each set of parameter values accepted or retained by the algorithm represents a draw from the posterior distribution *f*(**θ**|**b**,**s**,*M*). To reduce autocorrelation between consecutive draws of **θ**, the sampled set of parameter values is recorded at only every *t*th iteration of the algorithm, for some suitably large *t*.

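Over *R* recorded outputs from the MCMC algorithm, with parameter values denoted **θ**^{(1)}, **θ**^{(2)},…, **θ**^{(R)}, the marginal likelihood *f*(**b**,**s**|*M*) is approximated by

$$f(\mathbf{b},\mathbf{s}\mid M) \approx \left[\frac{1}{R}\sum_{r=1}^{R}\frac{1}{f(\mathbf{b},\mathbf{s}\mid\boldsymbol{\theta}^{(r)},M)}\right]^{-1},$$

the harmonic mean of sampled likelihood values [Newton and Raftery, 1994]. In this expression,

$$f(\mathbf{b},\mathbf{s}\mid\boldsymbol{\theta}^{(r)},M) = \prod_{i=1}^{N} f(b_i,s_i\mid\beta_i^{(r)}),$$

where *f*(*b*_{i},*s*_{i}|β_{i}^{(r)}) is given by Equation (3) for parameter values in **θ**^{(r)}. An estimate of the Bayes' factor, Λ, can then be obtained from two independent runs of the MCMC algorithm, once each under model *M*_{0} and *M*_{1}.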
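The harmonic mean approximation is most safely computed on the log scale, since likelihood values can underflow. The sketch below is a minimal illustration with hypothetical function names and made-up log-likelihood values; it uses the log-sum-exp identity and then takes the difference of the two log marginal likelihoods to give a log Bayes' factor.

```python
import math


def log_harmonic_mean(log_liks):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    harmonic mean = R / sum_r(1 / L_r), evaluated via log-sum-exp on -log L_r.
    """
    neg = [-x for x in log_liks]
    m = max(neg)
    log_sum_inv = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_liks)) - log_sum_inv


def log_bayes_factor(log_liks_m1, log_liks_m0):
    """Log Bayes' factor in favor of M1 from recorded log-likelihoods."""
    return log_harmonic_mean(log_liks_m1) - log_harmonic_mean(log_liks_m0)


# Hypothetical recorded log-likelihoods from runs under M1 and M0.
# Under M0 every output has beta = 0, so all values coincide.
log_liks_m1 = [-1.0, -0.5, -0.8]
log_liks_m0 = [-6.0, -6.0, -6.0]
log_bf = log_bayes_factor(log_liks_m1, log_liks_m0)
```

Because the same proportionality constant appears in the likelihood under both models, it cancels in the ratio, so Equation (3) can be used without its normalizing term.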

The interpretation of the Bayes' factor depends on our prior beliefs about SNP association with the trait under investigation. On the basis of one million independent loci across the genome, plausible prior odds might be of the order of 10^{4}−10^{6} against association [The Wellcome Trust Case Control Consortium, 2007]. Consequently, a Bayes' factor of the same order of magnitude would be necessary to provide convincing evidence of association [Stephens and Balding, 2009]. Alternatively, we could approximate the Bayesian false-discovery probability [Wakefield, 2007], and could vary the prior probability of association of each SNP according to annotation and/or minor allele frequency [Wang et al., 2005].
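The relationship between prior odds, Bayes' factor and posterior odds is simple arithmetic: posterior odds equal the Bayes' factor multiplied by the prior odds. The sketch below works on the log_{10} scale; the function name and the example values are illustrative rather than taken from the analysis.

```python
def posterior_probability(log10_bf, log10_prior_odds_against):
    """Posterior probability of association.

    log10_bf                 -- log10 of the Bayes' factor in favor of association
    log10_prior_odds_against -- log10 of the prior odds against association
    """
    # Posterior odds in favor = Bayes' factor / prior odds against.
    posterior_odds = 10.0 ** (log10_bf - log10_prior_odds_against)
    return posterior_odds / (1.0 + posterior_odds)


# Prior odds of 10^5 against association.
p_equal = posterior_probability(5.0, 5.0)    # Bayes' factor of 10^5
p_strong = posterior_probability(6.0, 5.0)   # Bayes' factor of 10^6
```

A Bayes' factor that exactly matches the prior odds against association leaves the posterior probability at 0.5; a factor ten times larger raises it to 10/11, or about 0.91.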

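Output from the MCMC algorithm can be used directly to approximate the posterior distribution of the allelic effect, β_{i}, in the *i*th population. Over *R* outputs, the posterior mean of this distribution is approximated by

$$\hat{\beta}_i = \frac{1}{R}\sum_{r=1}^{R}\beta_i^{(r)} = \frac{1}{R}\sum_{r=1}^{R}\sum_{k=1}^{K^{(r)}} T_{ik}^{(r)}\psi_k^{(r)},$$

where *K*^{(r)}, **T**^{(r)} and **ψ**^{(r)} denote the number of clusters, tessellation and cluster allelic effects in the *r*th output.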

Output from the algorithm can also be used to approximate the posterior probability of heterogeneity in allelic effects between populations under the alternative model of SNP association with the trait, given by the proportion of MCMC outputs for which *K* is greater than one. The prior model, *f*(*K*), assumes allelic effects to be equally likely to be homogeneous or heterogeneous across populations, so that *f*(*K* = 1) = *f*(*K*>1) = 0.5. Thus, a posterior probability of heterogeneity of greater than 0.95 would provide strong evidence of a deviation from homogeneity in allelic effects across populations. In this case, the posterior probability of heterogeneity in allelic effects between any given pair of populations can be approximated by the proportion of MCMC outputs for which they are assigned to different clusters of the Bayesian partition model. These probabilities can be used to construct a dendrogram to represent the similarity between populations in terms of relatedness and allelic effects by application of average-linkage hierarchical clustering techniques [Hartigan, 1975].
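The posterior summaries described above are simple proportions and averages over recorded outputs. The following sketch assumes, purely for illustration, that each recorded output is stored as a pair of cluster labels (one per population) and cluster allelic effects; this storage format, the function name `summarise` and the example values are hypothetical and do not reflect the MANTRA output format.

```python
def summarise(outputs):
    """Summarise recorded MCMC outputs.

    outputs -- list of (labels, psi) pairs, where labels[i] is the cluster
               index of population i and psi[k] is the effect of cluster k.
    Returns (posterior means, P(K > 1), pairwise heterogeneity matrix).
    """
    n_out = len(outputs)
    n_pops = len(outputs[0][0])
    # Posterior mean allelic effect in each population.
    post_mean = [
        sum(psi[labels[i]] for labels, psi in outputs) / n_out
        for i in range(n_pops)
    ]
    # Proportion of outputs with more than one cluster.
    p_het = sum(1 for labels, _ in outputs if len(set(labels)) > 1) / n_out
    # Proportion of outputs in which a pair sits in different clusters.
    pairwise = [
        [sum(1 for labels, _ in outputs if labels[i] != labels[j]) / n_out
         for j in range(n_pops)]
        for i in range(n_pops)
    ]
    return post_mean, p_het, pairwise


# Four hypothetical recorded outputs for three populations.
outputs = [
    ([0, 0, 0], [0.20]),
    ([0, 0, 1], [0.10, 0.30]),
    ([0, 0, 1], [0.12, 0.28]),
    ([0, 1, 1], [0.10, 0.25]),
]
post_mean, p_het, pairwise = summarise(outputs)
```

The pairwise matrix can be treated as a dissimilarity matrix and passed to any average-linkage routine, such as `scipy.cluster.hierarchy.linkage` with `method="average"` applied to its condensed form, to draw the dendrogram.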

#### SOFTWARE AVAILABILITY

The MANTRA software has been developed to implement two independent runs of the MCMC algorithm, once each under *M*_{0} and *M*_{1}. For each variant, and each population, MANTRA requires the following information: (i) the effect allele; (ii) the estimated effect allele frequency; (iii) the estimated allelic effect (log-odds ratio in the context of a dichotomous phenotype) and the corresponding standard error. For each variant, the software will estimate the Bayes' factor, Λ, in favor of association and summarize the output of the MCMC algorithm. MANTRA is available, as a suite of executables, on request from the author.

The run-time of the algorithm, per SNP, depends crucially on the number of studies, but is feasible on the scale of the whole genome through efficient parallel processing. For example, application of the MANTRA software to the meta-analysis of 28 transethnic GWAS, imputed up to 2.5 million SNPs from the International HapMap Project [The International HapMap Consortium, 2007], took less than 1 week with a cluster of 32 dedicated processors.

### DISCUSSION

Meta-analysis of GWAS of primarily European-descent populations has been an extremely efficient approach to identifying novel loci contributing effects to complex traits by increasing sample size without de novo genotyping. The underlying assumption of traditional fixed-effects meta-analysis is that the allelic effect of a given variant is homogeneous across studies. For GWAS ascertained from the same or closely related populations, such an assumption is reasonable. The recent shared ancestry of these populations increases the likelihood that they will have the same underlying common causal variants, similar allele frequency spectra and local LD profiles. Exposure to potential nongenetic risk factors, such as diet, smoking, and pollution, which may interact with genotypes at causal variants, is also likely to be similar in European populations, further reducing the prospect of heterogeneity in allelic effects between them.

With the increasing availability of GWAS from more diverse populations, transethnic meta-analysis might be expected to further increase power to detect additional complex trait loci with ever more modest effects. However, with more diverse populations, less recent shared ancestry introduces greater opportunity for genetic heterogeneity, both in terms of the underlying causal variants and their allelic effect on the trait. Standard statistical methodology exists for assessing the evidence of heterogeneity in fixed-effects meta-analysis, such as *I*^{2} and Cochran's *Q*-statistic [Higgins and Thompson, 2002; Huedo-Medina et al., 2006; Ioannidis et al., 2007], and can thus be used to highlight populations with outlying allelic effects. In the presence of such allelic heterogeneity, these outlying populations could be removed, although potentially resulting in a reduction in power. On the other hand, random-effects meta-analysis, which assumes that each population has a different underlying allelic effect, can be used to overcome the problem of heterogeneity. However, this is also unsatisfactory since we expect populations from the same ethnic group to be more homogeneous than those that are more distantly related. A plausible alternative approach to transethnic meta-analysis would be to make use of a hierarchical model in which the allelic effect estimates for each population are considered as a function of indicator variables that represent ethnic group. This approach has the advantage over random-effects meta-analysis of allowing for similarity in allelic effects across populations from the same ethnic group. However, the assignment of populations to ethnic groups is prespecified by this prior classification, and cannot borrow from the observed allelic effect estimates to inform clustering.
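For reference, the standard heterogeneity statistics mentioned above can be computed directly from the study-level estimates. The sketch below uses the usual inverse-variance weighted fixed-effects estimate, Cochran's *Q*, and *I*^{2} = max(0, (*Q* − df)/*Q*); the function name and the input values are hypothetical.

```python
def fixed_effects_heterogeneity(b, s):
    """Inverse-variance fixed-effects estimate with Cochran's Q and I^2.

    b, s -- estimated allelic effects and standard errors, one per study
    Returns (pooled effect, its standard error, Q, I^2 as a proportion).
    """
    weights = [1.0 / s_i ** 2 for s_i in s]
    total_weight = sum(weights)
    beta = sum(w * b_i for w, b_i in zip(weights, b)) / total_weight
    se = (1.0 / total_weight) ** 0.5
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (b_i - beta) ** 2 for w, b_i in zip(weights, b))
    df = len(b) - 1
    # I^2: proportion of variation attributable to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return beta, se, q, i2


# Hypothetical summary statistics for four populations.
b = [0.10, 0.12, 0.30, 0.28]
s = [0.05, 0.05, 0.05, 0.05]
beta, se, q, i2 = fixed_effects_heterogeneity(b, s)
```

With these made-up estimates *Q* is 13.12 on 3 degrees of freedom and *I*^{2} is about 0.77, which flags heterogeneity but does not indicate which populations share an effect.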

In this article, I have addressed the challenges of allelic effect heterogeneity posed by transethnic meta-analysis of GWAS by considering the relatedness between the populations from which they have been ascertained. The Bayesian partition model provides a natural framework to take advantage of the expectation that more closely related populations are more likely to have similar allelic effects than those from diverse ethnic groups. The key advantage of this approach over a purely random effects analysis is that we can model the allelic heterogeneity between ethnic groups. Specifically, populations are clustered according to their “prior” similarity in terms of relatedness, typically using genomewide data to approximate their shared ancestry, and their semblance in terms of allelic effects at a specific variant under investigation. Populations within the same cluster are assumed to have the same underlying allelic effects at this variant. However, different clusters need not have the same underlying allelic effect. MANTRA can thus be thought of as a *hybrid* meta-analysis, incorporating both fixed (i.e. *within* cluster) and random (i.e. *between* clusters) effects.

The application of MANTRA to transethnic association studies of T2D at 19 variants in established susceptibility loci highlighted little evidence of heterogeneity in allelic effects between five diverse populations. However, there was overwhelming evidence of heterogeneity at rs7754840 in the *CDKAL1* locus. Allelic effects on T2D were in the same direction in all populations, but were considerably stronger in the closely related Japanese Americans and Native Hawaiians than in European Americans, Latinos, or African Americans. Such heterogeneity could arise as a result of multiple causal variants in *CDKAL1*, one of which is specific to the Japanese American and Native Hawaiian populations. However, this pattern of allelic effects could also arise with a single causal variant as a result of differences in the local LD structure between populations. In particular, rs7754840 may better capture the causal variant in the Japanese American and Native Hawaiian populations, which is not implausible given their recent shared ancestry. Interestingly, the lack of heterogeneity in allelic effects at the majority of established T2D loci suggests that the underlying causal variants are the same across ethnic groups, and hence pre-date any “out of Africa” population migration, which cannot be well modeled by “synthetic association” of multiple rare alleles [Dickson et al., 2010].

The results of the simulation study highlight that the hybrid meta-analysis implemented in MANTRA outperforms fixed-effects meta-analysis, both in terms of power to detect association, and localization of causal variants, over a range of models of heterogeneity in allelic effects between diverse populations. The greatest gains in power are achieved under a model of heterogeneity in which the causal variant has opposing effects in different populations, although it is not clear how realistic this scenario is likely to be. Under a model of homogeneous allelic effects across ethnic groups, there is no discernible loss in power or fine-mapping accuracy for the hybrid MANTRA analysis over fixed-effects meta-analysis. Furthermore, there are noticeable improvements in the localization of causal variants with MANTRA when applied to meta-analysis of transethnic, rather than intraethnic GWAS, even under a model of homogeneous allelic effects across populations. These improvements in the resolution of fine-mapping reflect transethnic differences in local LD patterns which cannot be leveraged from GWAS ascertained from the same population. The results of the simulation study also highlight advantages of the hybrid MANTRA analysis over random-effects meta-analysis, both in terms of power and localization of causal variants, when heterogeneity in allelic effects is well represented by the prior Bayesian partition model. Output from the MANTRA MCMC algorithm can also be used to represent the pattern of heterogeneity in allelic effects between populations, which cannot be achieved with random-effects meta-analysis.

The use of diverse populations from multiple ethnic groups will play an essential role in future GWAS. European-descent populations contain only a subset of human genetic variation, and thus cannot be used to identify causal variants across ethnic groups. This is particularly relevant for lower frequency causal variants, which are more likely to be population specific, but which have been hypothesized to contribute substantially to the missing heritability of complex traits [Frazer et al., 2009]. The reduced bias of GWAS genotyping products toward European genetic variation, and the increasing availability of large-scale resequencing reference panels from a wide range of ethnic groups, greatly improves the prospects of imputation across diverse populations. Efficient and powerful statistical methodology for the analysis of transethnic GWAS, such as the MANTRA software developed here, thus shows great promise for future improvements in our understanding of the genetic architecture of complex human traits.