Aruna T. Bansal, Acclarogen Ltd, St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK. Tel: (+44) 1223 421 662; Fax: (+44) 1223 420 844; E-mail: email@example.com
Rheumatoid arthritis (RA) is strongly associated with the human leukocyte antigen (HLA) genomic region, most notably with a group of HLA-DRB1 alleles termed the shared epitope (SE). There is also substantial evidence of other risk loci in the HLA region, but refinement has been hampered by extensive linkage disequilibrium (LD). Using genotype imputation, we analysed 6575 RA cases and controls with genotypes at 6180 HLA SNPs; about half the subjects had four-digit DRB1 genotypes. Single-SNP tests revealed hundreds of strong associations across the HLA region, even after adjusting for DRB1. We implemented penalised logistic regression in a multi-SNP association analysis, using either the double-exponential (DE) or the normal-exponential-gamma (NEG) penalty term on the regression coefficients. The penalised approaches identified sparse sets of SNPs that could collectively explain most of the association with RA over the whole HLA region. The HLA-DPB1 SNP rs3117225 was consistently identified in our analyses and was confirmed by results from the North American Rheumatoid Arthritis Consortium (NARAC) study. We conclude that SNP selection using penalised regression shows a substantial benefit over single-SNP analyses in identifying risk loci in regions of high LD, and the flexibility of the NEG conveys additional advantages.
Rheumatoid arthritis (RA) is a complex autoimmune disorder that affects approximately 1% of the population worldwide (Silman & Pearson, 2002). The cause of RA has not been established, but both environmental and genetic factors appear to play important roles (Firestein, 2003; Oliver & Silman, 2006). The HLA-DRB1 gene within the class II Human Leucocyte Antigen (HLA) region of the genome is strongly associated with RA susceptibility. A group of HLA-DRB1 alleles, known collectively as the shared epitope (SE), encode a similar amino acid sequence located in the peptide-binding groove of the protein (Gregersen et al., 1987). In our previous investigation of the Genetics of Rheumatoid Arthritis (GoRA) study (Vignal et al., 2009) we found that the addition of HLA SNPs in a backwards stepwise regression led to a significantly better model than that based only on the number of SE+ alleles. However, there has also been considerable debate about whether a binary SE+/SE- classification is sufficient to describe the RA risk for all DRB1 alleles.
There are many strong HLA associations outside DRB1 (Vignal et al., 2009; Jawaheer et al., 2002; Kilding et al., 2004; Newton et al., 2003; Ota et al., 2001; Singal et al., 1999; Zanelli et al., 2001), but extensive linkage disequilibrium (LD) in the HLA region has retarded progress in fine-mapping the causal variants underlying these signals. Multi-SNP analyses can help to disentangle the effects of LD and identify a set of distinct contributors to disease risk. Forward and backward stepwise regressions are traditional model selection methods but are unsatisfactory in the presence of many correlated predictors, due to computational demands and instability of the selected model. Penalised regression provides a better approach to overcoming the problems of over-fitting and of multiple correlated predictors (Ayers & Cordell, 2010). The penalty term imposed on the likelihood can have the effect of shrinking the maximum likelihood estimates towards zero, reducing over-fitting. Moreover, with an appropriate penalty term penalised regression can reduce the problem of correlated predictors by in effect selecting one or a few SNPs that best represent an association signal shared across a set of SNPs in high LD.
If the penalty term can be interpreted as a probability density, penalised regression is equivalent to maximum a posteriori (MAP) estimation of the regression coefficients using this prior density. Although it is then convenient to use the Bayesian language of prior and posterior, for computational reasons we do not derive the full Bayesian posterior distribution for the regression coefficients, but only the MAP estimates.
A commonly-used prior for penalised regression is the double exponential (DE). In linear regression, use of the DE prior corresponds to the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm (Tibshirani, 1996). For the logistic regression equivalent, we used the Bayesian Binary Regression (BBR) software (Genkin et al., 2007). If the DE scale parameter is large, corresponding to a prior belief that the most likely effect sizes are close to zero, the resulting MAP estimates are often exactly zero thus leading to sparse models. We also considered a generalisation of the DE prior called the Normal-Exponential-Gamma (NEG), implemented in the HyperLASSO (HLASSO) algorithm (Hoggart et al., 2008). The NEG has both shape and scale parameters, and smaller shape values correspond to increasing steepness of the density curve near zero but flatter tails, which is expected to lead to sparse sets of nearly-uncorrelated SNPs.
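To make the link between the DE prior and the penalty explicit, the MAP objective can be written out as below. This is a sketch under stated assumptions: the notation ($L(\beta)$ for the log-likelihood, $\beta_j$ for the coefficient of SNP $j$, $p$ for the number of SNPs) is introduced here for illustration, and the DE density is written with a rate-type parameter $\lambda$, so that a larger $\lambda$ concentrates the prior near zero as described above.

```latex
% DE (Laplace) prior on each coefficient, assumed independent across SNPs
p(\beta_j \mid \lambda) = \frac{\lambda}{2}\exp\left(-\lambda\,|\beta_j|\right)

% MAP estimate: log-likelihood plus log-prior, dropping constants in \beta
\hat{\beta}
  = \arg\max_{\beta}\Big\{ L(\beta) + \sum_{j=1}^{p}\log p(\beta_j \mid \lambda) \Big\}
  = \arg\max_{\beta}\Big\{ L(\beta) - \lambda\sum_{j=1}^{p}|\beta_j| \Big\}
```

Under this parameterisation the penalty is the familiar L1 norm of the coefficients, which is why sufficiently large $\lambda$ sets many components of $\hat{\beta}$ exactly to zero.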
Genotype imputation using MACH (Li et al., 2006; Li et al., 2009) and the 60 CEU HapMap parents was employed to combine the GoRA study with the Wellcome Trust Case Control Consortium (WTCCC) controls and RA cases (The Wellcome Trust Case Control Consortium, 2007) (Fig. 1A). After quality control (QC), the combined dataset included 2679 RA cases and 3896 controls with genotypes at 6180 SNPs spanning the HLA region. Only 5% of SNPs were genotyped in both WTCCC and GoRA, while 56% of SNPs were not genotyped in either study but entirely imputed from the HapMap (Fig. 1B). The GoRA and HapMap subjects, and about half of the WTCCC controls, had available four-digit HLA-DRB1 alleles; SE and also DRB1*0101, one of the SE alleles, were imputed in the remaining individuals.
We employed penalised regression on the combined dataset to identify a sparse set of uncorrelated SNPs that explains most of the non-DRB1 association with RA. Bootstrapping was implemented to investigate the robustness of the models derived, and the North American Rheumatoid Arthritis Consortium (NARAC) (Plenge et al., 2007) study was used to validate the findings.
The GoRA and WTCCC Studies
GoRA study subjects have previously been described (Vignal et al., 2009). To summarize, DNA samples were available for 1832 subjects, of whom 29 were excluded due to average genotype call-rate < 70%. Cases (n = 845) and controls (n = 958) were all Caucasian from Sheffield, UK. Research ethics committee approval was obtained for the study (SSREC protocol number 02/186), and all participants gave informed consent. After QC, the WTCCC study included 1834 RA cases and 2938 controls (Li et al., 2006), all Caucasians and recruited throughout the UK. Controls were recruited from the 1958 British Birth Cohort (BC58, n = 1480) and the UK Blood Services (n = 1458). The cases for both studies fulfilled the 1987 American College of Rheumatology criteria for RA. The WTCCC “population controls” are unphenotyped, but since RA is rare they are expected to differ little from controls confirmed to be free of RA (McCarthy et al., 2008).
The case-control ratio differs across the studies. Further, cases and controls were matched on sex in GoRA, but in the WTCCC 75% of cases are female compared with 50% of controls. We thus included in all regression models two binary variables indicating (1) whether the subject was male and from the WTCCC study (maleWT) and (2) whether the subject was female and from the WTCCC (femWT).
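The coding of the two sex/study covariates can be sketched as follows. The string codes for sex and study are hypothetical; only the definitions of maleWT and femWT come from the text.

```python
def sex_study_indicators(sex, study):
    """Return (maleWT, femWT) for one subject.

    sex   : "M" or "F"          (hypothetical coding)
    study : "GoRA" or "WTCCC"   (hypothetical coding)
    """
    in_wtccc = study == "WTCCC"
    male_wt = int(in_wtccc and sex == "M")  # 1 if male and from WTCCC
    fem_wt = int(in_wtccc and sex == "F")   # 1 if female and from WTCCC
    return male_wt, fem_wt
```

With this coding all GoRA subjects, of either sex, form the reference category (0, 0), which is consistent with cases and controls being sex-matched within GoRA.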
SNP and HLA Genotyping
Genotype data included SNPs from the GoRA, WTCCC and Phase II HapMap samples located within the extended HLA region from position 28808331 to 33458152 (NCBI35) (Horton et al., 2004) and that passed quality control in the different studies. The GoRA samples were genotyped at 2302 SNPs from the Illumina MHC panel chosen for uniform coverage except for additional SNP density near or within exons. The WTCCC samples were genotyped at 1193 HLA SNPs using Affymetrix SNP Array 5.0. The HapMap samples were genotyped at 9738 HLA SNPs. High resolution (4-digit) HLA-DRB1 typing was also available for the GoRA and HapMap samples and for the BC58 WTCCC controls (Power & Elliott, 2006). Due to the complexity of typing HLA loci, some subjects had only one DRB1 allele reported and were re-coded as missing. If only 2-digit resolution was available, we used the “allele code tool” of the National Marrow Donor Program (http://bioinformatics.nmdp.org/HLA/hla_res_idx.html) to see if all 4-digit possibilities were of the same SE class; if so the allele was coded accordingly and otherwise as missing. Overall, 842 GoRA cases, 957 GoRA controls, 1327 BC58 WTCCC controls and 48 CEU HapMap subjects were assigned DRB1 genotypes.
The following HLA-DRB1 alleles were classified as shared epitope positive (SE+) according to the presence of the amino acid sequences QKRAA, QRRAA or RRRAA at positions 70–74 of the third hyper-variable region of the DRB molecule (Gregersen et al., 1987): *0101, *0102, *0104, *0401, *0404, *0405, *0408, *0409, *0426, *0430, *0433, *0438, *0440, *1001, *1113 and *1421; all other alleles were SE-. The nine DRB1 alleles having allele fraction > 2% in the GoRA sample, used as the starting state in the backwards regression to identify the best model for DRB1 effect on RA, were *0101, *0301, *0401, *0404, *0701, *1101, *1301, *1302 and *1501.
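The SE classification and the handling of 2-digit calls described above can be sketched as below. The SE+ allele list is taken from the text; the function names and the representation of alleles as 4-character strings are assumptions made for illustration.

```python
# SE+ alleles as listed in the text (4-digit HLA-DRB1 names without the '*').
SE_POSITIVE = {
    "0101", "0102", "0104", "0401", "0404", "0405", "0408", "0409",
    "0426", "0430", "0433", "0438", "0440", "1001", "1113", "1421",
}

def resolve_two_digit(candidates):
    """SE class for a 2-digit call, given its possible 4-digit alleles.

    Returns True or False if every candidate has the same SE class,
    and None (missing) otherwise.
    """
    classes = {allele in SE_POSITIVE for allele in candidates}
    return classes.pop() if len(classes) == 1 else None

def allele_counts(genotype):
    """Return (number of SE+ copies, number of *0101 copies).

    genotype is a pair of 4-digit alleles; if either allele is None
    the genotype is treated as missing and None is returned.
    """
    if any(allele is None for allele in genotype):
        return None
    n_se = sum(allele in SE_POSITIVE for allele in genotype)
    n_0101 = sum(allele == "0101" for allele in genotype)
    return n_se, n_0101
```

Because *0101 is itself SE+, a copy of *0101 increments both counts.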
MACH v 1.0.16 (Li et al., 2006; Li et al., 2009) was used for all genotype imputation, including that of SE and *0101. Genotypes at 9882 SNP markers from the GoRA, WTCCC and HapMap samples were combined, ensuring consistent strand orientation. Haplotypes, recombination rates and mutation rates were estimated by MACH using 100 Markov chain iterations following a burn-in of 25 iterations. The estimated r2 measure provided by MACH was used to assess the quality of the imputed genotypes. This compares the variance of the estimated genotypes with what would be expected for error-free genotypes given the allele proportions. A threshold of r2 > 0.3 was recommended by Li et al. (2006), but to ensure high data quality we imposed the stricter requirement of r2 > 0.6, leading to the rejection of 1001 SNPs. As a check on the imputation of SE, SE calls were masked in the 50% of WTCCC controls for which they were actually available; the mean r2 over all individuals was the same (0.89) as when no masking was employed. Some differences in allele proportions between GoRA and WTCCC controls are legitimate, but because both studies sampled UK Caucasians any such differences are expected to be small. Examining the 289 markers genotyped in both studies revealed only three with allele proportions in controls significantly different at P < 10−4 (two-sided t-test). We removed 2703 imputed SNPs significant at this level. These stringent QC measures predominantly eliminated SNPs that were entirely imputed from the HapMap samples (Fig. 1). As a measure of the quality of the SNPs retained for analysis, we repeated the imputation with 5% masking of recorded genotypes from the GoRA and WTCCC studies, and compared the observed genotype with the most likely imputed genotype. Over 95% of SNPs genotyped in either study had < 1% errors in imputing the masked genotypes, while over 99% of SNPs had error rates < 3%.
Note that the upper tail of the SNP error rate distribution is inflated by sampling noise from the low numbers of masked genotypes, and that masking itself reduces the accuracy of imputation, so true error rates are likely to be even lower than these figures suggest.
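The idea behind the r2 quality measure can be illustrated with a minimal sketch. It assumes the expected variance of error-free genotypes is the Hardy-Weinberg value 2p(1 - p), with p estimated from the mean dosage; the exact computation inside MACH may differ, and the function name is hypothetical.

```python
from statistics import fmean, pvariance

def imputation_r2(dosages):
    """Ratio of observed dosage variance to the variance expected for
    error-free genotypes with the same allele proportion (a sketch).

    dosages: expected minor allele counts (0 to 2), one per individual.
    """
    p = fmean(dosages) / 2.0              # estimated allele proportion
    expected_var = 2.0 * p * (1.0 - p)    # Hardy-Weinberg genotype variance (assumed)
    if expected_var == 0.0:
        return float("nan")               # monomorphic SNP: measure undefined
    return pvariance(dosages) / expected_var
```

Uncertain imputation shrinks dosages towards their mean, which reduces their variance and hence the ratio; for example, the dosages [0.5, 1, 1, 1.5] give 0.25, whereas the hard calls [0, 1, 1, 2] give 1.0.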
The expected minor allele count used in the regression analyses was obtained for each individual at each SNP as $2P_{aa} + P_{Aa}$, where $P_i$ is the probability of genotype $i$, $a$ is the rare allele and $A$ is the common allele.
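As a small worked example of this quantity, the sketch below computes the expected minor allele count from the three genotype probabilities. The function name and the tolerance used to check that the probabilities sum to one are assumptions.

```python
def expected_minor_allele_count(p_AA, p_Aa, p_aa):
    """Expected number of copies of the rare allele a: 2*P(aa) + P(Aa)."""
    total = p_AA + p_Aa + p_aa
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return 2.0 * p_aa + p_Aa
```

A confidently imputed heterozygote (0, 1, 0) gives a count of 1, while an uncertain call such as (0.9, 0.1, 0) gives 0.1, so imputation uncertainty is carried into the regression variable.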
The NARAC Study
The North American Rheumatoid Arthritis Consortium (NARAC) study (Plenge et al., 2007) included 908 cases and 1260 controls, all of self-reported white ancestry. All cases carried antibodies directed against cyclic citrullinated peptides (i.e. were anti-CCP+) and fulfilled the 1987 American College of Rheumatology criteria for RA. Genotypes were available for 1,625 SNPs spanning the HLA region, as well as 4-digit HLA-DRB1 typing. The SE+ and *0101 indicator variables were imputed for individuals without a full four-digit DRB1 genotype. The three HLASSO-selected SNPs from the combined dataset (rs3117225, rs3132472 and rs3132671) were also imputed, but no other SNPs were imputed.
Population substructure was assessed using genomic control (Devlin & Roeder, 1999). We had available 589 non-HLA genome-wide markers in low LD that were genotyped in both GoRA and WTCCC and had less than 5% missing data. The markers were from a prior pharmacogenetic candidate gene study and were not selected to be ancestry-informative but only to be in approximate linkage equilibrium. 6414 subjects (nGoRA= 1643, nWTCCC= 4771) were used to estimate an inflation factor λ of 1.09 in the combined dataset. This indicates modest evidence of population structure, but because a study indicator was included in all our regression models only structure within either the GoRA or WTCCC studies could contribute to a spurious association signal. The within-study λ values of 1.02 and 1.03 indicate little cause for concern. Therefore no covariates were introduced into the regression analyses of the combined dataset to account for population structure.
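A common way to estimate the genomic control inflation factor is to divide the median of the observed 1-df chi-square association statistics by the median of the chi-square distribution with one degree of freedom (about 0.455). The sketch below uses that estimator; the text does not state which variant was used, so this choice is an assumption.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control_lambda(chi2_stats):
    """Median-based inflation factor from 1-df chi-square statistics
    computed at null (here, non-HLA) markers."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value close to 1 indicates that the null markers behave as expected in the absence of population structure.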
Penalised Logistic Regression
In penalised regression, rather than maximising the log-likelihood $L(\beta)$ to estimate the vector of regression coefficients $\beta$, we seek $\hat{\beta} = \arg\max_{\beta}\{L(\beta) - f(\beta)\}$, where $f(\beta)$ is a penalty term such that $-f(\beta)$ is often interpretable as the log of a prior density, in which case $\hat{\beta}$ is the maximum a posteriori (MAP) estimate. The BBR algorithm allows either a Gaussian or a double exponential (DE) prior density; in either case independence is assumed for each predictor. The Gaussian prior density corresponds to ridge regression, but is unsuitable here because $\hat{\beta}$ is then typically non-zero in every component. We adopt the DE prior with scale parameter $\lambda$ sufficiently large that most components of $\hat{\beta}$ are zero, reflecting no association with phenotype. We chose the smallest value of $\lambda$ consistent with no more than one false positive among the 6180 SNPs, on average over the 100 null simulations. The simulations used the combined dataset with case-control labels randomly permuted within SE genotype classes (0, 1 or 2 SE+ alleles), in order to preserve the SE effect but eliminate any other association with case-control status.
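The null simulations rely on permuting case-control labels within SE genotype classes. A minimal sketch of that permutation step is given below; the function name and the list-based data layout are assumptions, and the penalised model fit applied to each permuted dataset is not shown.

```python
import random
from collections import defaultdict

def permute_within_strata(labels, strata, rng):
    """Shuffle case-control labels separately within each stratum.

    labels : list of 0/1 case-control labels
    strata : list of SE+ allele counts (0, 1 or 2), one per subject
    rng    : random.Random instance, for reproducibility
    """
    by_stratum = defaultdict(list)
    for i, s in enumerate(strata):
        by_stratum[s].append(i)

    permuted = list(labels)
    for idx in by_stratum.values():
        shuffled = [labels[i] for i in idx]
        rng.shuffle(shuffled)
        for i, lab in zip(idx, shuffled):
            permuted[i] = lab
    return permuted
```

Because labels only move between subjects with the same number of SE+ alleles, the number of cases in each SE class is unchanged, so the SE effect is preserved while any other association is broken.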
The normal-exponential-gamma (NEG) prior distribution has been developed to improve on the DE to better deal with multi-collinearity among predictors (Hoggart et al., 2008). As the NEG shape parameter γ increases the NEG converges to the DE, but as γ decreases, the NEG density becomes more sharply peaked at the origin and has flatter tails. Hoggart et al. (2008) sought tails as flat as possible and chose γ= 0.05 as the smallest value of γ that avoided computational difficulties. However, Table S1 shows that this implies an unrealistically high probability assigned to large effect sizes. It is unclear to what extent this lack of realism is problematic since we do not conduct a full Bayesian analysis. Stephens & Balding (2009) considered γ values between 1 and 2 and showed that these could give more realistic effect size distributions. We decided to adopt γ= 1 for all our analyses. Scale parameters η= 0.009 and 0.026 were selected for the combined GoRA/WTCCC and NARAC studies, respectively; these give on average one false positive result per analysis.
A bootstrap approach was used to investigate the repeatability of the SNP selection by BBR and HLASSO logistic regression. SNPs with r2≥ 0.90 were grouped together using the pair-wise tagging algorithm in Haploview (Barrett et al., 2005) and these groups were treated as a single SNP for frequency counts and ranks. Among a thousand bootstrap models, BBR selected 268 distinct SNP (groups) at least once, and the HLASSO identified 391.
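The bootstrap procedure can be sketched as follows. The `select` argument is a hypothetical callable standing in for a BBR or HLASSO fit: it takes a list of resampled subject indices and returns the identifiers of the selected SNPs (or SNP groups, after grouping SNPs in high LD). The function name and interface are assumptions.

```python
import random
from collections import Counter

def bootstrap_selection_frequency(n_subjects, select, n_boot, rng):
    """Fraction of bootstrap replicates in which each SNP (group) is selected.

    n_subjects : number of subjects in the original dataset
    select     : callable(sample_indices) -> iterable of selected SNP ids
    n_boot     : number of bootstrap replicates
    rng        : random.Random instance, for reproducibility
    """
    counts = Counter()
    for _ in range(n_boot):
        # Resample subjects with replacement.
        sample = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        # Count each SNP (group) at most once per replicate.
        counts.update(set(select(sample)))
    return {snp: c / n_boot for snp, c in counts.items()}
```

Ranking SNPs by these frequencies gives a simple summary of how stable the selection is under perturbation of the dataset.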
Association of RA with HLA-DRB1 Alleles and Covariates
We employed backwards regression in the GoRA study, starting from a model that included the numbers of SE+ alleles (0, 1 or 2) plus the numbers of each of the common DRB1 alleles. Each variable was analysed as an additive effect on the logistic scale. Only DRB1*0101 and SE+ remained at the termination of the analysis. Note that since DRB1*0101 is an SE+ allele, the two variables are necessarily correlated. This model was significantly better than the number of SE+ allele copies alone (P= 4.0×10−4). The RA risk conveyed by *0101 appears to be lower than for other SE+ alleles, and because *0101 is a common allele this difference is highly significant. Modelled individually, the numbers of SE+ and *0101 allele copies had estimated odds ratios (OR) for RA of 3.5 and 1.5, with 95% confidence intervals (CI) [3.0, 4.1] and [1.2, 1.8], respectively. These effect sizes were replicated in the WTCCC study using part-imputed SE and *0101 variables (Table 1). Modelled jointly, the numbers of SE+ and *0101 alleles had estimated OR for RA of 4.0 and 0.6 with 95% CIs [3.4, 4.8] and [0.5, 0.8]. Note that the imputation ignored case-control labels and so could not have contributed to a spurious similarity of effect size estimates across studies. We cannot say whether the reduced risk for the *0101 allele is due to an intrinsic difference between it and other SE+ alleles, or to LD between *0101 and one or more effect-modifying SNPs.
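One way to read the joint model is to combine the two per-allele odds ratios on the log-odds scale. The sketch below does this using the point estimates quoted above (4.0 for SE+ and 0.6 for *0101); it assumes the additive logistic model holds exactly and ignores the uncertainty expressed by the confidence intervals. The function name is hypothetical.

```python
import math

def genotype_odds_ratio(n_se, n_0101, or_se=4.0, or_0101=0.6):
    """Odds ratio relative to a genotype with no SE+ alleles.

    n_se   : number of SE+ allele copies (includes any *0101 copies)
    n_0101 : number of *0101 allele copies
    """
    log_or = n_se * math.log(or_se) + n_0101 * math.log(or_0101)
    return math.exp(log_or)
```

Under these assumptions a single *0101 copy corresponds to an odds ratio of about 4.0 × 0.6 = 2.4, compared with 4.0 for a single copy of another SE+ allele, which illustrates the lower risk attributed to *0101.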
Table 1. Odds ratios and association test results for the number of SE+ and *0101 allele copies, modelled as individual predictors, additive on the logistic scale.
Study    Number of SE+ alleles1, OR (95% CI)    Number of *0101 alleles, OR (95% CI)
GoRA     3.5 (3.0; 4.1)                          1.5 (1.2; 1.8)
WTCCC    3.6 (3.2; 4.0)                          1.6 (1.4; 1.9)
1For comparison with other analyses, we adhere to the convention that the *0101 allele is included in the SE+ group of alleles in all analyses.
Single-SNP Association Tests
Fig 2 shows results from additive-model single-SNP logistic regression analyses of the combined GoRA and WTCCC studies, including as covariates the numbers of SE+ and *0101 allele copies and two sex/study variables (see Methods). Compared with adjustment on SE+ only, including the *0101 adjustment reduced the number of SNPs significant at P < 8.1×10−6 (family-wise type 1 error rate 0.05) from 606 to 467 (Table 2). However, this still leaves an unmanageably large set of strongly-associated SNPs, from across the HLA region (Fig. S1), which are likely to correspond to a much smaller set of underlying causal variants.
Table 2. Number of SNPs significant at 10−3, 1.6×10−4 and 8.1×10−6 in adjusted logistic regression analyses.
l Expected under null
Adjustment (all models adjusted for sex/study)
SE & *0101
SE, *0101 & 3 HLASSO SNPs
Penalised Logistic Regression Using BBR and HLASSO
For the DE scale parameter λ, we selected a value of 182, chosen to correspond to an average under the null of one false positive among the 6180 SNPs, after adjusting for the SE effect (see Methods). No penalty was applied to the covariates (SE+, *0101, sex/study), corresponding to a uniform prior. All 6180 SNPs were analysed simultaneously, as additive effects (on the logistic scale), using the BBR algorithm, which identified five SNPs (Table 3). The MICA SNP rs3132472 had the smallest single-SNP P-value, whereas the ranks of the remaining four BBR-selected SNPs among the single-SNP P-values were: 127 for rs3094139, 132 for rs3132671, 139 for rs3117225 and 140 for rs9277553. None of the five BBR-selected SNPs was correlated with the numbers of SE+ or *0101 allele copies (Fig. 3), but two pairs of SNPs are in high LD (r2 > 0.90); these are in the vicinity of HLA-DPB1 and TRIM26.
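The structure of this analysis, a DE (L1) penalty on all SNP coefficients with unpenalised covariates, can be illustrated with a minimal sketch. This is not the BBR implementation: proximal gradient descent is swapped in here for brevity, and the function names, step-size rule and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_de_logistic(snps, covars, y, lam, n_iter=5000):
    """L1-penalised logistic regression with unpenalised covariates.

    snps   : (n, p) array of expected allele counts (penalised)
    covars : (n, k) array of covariates, e.g. SE+, *0101, sex/study
             (unpenalised; an intercept is added automatically)
    y      : (n,) array of 0/1 case-control labels
    lam    : DE penalty parameter applied to every SNP coefficient
    Returns (covariate coefficients incl. intercept, SNP coefficients).
    """
    X = np.asarray(snps, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), np.asarray(covars, dtype=float)])
    A = np.hstack([Z, X])
    q = Z.shape[1]
    # Safe step size: the gradient's Lipschitz constant is at most
    # 0.25 * trace(A^T A) = 0.25 * sum of squared entries of A.
    step = 4.0 / (A ** 2).sum()
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(A @ w)))   # fitted probabilities
        w = w + step * (A.T @ (y - mu))       # gradient ascent on log-likelihood
        w[q:] = soft_threshold(w[q:], step * lam)  # penalise SNPs only
    return w[:q], w[q:]
```

The soft-thresholding step is what produces exact zeros: any SNP whose likelihood gradient never exceeds `lam` in absolute value stays at zero, so increasing `lam` yields sparser models.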
Table 3. SNPs selected by BBR and HLASSO.
Selected by: B = BBR H = HLASSO
Genotyped in: H = HapMap G = GoRA W = WTCCC
1Single-SNP association P after adjustment for SE +, *0101, sex and study.
B + H
B + H
H + G
B + H
H + G + W
The shape and scale NEG parameters for the HLASSO analysis were chosen to be γ= 1 and η= 0.009 (see Methods). As for BBR, these choices imply less than one false positive per model (given one additive predictor per SNP). One hundred iterations of the HLASSO algorithm were performed to seek multiple posterior modes, of which we retained the largest (Hoggart et al., 2008). Three SNPs were included in the resulting model (Table 3), all of which were also selected by BBR. These markers were not in LD with each other or with SE+ or *0101 (Fig. 3). The SNPs rs3117225 and rs3132472 were still associated with RA at P < 10−10 in a multi-SNP logistic regression analysis that included the three HLASSO SNPs, the numbers of SE+ and *0101 allele copies and the two sex/study covariates (Table 4).
Table 4. Multi-SNP P-value of the SNPs selected by HLASSO.
OR (95% confidence interval)
1Multi-SNP association P-value from a model including rs3117225, rs3132472, rs3132671, SE +, *0101, sex and study as covariates. maleWT is one if the subject is male and from WTCCC and zero otherwise. femWT is one if the subject is female and from WTCCC and zero otherwise.
0.7 (0.6, 0.8)
1.5 (1.3, 1.7)
1.2 (1.1, 1.3)
0.4 (0.3, 0.4)
1.1 (1.0, 1.3)
4.1 (3.7, 4.5)
0.7 (0.6, 0.8)
Fig 2 also shows the locations within the HLA region of the SNPs selected by the BBR and HLASSO logistic regression analyses. We repeated the single-SNP association tests, but now including the three HLASSO SNPs as covariates in the logistic regression, in addition to the two DRB1 and two sex/study covariates (Table 4). No SNPs were significant at 1.6 × 10−4 (= 1/6180), compared with 689 without the HLASSO SNPs (Table 2). Twelve SNPs remained significant at 10−3, only a modest excess over the six expected under the null and far fewer than the 935 SNPs significant at 10−3 without conditioning on the HLASSO SNPs. Thus, the HLASSO has identified three SNPs that together explain almost all of the hundreds of strong non-DRB1 associations with RA at SNPs across the HLA region.
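The significance thresholds used in these comparisons follow from simple arithmetic on the number of SNPs, as the short sketch below shows. The function names are hypothetical; the expected count under the null assumes only that each SNP's P-value is uniformly distributed when there is no association.

```python
N_SNPS = 6180  # SNPs retained for analysis

def bonferroni_threshold(family_wise_alpha, n_tests):
    """Per-test threshold giving the stated family-wise type 1 error rate."""
    return family_wise_alpha / n_tests

def expected_null_hits(threshold, n_tests):
    """Expected number of SNPs below the threshold under the global null."""
    return threshold * n_tests

bonferroni = bonferroni_threshold(0.05, N_SNPS)    # about 8.1e-6
one_expected = 1.0 / N_SNPS                        # about 1.6e-4
hits_at_1e3 = expected_null_hits(1e-3, N_SNPS)     # about 6.2
```

These reproduce the three thresholds discussed in the text: 8.1 × 10−6 for a family-wise error rate of 0.05, 1.6 × 10−4 for one expected false positive, and roughly six SNPs expected below 10−3 under the null.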
Robustness of the SNP-selection
Bootstrap simulations confirmed that the five BBR-selected SNPs remained the most often selected under perturbations of the dataset (Table 5). Reproducibility was good for the two DPB1 SNPs (80% of bootstrap replicates), almost 60% for the MICA SNP, but less than 40% for the two TRIM26 SNPs. Reproducibility was slightly worse for the HLASSO bootstrap. These results indicate that, while our models succeed in accounting for much of the non-DRB1 association signal across the HLA, there are many similar models that are almost as good.
Table 5. Bootstrap results for SNPs selected by BBR and HLASSO.
1Frequency and rank among 1,000 bootstrap models. Groups of SNPs of which all pairs have (r2≥ 0.9) were counted as a single SNP; this affects the pairs rs9277553/rs3117225 and rs3132671/rs3094139.
The NARAC Study
The NARAC study cannot be considered an exact replication study, since its RA patients all tested positive for anti-cyclic citrullinated peptide antibody (anti-CCP+). This antibody is detected by means of a blood test and may be used in the diagnosis of RA. The GoRA study included mostly anti-CCP+ cases (85%), but no information about the anti-CCP status of the WTCCC RA cases was provided. The results from this analysis must therefore be interpreted with care. We note that 79% of NARAC cases were female, compared with 75% of WTCCC cases and 72% of GoRA cases. Eleven SNPs were retained in the HLASSO model in addition to SE and *0101 (Table S2). Among these, the fully-imputed DPB1 SNP rs3117225 was retained, providing further support for a role for DPB1 in RA susceptibility. The other two HLASSO SNPs from our combined analysis were not selected in the HLASSO analysis of the NARAC study, but were significant in single-SNP analyses (P= 4.2×10−6 and 4.1×10−5, respectively).
We have confirmed the well-known association of the SE with RA susceptibility in both the GoRA and WTCCC case-control studies, but found that the modelling of RA risk at HLA-DRB1 alleles was significantly improved by including the number of *0101 allele copies in addition to the number of SE+ alleles. Using these two covariates in addition to two sex/study indicators, we employed a multi-SNP penalised regression analysis to investigate non-DRB1 HLA risk loci for RA in a combined GoRA/WTCCC dataset obtained using imputation. The BBR algorithm, employing a double exponential prior penalty, selected five SNPs including two highly-correlated pairs. The HLASSO algorithm uses an NEG prior/penalty with sharper peak at the origin and flatter tails; it gave the same answer as BBR except that one of each correlated pair was dropped, leaving rs3117225 near HLA-DPB1, rs3132472 near MICA and rs3132671 in TRIM26 as the SNPs in the fitted model. Conditioning on these three SNPs in a multi-SNP regression model showed that they explained almost all the non-DRB1 association with RA across the HLA region, which for example generated 467 SNPs significant at 8.1×10−6 (family-wise type I error 0.05) without any adjustment for non-DRB1 SNPs (Table 2). Thus, HLASSO logistic regression has succeeded in identifying a sparse model of nearly-uncorrelated SNPs that can explain the complex pattern of association with RA across the HLA region.
MICA has previously been reported as associated with RA (P < 10−4) in the GoRA study (Vignal et al., 2009), and also as having a DRB1-independent association with RA in a Spanish family-based study (Martinez et al., 2001). No genetic association has previously been reported between TRIM26 and RA, but there is biological evidence for a role of genes in the TRIM family in autoimmune disease; they have protein-degrading functions that are involved in normal cell growth and immune functions (Horton et al., 2004; Meyer et al., 2003).
In contrast to TRIM26 and MICA, the association of RA with rs3117225 in HLA-DPB1 was replicated in the NARAC study, which unlike GoRA and WTCCC included only anti-CCP+ cases. This is the most important subtype of RA, typically comprising about two thirds of RA patients (de Vries et al., 2005). This difference in phenotype definition and the smaller size of the NARAC study may both have contributed to a failure to replicate the MICA and TRIM26 associations in the HLASSO analyses, even though both were significant in single-SNP analyses of the NARAC study. Population structure may also have had a confounding role (Taylor & Criswell, 2009), since the NARAC study included European-Americans, who have more diverse ancestries than UK Caucasians. HLA-DPB1 has recently been associated with RA (Ding et al., 2009; Lee et al., 2008). Ding et al. (2009) analysed samples from the Swedish Epidemiological Investigation of RA (EIRA) study and identified rs3117213, also in HLA-DPB1, as a DRB1-independent risk locus associated with anti-CCP+ RA in a single-SNP association analysis adjusted for HLA-DRB1. They also reported replication of this association in the NARAC study. The SNP rs3117213 was not genotyped in GoRA, but analysis of the HapMap CEU samples revealed that it is in high LD with the DPB1 SNP that we report here, rs3117225 (r2= 0.96).
A previous analysis of the GoRA study (Vignal et al., 2009) identified SNPs rs4678 and rs2442728 as associated with RA. These SNPs were significant in the single-SNP association tests in the combined dataset with P= 2.8×10−8 for rs2442728 and P= 5.3×10−10 for rs4678, but the two penalised regression algorithms did not select them. Similarly, none of the reported WTCCC associations with RA (The Wellcome Trust Case Control Consortium, 2007) were replicated in our combined analysis, which is likely to be due to the WTCCC analysis not adjusting for the strong effect on RA of HLA-DRB1. Note that almost all of the hundreds of HLA SNPs showing strong association with RA are likely to correspond to true associations in the sense that the signal reflects LD with one or more causal variants. Our goal is to identify a small set of strong candidates either for being directly causal or the best tag of an ungenotyped causal variant.
Imputation seems not to have been previously used to combine studies for HLA-based association analyses. Imputation in the HLA region, particularly if the objective is to impute classical HLA alleles, can be challenging because the region is under strong selection (Leslie et al., 2008; Nothnagel et al., 2009). Imputation of SNP genotypes as performed here is a more achievable goal, and the presence of high LD in the HLA region makes this a realistic possibility for studies conducted in similar populations. In our combined dataset all subjects were of European ancestry, recruited in the UK using the same case inclusion criteria. In order to ensure high-quality imputed data, approximately 3500 SNPs were discarded at imputation QC, leaving 6180 SNPs available for analysis. This minimises the risk of a spurious association being identified at a poorly-imputed SNP, but discarding SNPs may also reduce our precision in tagging causal variants. However, the 6180 HLA SNPs retained for analysis provide the potential for high-precision localisation of variants across almost the entire HLA, with a mean density of 1.25 SNPs per kb, while 95% of inter-SNP intervals are <3 kb and 99% are <7 kb. All our association analyses assumed an additive model of association at each SNP, and in this case uncertainty in imputation can be taken into account by using expected allele count as the SNP genotype regression variable.
Previous applications of penalised regression approaches to fine mapping within a region have focussed on analysing genotyped rather than imputed SNPs. Given that penalised methods are especially designed to deal with strong collinearity between predictors, we do not anticipate any particular problem with applying them to combinations of genotyped and imputed SNPs. Although the predictors corresponding to imputed SNPs are by definition functions of the predictors corresponding to genotyped SNPs, they will not necessarily correspond to simple linear functions of the observed SNP genotypes and, in any case, including them as individual predictors allows the model itself to pick out those effects that best capture the association pattern in the region.
We conclude that the HLASSO logistic regression approach, which generalises the BBR algorithm, shows substantial promise for disentangling complex patterns of association in regions of high LD. Hoggart et al. (2008) found in simulations that the number of SNPs included in the model selected by HLASSO accurately reflected the number of distinct associations, and hence we suggest that three may be a good estimate of the number of detectable HLA associations with RA other than DRB1. There are >10^10 three-SNP subsets of 6180 SNPs, and the bootstrap results illustrate that we are far from certain that our HLASSO-selected three-SNP model is the best possible in terms of tagging the true causal variants, but the role of DPB1 has been confirmed since we developed our analysis and future efforts will elucidate whether the other SNPs also closely tag true causal variants.
Special thanks are due to the participants in our study. The authors would also like to thank Mike Binks (GlaxoSmithKline UK), Marion Dickson (GlaxoSmithKline UK), Doug Montgomery (GlaxoSmithKline UK) and Gerry Wilson (University of Sheffield) for their support, as well as Clive Hoggart (Imperial College London) for help with implementation of the HyperLASSO.