## Introduction

Rheumatoid arthritis (RA) is a complex autoimmune disorder that affects approximately 1% of the population worldwide (Silman & Pearson, 2002). The cause of RA has not been established, but both environmental and genetic factors appear to play important roles (Firestein, 2003; Oliver & Silman, 2006). The *HLA-DRB1* gene within the class II Human Leucocyte Antigen (HLA) region of the genome is strongly associated with RA susceptibility. A group of *HLA-DRB1* alleles, known collectively as the shared epitope (SE), encode a similar amino acid sequence located in the peptide-binding groove of the protein (Gregersen et al., 1987). In our previous investigation of the Genetics of Rheumatoid Arthritis (GoRA) study (Vignal et al., 2009), we found that adding HLA SNPs in a backward stepwise regression led to a significantly better model than one based only on the number of SE+ alleles. There has also been considerable debate about whether a binary SE+/SE- classification is sufficient to describe the RA risk for all *DRB1* alleles.

There are many strong HLA associations outside *DRB1* (Vignal et al., 2009; Jawaheer et al., 2002; Kilding et al., 2004; Newton et al., 2003; Ota et al., 2001; Singal et al., 1999; Zanelli et al., 2001), but extensive linkage disequilibrium (LD) in the HLA region has hindered progress in fine-mapping the causal variants underlying these signals. Multi-SNP analyses can help to disentangle the effects of LD and identify a set of distinct contributors to disease risk. Forward and backward stepwise regressions are traditional model selection methods, but they are unsatisfactory in the presence of many correlated predictors because of their computational demands and the instability of the selected model. Penalised regression offers a better way to address both over-fitting and multiple correlated predictors (Ayers & Cordell, 2010). The penalty term imposed on the likelihood shrinks the maximum likelihood estimates towards zero, which reduces over-fitting. Moreover, with an appropriate penalty term, penalised regression can reduce the problem of correlated predictors by, in effect, selecting one or a few SNPs that best represent an association signal shared across a set of SNPs in high LD.
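The two effects described above, shrinkage and selection among correlated predictors, can be illustrated with a toy example. The sketch below is not the method or software used in this study: it assumes an L1 penalty, uses linear rather than logistic regression for simplicity, and runs on made-up data in which two predictors are identical (an extreme stand-in for perfect LD). The function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
def soft_threshold(z, lam):
    """Closed-form minimiser of 0.5*(b - z)**2 + lam*abs(b) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimise (1/(2n))*sum((y - X b)**2) + lam*sum(|b_j|) by cyclic updates.

    X is a list of n rows, each a list of p predictor values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Toy data (hypothetical): two identical predictors, y = 2 * x exactly.
X = [[-1, -1], [1, 1], [-1, -1], [1, 1]]
y = [-2, 2, -2, 2]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)  # [1.5, 0.0]
```

The retained coefficient is shrunk from the unpenalised value of 2 to 1.5, and the duplicate predictor is dropped entirely. With exactly identical predictors the choice of which one is retained depends only on the update order; with real SNPs in imperfect LD, the data would favour the SNP that best captures the signal.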

If the penalty term can be interpreted as a probability density, penalised regression is equivalent to maximum *a posteriori* (MAP) estimation of the regression coefficients using this prior density. Although it is then convenient to use the Bayesian language of prior and posterior, for computational reasons we do not derive the full Bayesian posterior distribution for the regression coefficients, but only the MAP estimates.
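This equivalence can be written out explicitly. The notation here is introduced only for illustration and is an assumption: $\ell(\boldsymbol\beta)$ denotes the log-likelihood, $f$ the prior density placed independently on each of the $p$ regression coefficients $\beta_j$.

$$
\hat{\boldsymbol\beta}_{\mathrm{MAP}}
= \arg\max_{\boldsymbol\beta}\Big\{ \ell(\boldsymbol\beta) + \sum_{j=1}^{p} \log f(\beta_j) \Big\}
= \arg\min_{\boldsymbol\beta}\Big\{ -\ell(\boldsymbol\beta) + \sum_{j=1}^{p} \big(-\log f(\beta_j)\big) \Big\}
$$

Under these assumptions, the penalty added to the negative log-likelihood is the negative log prior density, $-\log f(\beta_j)$, summed over coefficients.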

A commonly used prior for penalised regression is the double exponential (DE). In linear regression, use of the DE prior corresponds to the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm (Tibshirani, 1996). For the logistic regression equivalent, we used the Bayesian Binary Regression (BBR) software (Genkin et al., 2007). If the DE scale parameter is large, corresponding to a prior belief that the most likely effect sizes are close to zero, the resulting MAP estimates are often exactly zero, leading to sparse models. We also considered a generalisation of the DE prior called the Normal-Exponential-Gamma (NEG), implemented in the HyperLASSO (HLASSO) algorithm (Hoggart et al., 2008). The NEG has both shape and scale parameters, and smaller shape values correspond to increasing steepness of the density curve near zero but flatter tails, which is expected to lead to sparse sets of nearly-uncorrelated SNPs.
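The link between the DE prior and the LASSO penalty can be made concrete. The parametrisation below is an assumption chosen for illustration, with a single parameter $\xi > 0$ for which larger values concentrate more prior mass near zero; the software cited above may parametrise the density differently. Here $\ell(\boldsymbol\beta)$ denotes the log-likelihood and $p$ the number of coefficients.

$$
f(\beta_j) = \frac{\xi}{2}\exp\big(-\xi\,|\beta_j|\big)
\quad\Longrightarrow\quad
-\log f(\beta_j) = \xi\,|\beta_j| - \log\frac{\xi}{2}
$$

$$
\hat{\boldsymbol\beta}_{\mathrm{MAP}} = \arg\min_{\boldsymbol\beta}\Big\{ -\ell(\boldsymbol\beta) + \xi \sum_{j=1}^{p} |\beta_j| \Big\}
$$

The constant $\log(\xi/2)$ does not depend on $\boldsymbol\beta$ and drops out of the optimisation, leaving an L1 penalty. Because $|\beta_j|$ is not differentiable at zero, the minimiser can sit exactly at $\beta_j = 0$, which is the source of the sparsity.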

Genotype imputation using MACH (Li et al., 2006; Li et al., 2009) and the 60 CEU HapMap parents was employed to combine the GoRA study with the Wellcome Trust Case Control Consortium (WTCCC) controls and RA cases (The Wellcome Trust Case Control Consortium, 2007) (Fig. 1A). After quality control (QC), the combined dataset included 2679 RA cases and 3896 controls with genotypes at 6180 SNPs spanning the HLA region. Only 5% of SNPs were genotyped in both WTCCC and GoRA, while 56% of SNPs were not genotyped in either study but were entirely imputed from the HapMap (Fig. 1B). The GoRA and HapMap subjects, and about half of the WTCCC controls, had four-digit *HLA-DRB1* alleles available; SE status and *DRB1*\*0101 (one of the SE alleles) were imputed in the remaining individuals.

We employed penalised regression on the combined dataset to identify a sparse set of uncorrelated SNPs that explains most of the non-*DRB1* association with RA. Bootstrapping was implemented to investigate the robustness of the models derived, and the North American Rheumatoid Arthritis Consortium (NARAC) (Plenge et al., 2007) study was used to validate the findings.