Abstract
- Top of page
- Abstract
- INTRODUCTION
- METHODS
- RESULTS
- DISCUSSION
- REFERENCES
Penalized regression methods offer an attractive alternative to single marker testing in genetic association analysis. Penalized regression methods shrink down to zero the coefficient of markers that have little apparent effect on the trait of interest, resulting in a parsimonious subset of what we hope are true pertinent predictors. Here we explore the performance of penalization in selecting SNPs as predictors in genetic association studies. The strength of the penalty can be chosen either to select a good predictive model (via methods such as computationally expensive cross validation), through maximum likelihood-based model selection criterion (such as the BIC), or to select a model that controls for type I error, as done here. We have investigated the performance of several penalized logistic regression approaches, simulating data under a variety of disease locus effect size and linkage disequilibrium patterns. We compared several penalties, including the elastic net, ridge, Lasso, MCP and the normal-exponential-γ shrinkage prior implemented in the hyperlasso software, to standard single locus analysis and simple forward stepwise regression. We examined how markers enter the model as penalties and P-value thresholds are varied, and report the sensitivity and specificity of each of the methods. Results show that penalized methods outperform single marker analysis, with the main difference being that penalized methods allow the simultaneous inclusion of a number of markers, and generally do not allow correlated variables to enter the model, producing a sparse model in which most of the identified explanatory markers are accounted for. Genet. Epidemiol. 34:879–891, 2010. © 2010 Wiley-Liss, Inc.
INTRODUCTION
- Top of page
- Abstract
- INTRODUCTION
- METHODS
- RESULTS
- DISCUSSION
- REFERENCES
Regression methods are commonly used in statistical analysis, and the recent move to single marker testing in genetic association studies [WTCCC, 2007] has been out of necessity on account of the large number of predictor variables (upwards of 500,000 genetic markers) to be examined. Analyzing markers together in a regression model allows one to consider the impact of markers on other markers; a weak effect may become more visible when other causal effects are already accounted for, and a false signal maybe removed by the inclusion of a stronger signal from a true causal association. However, when the number of markers is larger than the number of test subjects or when variables are highly correlated, standard regression methods become overwhelmed. Penalized regression methods offer an attractive alternative. These methods operate by shrinking the size of the coefficients, pushing the coefficients of markers with little or no apparent effect on a trait down toward zero, reducing the effective degrees of freedom and in many cases performing model selection. Some penalization methods simply reduce the magnitude of the regression coefficients, while others coerce them to be zero. In genetic association analysis, we expect only a few markers to have a real effect on our trait (i.e. to be genuinely causal, or in linkage disequilibrium (LD) with a causal variant). Thus, through use of penalization, we can find the subset of markers most associated with the disease. One potential problem with penalization approaches is that a variable typically enters the model only if it significantly improves prediction. Thus, a variable with a strong marginal effect can be overlooked if other variables explain the effect. However, arguably, one would hope that the selection procedure would select the variables that do indeed best explain the data. Ideally, one strives to mimic the true underlying model, penalizing and thus eliminating non-causal loci, while leaving true causal loci unpenalized. A good penalty should result in minimally biased estimators, a sparse model and continuity to avoid instability in model prediction.
As an illustrative example, see Figure 1. This figure shows association test results for simulated data in a region of high LD containing multiple causal loci. The trend test gives small P-values at many loci correlated with a causal locus, making it difficult to localize the causal locus. Penalized methods are more particular, selecting only one or several variables per causal locus.
In the genetics literature, use of these kinds of approach is just starting to emerge. Ridge regression has been used for distinguishing between causative and noncausative variants for quantitative traits [Malo et al., 2008]. For binary traits, the normal exponential-γ (NEG) distribution [Hoggart et al., 2008], elastic net [Cho et al., 2009] and group lasso [Croiseau and Cordell, 2009] have been applied in order to identify important individual single nucleotide polymorphisms (SNPs), while penalized logistic [Park and Casella, 2008] and least angle [Zhang et al., 2008] regression have been used for identifying gene-gene interactions.
Unfortunately, all penalization methods require specification of a penalization parameter (often referred to as the tuning or regularization parameter), and the parameter value yielding the optimal model is data driven. The choice of penalty parameter controls the number of variables selected: the larger the penalty, the smaller the selected subset. The value of the penalization parameter must be chosen, e.g. through cross validation, to avoid selection of a sub-optimal value. An additional problem with penalization approaches is that there are no efficient ways of obtaining a confidence interval or P-value for a coefficient. This must be done through a procedure such as bootstrapping, and still does not reflect a true P-value in the usual sense due to the complex selection procedure for the reduced model. Wu et al. [2009] have suggested calculating a leave-one-out index for each coefficient (based on likelihood ratio tests, leaving one variable out of the final model at a time), as a useful tool for comparing variables in the model. Another downfall of penalization is that many penalties not only shrink small coefficients but tend to overshrink what should be large coefficients, adversely affecting the value of the model in prediction. However, if the primary interest lies in locating possible disease genes, prediction per se is not of great concern.
Given the current interest in these types of method, the objective of this study is to evaluate the performance of several software packages for performing penalized logistic regression, using computer simulations. We focus on two different problems: first the issue of detection of a relevant SNP or region, and second the issue of differentiation/localization i.e. fine mapping within a region and distinguishing those variants that best explain the association and are thus most likely to be causal, or be in strongest LD with the causal variants. These two separate questions have not always been clearly distinguished in the literature—for example, Malo et al. [2008] compared ridge regression with simple and multiple linear regression using receiver operating characteristic (ROC) curves, without acknowledging that the hypotheses being tested by the different methods differed. (The ROC curves for simple linear regression counted detection of a locus in LD with the true functional variant as a type 1 error—as would be correct if one were using linear regression to try to differentiate between causal and noncausal variants—but in fact simple linear regression would generally be used for detection of effects, not for differentiation of effects, and so such an observation should actually have counted towards the power.)
We simulate data under a variety of disease locus effect size and linkage disequilibrium patterns. We compare several penalties, including the elastic net, the Lasso, a pseudo-ridge, the minimax concave penalty (MCP) and the NEG shrinkage prior to standard single locus analysis (the Armitage Trend Test) and simple forward stepwise regression. Software packages were selected for ease of use and to cover a wide range of penalty functions. We explore how markers enter a model as penalties or P-values are varied, and report the sensitivity and specificity for each of the methods.
METHODS
- Top of page
- Abstract
- INTRODUCTION
- METHODS
- RESULTS
- DISCUSSION
- REFERENCES
Binary traits such as case/control status are generally analyzed using logistic regression. Given a dichotomous phenotype vector Y of n observations, and a matrix of SNP genotypes X, let p = P(Y = 1∥X = x). The likelihood is:
where
or equivalently,
and β is our vector of coefficients.
In genetics, the advent of large-scale genotyping has lead to underdetermined problems where the number of markers is much larger than the number of individuals. In this case, standard logistic regression cannot produce a unique interpretable model. Penalized likelihood methods maximize the loglikelihood subject to a penalty which is dependent on the magnitude of the estimated parameters. A penalty on the likelihood will penalize models which have a large number of large regression coefficients, and thus will be optimized with a sparser model. In genetics, we have typically have many variables, but suspect that there are only a few underlying causal variants. An ideal penalty would quickly weed out variables with little effect, with only the most relevant variables remaining in the model. Use of a selection criterion, such as the Bayesian Information Criterion (BIC), is a form of regularization with a penalty that relies only on the number of coefficients, but not on their magnitude.
The most well-known penalty, L1 penalty, was introduced in the form of the least absolute shrinkage and selection operator or Lasso constraint by Tibshirani in 1996. The selection process in the Lasso is based on constructing continuous trajectories of regression coefficients as functions of the penalty level, which results in a more stable solution than subset selection methods [Efron et al., 2004]. To solve the Lasso regression maximization problem, methods such as least angle regression (LARS) [Efron et al., 2004], which can compute the entire piecewise linear path of the Lasso estimates for all values of the penalty parameters, or cyclic coordinate descent [Friedman et al., 2007; Wu and Lange, 2008] have been used. Lasso logistic regression is implemented in R packages such as penalized [Goeman, 2009], glmnet [Friedman et al., 2010], grplasso [Meier et al., 2008] and in other packages such as Bayesian binary regression (BBR) [Genkin et al., 2007] and the genetics software package Mendel [Wu et al., 2009]. The coefficient estimates from the Lasso procedure can also be interpreted in a Bayesian framework as posterior mode estimates using a Laplace (double exponential) prior on the coefficients [Tibshirani, 1996]. This observation has provided motivation for fully Bayesian approaches such as the Bayesian Lasso [Park and Hastie, 2008].
The Lasso will encourage sparsity, setting most small coefficients to zero, due to the penalty function's sharp peak at zero. However, given a sufficiently large penalty parameter, the Lasso will also impose heavy shrinkage on large coefficients due to the absence of tails (constant rate of penalization), leading to biased coefficient estimates (see Fig. 2). A similar issue of bias is seen for other commonly used penalty functions such as the ridge (L2 penalty) [Hoerl and Kennard, 1970; Le Cessie and van Houwelingen, 1992], the elastic net (a mix of L1 and L2 penalties) [Zou and Hastie, 2005] and convex bridge regression (Lq penalty for 1<q<2). Many recent methods strive to relieve some of this bias by introducing penalties with flatter tails, so that large coefficients are only minimally shrunk (see Fig. 2). These include nonconvex bridge regression (Lq penalty for 0<q<1), the NEG prior implemented in hyperlasso [Hoggart et al., 2008] and thresholding penalties such as the smoothly clipped absolute deviation penalty (SCAD) [Fan and Li, 2001] and the MCP [Breheny and Huang, 2008; Zhang, 2007]. For a review of some of these methods and their properties, see Hesterberg et al. [2008].
In penalized likelihood inference, our objective function may be written as:
where the penalty f is a function of the regression coefficients (and possibly a mixing parameter). The amount of shrinkage is directly controlled by the derivative of the penalty function, and many different penalty functions have been proposed. For example, the elastic net criterion is:
where
and
are the L1 and L2 norm, respectively. Here, λ controls the strength of the penalty while α is a mixing parameter that determines the strength of the L1 versus the L2 norm. The elastic net criterion above is reduced to the Lasso if we let α = 1. If α = 0, we have the L2 penalty used in ridge regression. The Lasso has been shown to be consistent for model selection under certain conditions given that the correlation between relevant (true) and irrelevant predictors is not too large [Zhao and Yu, 2006], or in other words, the selected model will contain the true model with high probability. However, the Lasso can select at most n (the number of observations) nonzero parameters, and cannot have a unique solution if any two variables are completely collinear. The elastic net is a stabilized version of the Lasso; it encourages groups of correlated variables to enter the model together, and thus a small change in the data will not have a large effect on the model. In our investigation, we use the glmnet software by Zou and Hastie [2005], to implement the Lasso, the elastic net with α = 0.4, and approximate ridge regression with α = 0.05. (We use approximate ridge regression because standard ridge regression only shrinks coefficients and does not set them to zero [Tibshirani, 1996], thus it does not automatically perform subset selection and one must compute a score for each coefficient for model selection.)
The MCP [Breheny and Huang, 2008; Zhang, 2007] is a nonconvex penalty that applies the same rate of penalization as the Lasso when the coefficients are near zero. The rate of penalization is the derivative or slope of the penalty function. As a coefficient moves away from zero, the rate of penalization is continuously relaxed until a defined threshold where the rate of penalization drops to zero. All coefficient values above this threshold contribute equally to the total penalty, so that very large coefficients do not increase the penalty too much, leading to less biased estimates of the large coefficients. The MCP is implemented in the R software package grpreg, and for our analyses we set a = 30, the default (where a is a tuning parameter related to the threshold at which the rate of penalization drops).
The method implemented in hyperlasso [Hoggart et al., 2008] is a Bayesian-inspired penalized maximum likelihood approach using a NEG prior. The NEG prior is a continuous prior distribution with a sharp mode at zero which has the effect of shrinking the regression coefficients heavily when they are near zero. The penalty function is derived by taking the logarithm of the prior, yielding a penalty that is a function of the coefficients squared and the logarithm of a parabolic cylinder function of the absolute value of the coefficients. For our analyses we set the shape parameter to 0.1.
All penalized regression methods require input of one or more values for the penalization parameter(s). From here on we will refer to this parameter as λ. The parameter value must be selected (e.g. through cross validation) to avoid selection of a sub-optimal parameter value. In general, a very large value of λ will only allow a small number of variables to enter the model. If we choose the value too small, the number of variables in the model may be too large, and our coefficients become less reliable because of their high variances; we approach standard regression with most of the variables included the model and overfitting occurs. The best choice of λ is data dependent and may vary, for example, from chromosome to chromosome, or window to window, within the same data set.
Due to the difficulties in finding an optimal value for λ, we found it most insightful to look at how variables enter a model as the penalization parameter is relaxed, or in nonpenalized methods as the P-value threshold for declaring significance is relaxed. Penalized methods tend to select only one or few variables belonging to a group of correlated variables, resulting in a sparse model. The question remains: are they missing anything? In our investigation we vary the value of λ to allow approximately only one additional variable enter the model at each step. We record the number of true positives and false positives at each value of λ for each method, along with the maximum LD between a false signal and a causal locus and the maximum LD between a missed causal locus and a signal.
We used computer simulation to compare five penalized methods (Lasso, elastic net (EN), ridge, MCP and NEG) to the Armitage trend test (ATT) and simple forward stepwise regression (FSTEP). All methods were run under the assumption of an additive allelic disease model. For the ATT, we look at how variables become significant as we increase the P-value threshold for declaring significance. In FSTEP, at each step the variable with the smallest P-value is added to the model, until no more variables reach the required threshold. We compare methods first with respect to detection of effects, in which detection of an allele in LD (r2>0.05) with a true causal variant counts as a success (and any other detection counts as a false positive), and second with respect to localization/differentiation, in which we only count detection of the true causal locus itself as a success.
SIMULATION STUDY 1: DETECTION OF EFFECTS
Following the approach of Hoggart et al. [2008], we used the program FREGENE [Chadeau-Hyam et al., 2008; Hoggart et al., 2007] to simulate a population of 10,000 individuals with a sequence length of 20 Mb using recombination and mutation rates representative of a human population. We then thinned the data down to 4,000 SNPs with frequencies >0.01, retaining approximately one SNP every 5 kb. We then used the program SAMPLE [Chadeau-Hyam et al., 2008; Hoggart et al., 2007] to simulate 500 replicate data sets, requesting 1,000 cases and 1,000 controls generated from a model with six causal loci of varying allele frequencies of approximately 15, 5 and 2%, and risk ratios ranging from 1.4 to 3.0.
Genotype variables were used as predictor variables in each of the methods. Both glmnet and grpreg standardize the genotypes to have mean zero and variance 1 by default, and thus for comparability we also standardized the genotypes for the NEG and FSTEP. As the causal loci are not necessarily in the thinned SNP set, and our markers have low enough density that LD between them is fairly small, we report a true positive detection when a SNP of r2>0.05 with a causal variant is selected. A False-positive occurred when a SNP of r2≤0.05 was selected; however, a false positive was only recorded if it was more than 20 kb from a previously detected false positive, as was done by Hoggart et al. [2008]. This allows us to count a set of signals from a group of highly correlated variables as a single signal to prevent inflated false-positive rates in methods such as the ATT and the EN.
SIMULATION STUDY 2: FINE MAPPING AND DIFFERENTIATION/LOCALIZATION OF EFFECTS
We also performed simulations based on three known gene regions of varying LD patterns (see Fig. 3): CYP2D6, CFTR and CTLA4 containing 110, 190 and 228 SNPs, respectively. Although it would be possible to do standard logistic regression in this case (owing to the smaller number of markers compared to our previous simulation), the fact that the genotype variables are highly correlated leads to estimates with large variances.
With the software HAPGEN [Marchini et al., 2007], we simulated 500 replicate data sets for each of the three regions. We used the 120 haplotypes from the HapMap [The International Hapmap Consortium, 2003] CEU population as the basis of our simulations. In each replicate, a population of 20,000 individuals was created using the HapMap CEU haplotypes, and 1,000 cases and 1,000 controls were sampled from that population. For each region we considered six different underlying genetic model scenarios. We picked five “common” causal loci of varying but high minor allele frequency (see Table I) within each gene region. In scenarios 1–5, each locus in turn was used as the single (only) disease-causing variant, whereas in scenario 6, all five loci were assumed to act together multiplicatively to increase disease risk. The relative heterozygote and homozygote risk ratios in each scenario were set at 1.3 and 1.7, and the penetrance was chosen to give a population prevalence of approximately 10–13%. To compare the situation of common alleles having small effects with that of rare alleles having strong effects, the simulation was repeated using smaller minor allele frequencies, and relative risks chosen to give approximately 80% power for the detection of the individual SNP (see Table I).
Table I. Generating allele frequencies of the five causal loci| | Allele frequencies |
|---|
| | RR of 1.3 and 1.7 | RR to give 80% power |
|---|
| | Gene | Gene |
|---|
| Locus | CYP2D6 | CFTR | CTLA4 | CYP2D6 | CFTR | CTLA4 |
|---|
|
| 1 | 0.325 | 0.050 | 0.375 | 0.100 | 0.050 | 0.030 |
| 2 | 0.050 | 0.175 | 0.450 | 0.008 | 0.080 | 0.008 |
| 3 | 0.192 | 0.475 | 0.183 | 0.025 | 0.025 | 0.050 |
| 4 | 0.242 | 0.242 | 0.275 | 0.043 | 0.008 | 0.025 |
| 5 | 0.450 | 0.367 | 0.050 | 0.017 | 0.042 | 0.017 |
The simulated data were analyzed using the same five penalized methods as in Simulation Study 1. Since in this second experiment we were interested in the performance of the methods with respect to differentiation of SNPs, in common with Malo et al. [2008] we counted detection as selection of the true causal locus and considered all other detections a false positive. However, in a similar manner to Simulation 1, we considered a false signal in very close proximity (10 kb) to a previously reported false positive as a single false-positive signal.
DISCUSSION
- Top of page
- Abstract
- INTRODUCTION
- METHODS
- RESULTS
- DISCUSSION
- REFERENCES
In this study we have examined the performance of a variety of different penalized regression approaches and compared their performance with respect to (a) detection and (b) distinguishing of true from false causal variants in genetic association studies. Although the performance of some of the individual methods we considered has previously been examined, to our knowledge there has been no comprehensive comparison between methods and between such methods and the simpler approaches that are often used in genetic association studies (such as the Armitage trend test and forward stepwise regression). Moreover, some previous studies [Malo et al., 2008] have been plagued by confusion over whether the methods were being assessed with respect to (a) or (b).
Overall, we found broadly similar performance between the different penalization methods, with NEG giving the overall best and the ATT the overall worst performance. Although larger parameter estimates are always more heavily penalized, methods that apply larger relative penalties on small parameter estimates and relatively lower penalties to larger estimates seem to perform better and more accurately estimate the effect size of the selected SNPs. These penalties prevent variables that enter the model early from leaving the model later on, and exclude the entry of additional variables whose effects are solely due to correlation with already included variables. As we relax the error rate, penalized methods outperform the ATT in detection, and may be useful for further exploration of causal variants.
Although forward stepwise regression is conservative for very stringent P-values, for more relaxed P-values it generally performed slightly worse than the penalized methods with a higher false-positive rate and lower sensitivity, although it did perform well with respect to fine mapping in the simulations where only a single causal variant existed in a region. Forward stepwise regression has the disadvantage that once a variable enters the model it cannot correct itself by removing it if other variables are a better fit; it is known to be greedy and unstable [Breiman, 1996]. However, a forward/backward selection procedure might perform better, although penalization methods would still be expected to be less greedy. Penalized methods have previously been shown to give superior performance over stepwise elimination/addition algorithms that often lead to local rather than globally optimal solutions [Breiman, 1995]. However, in spite of their theoretical limitations [Hastie et al., 2001], stepwise regression approaches [Cordell and Clayton, 2002] have frequently been used to differentiate between potentially causal and noncausal variants in genetic association studies [Barratt et al., 2004; Plenge et al., 2007; Scott et al., 2007; Ueda et al., 2003] and appear to work rather well in practice [Charoen et al., 2007].
As well as giving the overall best performance in our study, the NEG has the advantage of being genuinely applicable to genome-wide data comprising many thousands of predictor variables, unlike most of the other penalized approaches we considered, which suffer from limitations with respect to the number of markers that can be considered simultaneously [Croiseau and Cordell, 2009]. Given the recent success of single-marker approaches in detecting effects in genome-wide association studies [Easton et al., 2007; Fellay et al., 2007; Frayling et al., 2007; Todd et al., 2007; WTCCC, 2007; Zeggini et al., 2008, 2007], in practice it is hard to imagine not undertaking a first-pass single-locus analysis. However, the superior performance of the NEG with respect to detection as well as with respect to differentiation/localization of effects suggests that, for most genome-wide studies, a follow-up analysis using the NEG or similar (to generate a sparser model in which the most important explanatory markers are accounted for) would be a worthwhile undertaking.