Keywords:

  • AUC;
  • Cerebral infarction;
  • Conditional logistic regression;
  • Elastic net;
  • Lasso;
  • Penalized likelihood;
  • ROC analysis

Summary

  1. Top of page
  2. Summary
  3. 1. Introduction
  4. 2. Notation, Models and Sampling
  5. 3. Two-Stage Procedure for Variable Selection and Prediction
  6. 4. Simultaneous Variable Selection and Prediction Procedure
  7. 5. Simulation Studies
  8. 6. Application to MGH HAP in Stroke Study
  9. 7. Discussion
  10. 8. Supplementary Materials
  11. Acknowledgments
  12. References
  13. Supporting Information

Matched case-control designs are commonly used in epidemiologic studies for increased efficiency. These designs have recently been introduced to the setting of modern imaging and genomic studies, which are characterized by high-dimensional covariates. However, appropriate statistical analyses that adjust for the matching have not been widely adopted. A matched case-control study of 430 acute ischemic stroke patients was conducted at Massachusetts General Hospital (MGH) in order to identify specific brain regions of acute infarction that are associated with hospital acquired pneumonia (HAP) in these patients. There are 138 brain regions in which infarction was measured, which introduce nearly 10,000 two-way interactions, and challenge the statistical analysis. We investigate penalized conditional and unconditional logistic regression approaches to this variable selection problem that properly differentiate between selection of main effects and of interactions, and that acknowledge the matching. This neuroimaging study was nested within a larger prospective study of HAP in 1915 stroke patients at MGH, which recorded clinical variables, but did not include neuroimaging. We demonstrate how the larger study, in conjunction with the nested, matched study, affords us the capability to derive a score for prediction of HAP in future stroke patients based on imaging and clinical features. We evaluate the proposed methods in simulation studies and we apply them to the MGH HAP study.

1. Introduction

Hospital acquired pneumonia (HAP) is a major complication after stroke, and is associated with higher mortality, larger neurological deficits, longer hospitalization and increased costs for medical care (Davenport et al., 1996; Katzan et al., 1998). A study was conducted in a cohort of 1915 acute ischemic stroke patients admitted to Massachusetts General Hospital (MGH) between June 2004 and March 2008, for which the outcome of interest (HAP), demographic information and clinical variables were obtained for all participants. The prevalence of HAP among this cohort was 12%. The goal of the study was twofold: (1) to identify specific brain regions of acute infarction that are associated with HAP in acute ischemic stroke patients and (2) to derive a prediction rule based on imaging and clinical variables for early assessment of the risk of HAP in new patients. However, as it would have been prohibitively expensive and time-consuming to conduct neuroimaging studies for the entire cohort, a nested matched case-control study design was conducted. In this substudy of acute stroke patients, 215 with HAP were matched with 215 without HAP, based on gender, the National Institutes of Health stroke scale (NIHSS) and age. Neuroimaging analysis was undertaken for these matched pairs. The resultant 138 imaging variables, along with their nearly 10,000 two-way interactions, which are important to consider given spatial correlations in the brain, present a challenging high-dimensional variable selection problem. Another challenge is how to optimally leverage the information in the prospective cohort to use alongside that in the nested study, for development of a powerful prediction rule for future patients.

Many high-dimensional variable selection methods within the context of likelihood-based estimation have been developed, including penalized regression approaches, such as lasso (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005); an overview is provided by Fan and Lv (2010). Additional issues arise when interaction terms are also of interest, including substantially increased dimensionality and desired differential weighting of main effects and interaction effects. Wu et al. (2009) developed a lasso penalized logistic regression model with a cyclic coordinate ascent algorithm to select both main and interaction predictors in the context of genome-wide association studies. Wu et al. (2010) proposed an alternative lasso penalization procedure called “screen and clean” for the purpose of identifying and then “cleaning” interactions in case-control genome-wide association studies. Neither of these proposals addressed matched study designs.

Matched case-control designs are commonly used in epidemiologic studies for increased efficiency by enforcing similar distributions of confounding variables for cases and controls (Rothman, Greenland, and Lash, 2008). These designs have recently been introduced to imaging and genomic studies, which are characterized by high-dimensional covariates. However, with few exceptions, statistical analyses that undertake high-dimensional feature selection do not appropriately adjust for the matching (e.g., Davatzikos et al., 2008; Duchesne et al., 2009). We identified only three articles that do acknowledge the underlying matched designs. Tan, Thomassen, and Kruse (2007) proposed a two-stage variable selection procedure for matched case-control microarray data, where in the first stage, important features are identified using a modified paired t-test statistic, and in the second stage, a support vector machine classifier is built using these selected features. The second stage classification ignores the matched design, which may lead to biased results (Breslow and Day, 1980). Adewale, Dinu, and Yasui (2010) developed two variants of boosting for classification of matched-pair or correlated binary responses with high-dimensional predictors. One is based on the generic functional gradient descent boosting algorithm and employs a loss function that handles correlation in binary data. The other is a likelihood-based boosting via generalized linear mixed models that handles the correlation via a random effect. This approach entails strong modeling assumptions. Balasubramanian et al. () proposed a random forest penalized conditional logistic regression algorithm, which adjusts for the matched case-control design through the conditional likelihood and incorporates certain attractive features of the random forest machinery to assess variable importance in a high dimensional feature space.

When potential risk factors are expensive to measure, case-control studies are often nested within larger cohort studies and the measurements are taken only from the subjects in the nested studies. This design is cost-effective and efficient (Lin and Ying, 1993; Woodward, 2005), and is widely used in large epidemiologic cohort studies such as the Nurses’ Health Study (Tworoger et al., 2011). Such designs are even more important when the variables of interest are very high dimensional and expensive imaging or genetic measurements, as in the MGH HAP study.

With nested, matched case-control data, we also seek to derive a prediction rule for the outcome (e.g., HAP) in future patients. It is well known that if a logistic regression model holds for a prospective sample, then it holds as well for a retrospective case-control sample, albeit with a different log-odds parameter (Mantel, 1973; Prentice and Pyke, 1979). The difference in the log-odds parameters between the retrospective and prospective models is a function of the sampling probabilities of the cases and controls. Thus, if the sampling probabilities are known, a risk prediction model for future subjects can be constructed simply by fitting a logistic regression model to the retrospective case-control sample and adjusting the intercept. With matched case-control data, the adjustment must be conditional on the matching variables. The articles that propose techniques for high dimensional variable selection based on matched retrospective studies do not address the problem of prediction.
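The intercept adjustment described above can be illustrated with a minimal sketch for the unmatched case, assuming the sampling probabilities depend only on case status. The function names and the numeric values in the usage note are hypothetical and are not taken from the study.

```python
import math


def prospective_intercept(retro_intercept, p_sample_case, p_sample_control):
    """Shift an intercept fitted to case-control data back to the population scale.

    Under case-control sampling the retrospective log-odds equal the
    prospective log-odds plus log(p_sample_case / p_sample_control),
    so the correction subtracts that term. Slopes are left unchanged.
    """
    if not (0.0 < p_sample_case <= 1.0 and 0.0 < p_sample_control <= 1.0):
        raise ValueError("sampling probabilities must lie in (0, 1]")
    return retro_intercept - math.log(p_sample_case / p_sample_control)


def risk(intercept, slopes, x):
    """Logistic risk for one subject with covariate vector x."""
    eta = intercept + sum(b * v for b, v in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-eta))
```

For instance, if every case but only one control in ten were sampled, `prospective_intercept(a, 1.0, 0.1)` lowers the fitted intercept `a` by log(10). With matching, the same correction would have to be made separately at each value of the matching variables.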

In this article, we propose and assess penalized variable selection strategies to select important main effects and two-way interaction effects generated by the imaging data in the MGH HAP matched case-control study. Furthermore, we extend the results of Mantel (1973) and Prentice and Pyke (1979) that elucidated the link between the prospective and retrospective logistic regression models to our setting of the nested, matched case-control study. In Section 2, we specify population-level and sampling models for HAP, and describe two procedures for our dual goals of variable selection and prediction. In Section 3, we develop a two-stage procedure for variable selection based on a penalized conditional likelihood for the nested, matched case-control study. We then derive a prediction rule for future subjects based on the selected imaging variables and the matching variables by leveraging the larger study from which the case-control study was sampled. In Section 4, we develop a simultaneous procedure for variable selection and prospective modeling. We report the results of simulation studies in Section 5, and we apply the proposed methods to the MGH HAP study in Section 6. We end with a discussion in Section 7.

2. Notation, Models and Sampling

Let Y be a binary response variable (Y = 1 or 0 for presence or absence of HAP), inline image the vector of predictors, inline image the vector of two-way interactions of the predictors in inline image, and inline image the vector of matching variables. We assume that a logistic regression model holds at the population level,

  • display math(1)

where inline image, inline image, inline image. In high dimensional settings, it is likely that several of the predictors are not associated with the response, and thus several components of inline image equal 0. In addition, among the set of inline image with nonzero coefficients, only a subset of their two-way interactions have nonzero coefficients.

As in the MGH HAP study, N subjects are prospectively enrolled into the study, and the response variable, Y, and strongly associated clinical variables, inline image, are recorded for each subject. Due to limited resources, and with the goal of efficiency, the predictors of interest, inline image, are not measured for all N subjects. Instead, a nested, matched case-control study is conducted. In particular, n pairs of subjects from this cohort are sampled, such that each pair contains one case (inline image) and one control (inline image) that are matched on the important clinical variables, inline image. The observed data for this substudy are inline image, with the index inline image indicating case and inline image indicating control. The standard conditional likelihood approach for analysis of these data enables estimation of inline image, but not of inline image or inline image, which would be needed for prediction of HAP in future patients.

Let inline image indicate that a subject is sampled into the matched case-control study and inline image otherwise. We assume that given inline image, the sampling probability does not depend on inline image, that is, inline image, which is usually satisfied in practice. Under the one-to-m matched case-control design, in which inline image, it follows that:

  • display math

Therefore, it follows from (1) that the probability of HAP among those stroke patients sampled for the case-control study is

  • display math(2)
  • display math

Note that inline image is estimable from the prospective study from which the matched case-control study was sampled; we provide details in Section 6.1. Also, inline image is not of use for the prediction of HAP in future stroke patients. Rather, it enables us to estimate inline image and inline image using the nested matched study, and these parameters are necessary for prediction for future patients.
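The reasoning that leads from the population model (1) to the sampling model (2) can be sketched with the standard case-control argument. The notation below is an assumption made for illustration, since the displayed equations are not reproduced in this text: X denotes the predictors, X^{(2)} their two-way interactions, Z the matching variables, S the sampling indicator, and the coefficient names are placeholders. The first line restates a logistic model of the form (1); the second uses the assumption that sampling depends only on (Y, Z); the third uses the fact that a one-to-m matched sample contains m controls per case at each value of Z. The published displays may be arranged differently. The block assumes the amsmath package.

```latex
% Assumed notation; coefficient names are illustrative placeholders.
\begin{align*}
\operatorname{logit} P(Y=1 \mid X, Z)
  &= \beta_0 + \beta_X^{\top} X + \beta_{XX}^{\top} X^{(2)} + \beta_Z^{\top} Z, \\[4pt]
\operatorname{logit} P(Y=1 \mid X, Z, S=1)
  &= \operatorname{logit} P(Y=1 \mid X, Z)
     + \log \frac{P(S=1 \mid Y=1, Z)}{P(S=1 \mid Y=0, Z)}, \\[4pt]
\frac{P(S=1 \mid Y=0, Z)\, P(Y=0 \mid Z)}{P(S=1 \mid Y=1, Z)\, P(Y=1 \mid Z)} = m
  \;&\Longrightarrow\;
  \log \frac{P(S=1 \mid Y=1, Z)}{P(S=1 \mid Y=0, Z)}
  = -\operatorname{logit} P(Y=1 \mid Z) - \log m .
\end{align*}
```

Under this sketch the only quantity needed from outside the matched sample is P(Y=1 | Z), which is consistent with the statement that it can be estimated from the prospective cohort.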

This sampling model (2) suggests two approaches that will achieve the dual goals of variable selection and prediction. A two-stage approach entails estimation of inline image using the conditional likelihood for the matched pairs, and then use of inline image as an offset in an unconditional logistic regression model for estimation of inline image and inline image. The second approach does not utilize conditional logistic regression, but rather fits an unconditional logistic regression model for the matched case-control data, using inline image as an offset. This enables simultaneous estimation of inline image, inline image, and inline image. The advantage of the first approach is that estimation of inline image is not compromised by variability in inline image or by instability induced by simultaneous estimation of inline image and inline image. The advantage of the second approach is that it is a single stage approach. Without loss of generality, we assume one-to-one matching in the remainder of this article.

3. Two-Stage Procedure for Variable Selection and Prediction

For the two-stage approach, we first conduct variable selection, and estimate inline image using the conditional likelihood appropriate for the matched case-control study, based on (1):

  • display math(3)
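For one-to-one matched pairs, the conditional log-likelihood has a well-known closed form in which the intercept and the matching-variable terms cancel, leaving only case-minus-control differences in the predictors. The following is a minimal sketch of that standard form; the function and argument names are hypothetical, and it is not claimed to reproduce the exact display in (3).

```python
import math


def _log1p_exp(t):
    """Numerically stable log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))


def conditional_loglik(beta, case_rows, control_rows):
    """Conditional log-likelihood for 1:1 matched case-control pairs.

    case_rows[i] and control_rows[i] hold the predictor values (main effects
    and any interaction columns) of the case and control in pair i.
    Each pair contributes log[exp(eta_case) / (exp(eta_case) + exp(eta_ctrl))].
    """
    total = 0.0
    for x_case, x_ctrl in zip(case_rows, control_rows):
        diff = sum(b * (u - v) for b, u, v in zip(beta, x_case, x_ctrl))
        total -= _log1p_exp(-diff)
    return total
```

With all coefficients equal to zero, every pair contributes log(1/2), which gives a convenient reference value when checking an implementation.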

3.1. Variable Selection

Stepwise selection is commonly used for variable selection. However, it is well known to overfit the data. Logic regression (Ruczinski, Kooperberg, and Leblanc, 2003) was designed for selection of binary predictors, though it can be extended to continuous ones through construction of categorical variables with associated indicators. It finds optimal Boolean combinations of predictors within the context of any likelihood function, but has the downside of being computationally expensive, even when a relatively efficient simulated annealing algorithm is adopted. Thus, we focus on penalized likelihood methods. We compare the performance of our proposed methods to stepwise and logic regression for the MGH HAP study in Section 6.

We define the penalized log likelihood as

  • display math

where inline image is the log conditional likelihood given in (3), inline image is a penalty function for the main effects and inline image is a penalty function for the two-way interaction effects. These functions are indexed by tuning parameters inline image. The lasso penalty defines inline image, and the ridge defines it as inline image. The lasso penalty serves to shrink the regression coefficients toward zero; the larger the value of inline image, the greater the amount of shrinkage. As inline image increases, the lasso regression will set some of the coefficients to be exactly zero and obtain a subset of covariates with nonzero regression coefficients. The ridge penalty serves to shrink the regression coefficients toward zero and toward each other by imposing a penalty on their size, but does not reduce the number of covariates with nonzero coefficients. The elastic net procedure combines the lasso and the ridge: it implements variable selection, like the lasso, and also shrinks together the coefficients of strongly correlated predictors, like the ridge. Thus, strongly correlated predictors tend to be in or out of the selected model together. This is potentially advantageous over the lasso when some true predictors are highly correlated, as are some of the imaging variables, and it is of scientific interest to identify them all. When variable selection is an important aim, both lasso and elastic net penalization are of interest. In studies with predefined groupings of variables that are known to operate jointly on the outcome, the group lasso (Meier, van de Geer, and Bühlmann, 2008) could be used.
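The qualitative behavior of the three penalties can be made concrete with a small sketch. The scaling used here (no factor of one half on the ridge term) is an assumption, since conventions differ between software packages, and the function names are hypothetical. The soft-thresholding function is the lasso solution for a single coefficient in the simplest one-dimensional problem, and shows why the lasso produces exact zeros while the ridge does not.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: lam times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)


def ridge_penalty(beta, lam):
    """L2 penalty: lam times the sum of squared coefficients."""
    return lam * sum(b * b for b in beta)


def elastic_net_penalty(beta, lam1, lam2):
    """Elastic net: an L1 term plus an L2 term, each with its own weight."""
    return lasso_penalty(beta, lam1) + ridge_penalty(beta, lam2)


def soft_threshold(z, lam):
    """Minimizer over b of 0.5 * (b - z)**2 + lam * |b|.

    Values with |z| <= lam are set exactly to zero; larger values are
    moved toward zero by lam.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer over b of 0.5 * (b - z)**2 + lam * b**2: shrinks, never zeroes."""
    return z / (1.0 + 2.0 * lam)
```

For example, `soft_threshold(0.3, 0.5)` is exactly zero, whereas `ridge_shrink(0.3, 0.5)` is 0.15: smaller, but still in the model.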

Due to the large number of imaging variables (138), the much larger number of two-way interactions between them, and our desire to weight the main effects and interaction effects differentially, we consider three different penalization strategies, termed “Pen1”, “Pen2” and “Pen3” when based on the lasso, and “EN1”, “EN2” and “EN3” when based on elastic net:

  • Pen1/EN1:
    This strategy does not include any interactions. We fit penalized conditional logistic regression models with main effects, inline image, only.
  • Pen2/EN2:
    This strategy considers interactions of pre-selected main effects. First, we fit univariate conditional logistic regression models with each of the predictors, inline image, and select the predictors whose p-values are less than a pre-selected threshold. Second, we fit conditional multiple logistic regression models with a penalty on the two-way interactions of the selected main predictors.
  • Pen3/EN3:
    This is a variation on Pen2/EN2 that involves joint selection of main effects using Pen1/EN1, followed by penalized selection of interactions of selected main effects.
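The reduction in dimensionality achieved by Pen2/EN2 and Pen3/EN3 comes from forming interactions only among main effects that survive the first step. A minimal sketch of that bookkeeping follows; the function name and the example of 20 retained main effects are hypothetical.

```python
from itertools import combinations
from math import comb


def interaction_columns(rows, selected):
    """Build two-way interaction columns among selected main effects.

    rows     : list of predictor vectors, one per subject
    selected : indices of the main effects retained by the first step
    Returns the list of index pairs and, for each subject, the products
    of the corresponding predictors.
    """
    pairs = list(combinations(sorted(selected), 2))
    products = [[row[j] * row[k] for j, k in pairs] for row in rows]
    return pairs, products


# All pairs among 138 regions versus pairs among, say, 20 retained regions.
all_pairs = comb(138, 2)
reduced_pairs = comb(20, 2)
```

Here `all_pairs` is 9453, matching the "nearly 10,000" interactions cited for the imaging data, while `reduced_pairs` is 190.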

We also considered simultaneous selection of main effects and interactions, with two different penalty parameters to allow for differential weighting of the two types of terms, but encountered numerical instabilities in this approach that made it infeasible.

3.2. Cross-Validation

We employ ten-fold cross-validation to select the optimal penalty parameters, and to evaluate the performance of the proposed variable selection strategies (using the corresponding optimal penalty parameters). We randomly split the original n pairs of case-control observations into 10 groups. Using 9 of the 10 groups as the training dataset, we implement each of the three proposed procedures. We then apply the final fitted regression models to the group that was omitted from the training set and calculate the (unpenalized) conditional log likelihood for that subset of pairs. We repeat this procedure ten times so that each group of pairs serves as the validation set once. The sum of all of the conditional log likelihood contributions from the validation groups is the cross-validation score. The procedure with the highest score at its optimal penalty parameter has the best performance. Variable selection is completed by maximizing the penalized likelihood at the optimal penalty parameter for the entire dataset.
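The cross-validation scheme can be sketched as follows, with whole pairs (never individual subjects) assigned to folds. To keep the example self-contained, the fitting step is a simple proximal-gradient lasso on case-minus-control differences; this is a stand-in chosen for illustration, not the algorithm used by the “penalized” package, and the step size, iteration count, and per-pair scaling of the penalty are assumptions.

```python
import math
import random


def _sigmoid(t):
    """Stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def cond_loglik(beta, diffs):
    """Unpenalized conditional log-likelihood; diffs are case-minus-control rows."""
    total = 0.0
    for row in diffs:
        eta = sum(b * d for b, d in zip(beta, row))
        total += math.log(_sigmoid(eta))
    return total


def fit_lasso(diffs, lam, steps=500, lr=0.05):
    """Proximal-gradient fit of (1/n) * loglik - lam * sum|beta| (illustrative)."""
    n, p = len(diffs), len(diffs[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for row in diffs:
            w = 1.0 - _sigmoid(sum(b * d for b, d in zip(beta, row)))
            for j in range(p):
                grad[j] += w * row[j]
        for j in range(p):
            z = beta[j] + lr * grad[j] / n
            # Soft-thresholding step: produces exact zeros.
            beta[j] = math.copysign(max(abs(z) - lr * lam, 0.0), z)
    return beta


def cv_score(diffs, lam, k=10, seed=0):
    """Sum of held-out conditional log-likelihoods over k folds of pairs."""
    idx = list(range(len(diffs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    score = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [diffs[i] for i in idx if i not in held]
        beta = fit_lasso(train, lam)
        score += cond_loglik(beta, [diffs[i] for i in held_out])
    return score
```

A penalty parameter would then be chosen as the grid value with the largest score, for example `max(grid, key=lambda lam: cv_score(diffs, lam))`, and the model refit on all pairs at that value.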

3.3. Estimation of Population Parameters for Prediction

We estimate inline image using the entire cohort (see Section 6.1) and then, inserting inline image and inline image as offsets, we fit the unconditional logistic regression model (2) to the matched case-control data. This provides us with estimates of the population parameters inline image. Prediction for future patients will be based on the linear score function, inline image.
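A rough sketch of how the second-stage offset and the final prediction might be assembled is given below. It assumes that the sampling correction for a one-to-m matched design takes the form of minus the logit of the cohort-based estimate of P(Y = 1 | Z), minus log m, which follows from the standard case-control argument but is not quoted from the displays in this article. All names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def stage_two_offset(selected_score, p_case_given_z, m=1):
    """Offset for the unconditional fit to the matched sample.

    selected_score : stage-one contribution of the selected predictors
                     (estimated coefficients times predictor values)
    p_case_given_z : cohort-based estimate of P(Y = 1 | Z) for this subject
    m              : number of controls matched to each case
    """
    return selected_score - logit(p_case_given_z) - math.log(m)


def predicted_risk(beta0, beta_z, z, selected_score):
    """Population-scale risk for a new patient; no sampling correction is applied."""
    eta = beta0 + sum(b * v for b, v in zip(beta_z, z)) + selected_score
    return 1.0 / (1.0 + math.exp(-eta))
```

The design point is that the sampling correction enters only when fitting to the matched sample; once the intercept and matching-variable coefficients are estimated, prediction for a future patient uses the population-scale score alone.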

4. Simultaneous Variable Selection and Prediction Procedure

As seen in the derived regression model for the sampled data given in (2), if we have available an external estimate for inline image, as we do in the case of a nested study, we can directly fit an unconditional logistic regression model to the nested matched case-control data for simultaneous estimation of inline image, inline image, and inline image using the unconditional pseudo-likelihood:

  • display math(4)

We apply the variable selection strategies outlined in Section 3.1 to this unconditional pseudo-likelihood. We apply the penalization with cross-validation to inline image only, and not to inline image. Penalized pseudo-likelihood approaches have been used successfully for variable selection in other contexts, as well (e.g., Cai et al., 2005); we evaluate the effect of plugging in inline image in simulations.

5. Simulation Studies

We conducted three sets of simulation studies to compare the performances of our three variable selection strategies with the lasso and elastic net penalties, and to compare the two-stage and simultaneous procedures. We also compared the performance of the proposed methods with stepwise regression. We used the R package “penalized”.

In all simulation scenarios we first generated a prospective sample from a population logistic model, and then selected the nested matched case-control study from it. In the first set of simulations, we included a moderate number of predictors (50) and explored various association structures among the non-null predictors and the null predictors and different effect sizes of regression coefficients. Our main interest was to investigate the impact of these associations and effect sizes on the performance of the various variable selection procedures and on their subsequent predictive performances. In the second set of simulations, we mimicked the MGH HAP study in numbers of predictors, associated block correlation structure, and matching variables. In the third set of simulations, the number of main predictor candidates is larger than in Setup II, and the predictors are continuous rather than binary, which illustrates the applicability of our method to other settings. In Web Appendix D, we illustrate minimal impact of use of the pseudo-likelihood (4) on prediction performance for the simultaneous procedure under simulation Setup II.

5.1. Simulation I: Moderate Number of Predictors

In this simulation, we considered 50 Bernoulli predictors, of which five (inline image, inline image, inline image, inline image, and inline image) are associated with the outcome, Y. Within this simulation, we evaluated four scenarios that illustrated the impact of different association structures between inline image, inline image and Z, as well as different association structures between inline image and inline image. The matching variable Z follows a uniform distribution on integers inline image. The population model is as in (1). In scenarios (i) and (ii), inline image, inline image, inline image, inline image, inline image and inline image for all other covariates. We further assumed three non-null interactions, inline image, inline image and inline image, with coefficients, inline image, inline image = inline image, and set the regression coefficient of Z to be inline image. In scenario (iii), inline image, inline image, inline image, inline image and inline image for all other covariates, and we further included the same three non-null interactions with coefficients inline image, inline image, and set inline image. In scenario (iv), inline image, inline image, inline image, inline image, inline image and inline image for all other covariates, and the three non-null interactions with coefficients inline image, inline image = inline image, and inline image. Based on these models, we generated a prospective sample of 2500 subjects, and then selected a nested, matched case-control study, consisting of 250 pairs. We provide additional details in Web Appendix A.
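The two-step data generation (a prospective cohort first, then a nested matched sample) can be sketched as below. The coefficient values, the Bernoulli probability, and the range of the matching variable in the test values are hypothetical placeholders, and the predictors are generated independently of Z for simplicity, whereas the scenarios above vary that association.

```python
import math
import random


def simulate_cohort(n, beta0, beta_x, beta_z, p_x, z_levels, rng):
    """Draw n subjects from a population logistic model (illustrative values)."""
    cohort = []
    for _ in range(n):
        z = rng.choice(z_levels)
        x = [1 if rng.random() < p_x else 0 for _ in beta_x]
        eta = beta0 + beta_z * z + sum(b * v for b, v in zip(beta_x, x))
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
        cohort.append({"y": y, "z": z, "x": x})
    return cohort


def sample_matched_pairs(cohort, n_pairs, rng):
    """Pair randomly chosen cases with unused controls sharing the same z."""
    cases = [s for s in cohort if s["y"] == 1]
    rng.shuffle(cases)
    controls_by_z = {}
    for s in cohort:
        if s["y"] == 0:
            controls_by_z.setdefault(s["z"], []).append(s)
    for pool in controls_by_z.values():
        rng.shuffle(pool)
    pairs = []
    for case in cases:
        pool = controls_by_z.get(case["z"], [])
        if pool:
            pairs.append((case, pool.pop()))  # each control is used at most once
        if len(pairs) == n_pairs:
            break
    return pairs
```

A call such as `sample_matched_pairs(simulate_cohort(2500, ...), 250, rng)` mirrors the cohort and substudy sizes used in this simulation.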

We applied the proposed procedures to each of 500 generated nested case-control studies. Table 1 summarizes the variable selection and prediction results for both procedures. The AUC was calculated by applying the estimated prediction rule from the nested study to an independent validation data set of sample size 2000, which followed the same distribution as the cohort data from which the matched case-control data were sampled. We also list the “true” AUC, which we obtained by fitting the true population logistic regression model to the independent validation data set. It represents the optimal AUC and serves as a benchmark. It is notable that the best of the three procedures with regard to variable selection, as measured by the cross validated log likelihood, almost always corresponds to the best with regard to the ultimate prediction performance, as measured by AUC. The AUC values for the two-stage and simultaneous procedures are comparable.
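The AUC on an independent validation set can be computed directly from its interpretation as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. The sketch below uses that definition; the function name is hypothetical, and the quadratic loop is adequate for a validation set of a few thousand subjects.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    scores : predicted scores or risks, one per subject
    labels : 1 for cases, 0 for non-cases
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Because the AUC depends only on the ranking of the scores, it is unchanged by the intercept adjustment that moves a fitted model from the retrospective to the prospective scale.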

Table 1. Summary of variable selection and prediction results for simulation setup I
            “True” AUC  Stepwise  Pen1      Pen2      Pen3      EN1       EN2       EN3

  Two-stage procedure
  Scenario (i)
    CV-LK               −186.91   −165.83   −169.52   −183.27   −166.08   −196.49   −183.29
    AUC     0.658       0.598     0.616     0.610     0.6051    0.616     0.607     0.605
  Scenario (ii)
    CV-LK               −178.53   −155.21   −159.25   −170.11   −155.20   −159.37   −170.13
    AUC     0.688       0.633     0.662     0.654     0.645     0.660     0.654     0.644
  Scenario (iii)
    CV-LK               −187.86   −170.08   −162.75   −180.50   −170.03   −162.68   −180.10
    AUC     0.727       0.592     0.611     0.639     0.624     0.611     0.639     0.624
  Scenario (iv)
    CV-LK               −155.16   −104.35   −111.24   −136.87   −104.88   −111.28   −137.00
    AUC     0.854       0.796     0.825     0.819     0.800     0.825     0.819     0.800

  Simultaneous procedure
  Scenario (i)
    CV-LK               −352.43   −336.91   −342.76   −351.71   −355.03   −347.45   −351.87
    AUC     0.658       0.602     0.617     0.611     0.603     0.616     0.608     0.602
  Scenario (ii)
    CV-LK               −343.55   −327.52   −331.87   −340.74   −349.82   −337.63   −340.91
    AUC     0.688       0.639     0.663     0.656     0.644     0.662     0.653     0.644
  Scenario (iii)
    CV-LK               −353.84   −338.77   −335.55   −344.09   −338.80   −334.83   −344.10
    AUC     0.727       0.597     0.612     0.640     0.631     0.612     0.641     0.631
  Scenario (iv)
    CV-LK               −282.21   −268.14   −271.43   −283.92   −268.46   −271.45   −283.93
    AUC     0.854       0.812     0.828     0.823     0.810     0.828     0.823     0.810

Note: CV-LK, cross-validated conditional log-likelihood (for two-stage procedure) or cross-validated log-likelihood (for simultaneous procedure); AUC, area under the ROC curve for independent validation dataset.

The four scenarios in Table 1 have “true” AUCs ranging from 0.66 to 0.85. Scenarios (i) and (ii) have the same model for Y, but different association structures among the covariates. Scenario (iii) has the strongest interaction effects for the outcome model, and Scenario (iv) has the strongest main effects with moderate interaction effects. For both the two-stage and simultaneous procedures, the Pen1 strategy, which selects main effects only, is best for Scenarios (i), (ii), and (iv), which have relatively weak interaction effects. In contrast, the EN2 strategy, which selects some interaction terms, is best for Scenario (iii), which has the strongest interaction effects. In summary, the performance of the selection procedure is robust to association structure and is influenced by the relative magnitudes of the main versus interaction effects.

5.2. Simulation II: Large Number of Predictors Mimicking Stroke Imaging Study

In this simulation, we mimicked the MGH HAP study with 100 correlated predictors, and three matching variables with marginal distributions that correspond to those of age, gender, and NIHSS from the MGH study, and with correlation between gender and NIHSS, also as in the study. For simplicity, we assumed age to be independent of gender and NIHSS.

We modeled age (inline image) as multinomial on the integers [18,100], gender (inline image) as Bernoulli, and NIHSS (inline image) as multinomial on the integers [0,36], all with probabilities as in the MGH study. We took the inline image to have marginal Bernoulli distributions, with p ranging from 0.14 to 0.36, and we assumed a blockwise correlation structure for inline image. We set inline image, inline image, inline image, inline image, and inline image for all other predictors. The nonzero regression coefficients for the interactions are inline image, inline image, inline image, inline image, inline image, inline image, inline image, inline image, inline image, inline image. The regression coefficients for the matching variables are inline image, inline image and inline image. Under this parameter configuration, the prevalence rate for HAP in the population is about 12%, as in the MGH study. We generated cohort studies of 2500 subjects, and from them, we retrospectively selected matched case-control studies of 250 pairs of subjects. We provide additional details in Web Appendix B.

The penalization method Pen2 exhibits the best performance in variable selection and prediction (Table 2). We have included averaged 95% bootstrap confidence intervals for the AUCs from the independent samples for the simultaneous procedure under simulation Setup II. We did not include these for all simulations due to the computational demand of conducting a bootstrap within a simulation. The intervals overlap substantially, which suggests that the AUCs from the different strategies are not significantly different from each other. However, the slightly larger AUC for Pen2 (0.727) versus Pen1 (0.693) provides informal justification for use of the cross-validated log likelihood as a metric for choosing a variable selection strategy, as it does seem to give a boost, albeit small, in AUC in an independent, prospective sample. In applications, it would be worth additionally considering the difference in the sets of variables selected by each procedure and their associated costs; it may be that Pen2 does not add any cost over Pen1, for example, but rather simply includes some interaction terms from among main effects that are already included in Pen1.
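A percentile bootstrap interval for the AUC in an independent sample can be sketched as follows. This is a generic percentile bootstrap written for illustration; the resampling details used for the intervals in Table 2 are not described here, so the choices below (resampling subjects with replacement, discarding resamples that lack a case or a non-case) are assumptions, and the function names are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the proportion of case/non-case pairs ranked correctly (ties = 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=3000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one case and one non-case")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to compute an AUC
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * n_boot)]
    hi = stats[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lo, hi
```

Averaging such intervals over simulation replicates, as done for Table 2, multiplies the cost of each replicate by the number of bootstrap samples, which is why the intervals were reported for one setup only.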

Figure 1 depicts the frequency of selection based on the two-stage implementation of procedure Pen2 for each of the 100 main effect predictors and 12 non-null interactions, across the 500 replications. The tick marks on the x-axis in Figure 1a indicate the true main effects. It is apparent that the variables that are selected with highest frequency are the true, non-null, main effects. Two variables, inline image and inline image, which are not true main effects but are selected with high frequency, are correlated with true main effects inline image. The frequency of selection of the non-null interaction effects depends on their magnitude, the magnitude of their associated main effects and the correlation of the associated main effects with other main effects. These factors lead to relatively infrequent selection of inline image. Among the remaining 4938 null interactions, 64.9% are never selected across the 500 replication studies, 18.7% are selected once, 5.7% are selected three times, and the remaining 10.7% are selected four or more times. The results for the simultaneous procedure are similar.

Figure 1. Frequency with which each variable is selected in Setup II, based on Pen2 of the two-stage procedure: (a) the 100 main effects, with tick marks on the x-axis indicating the 10 true main effects, which are inline image; (b) the 12 true interaction effects.

Table 3 lists additional summary statistics for the variable selection performance from this simulation. Consistent with the results of Table 2, Pen2 and EN2 outperform the other strategies in true and false positive rates for selection of main effects and interactions. The moderate true positive rates (34–61%), in conjunction with the near-optimal AUC of 72–73%, suggest that there is redundancy among the variables with regard to prediction, and that the variable selection strategies are optimizing prediction at the expense of variable selection. Interestingly, the elastic net does not perform any better in variable selection than the lasso, even in the presence of some correlated variables.

Table 2. Summary of variable selection and prediction results for simulation Setups II and III

          “True” AUC  Stepwise        Pen1            Pen2            Pen3            EN1             EN2             EN3

Simulation Setup II
Two-stage procedure
CV-LK                 −785.39         −149.02         −145.20         −157.26         −151.82         −147.25         −158.67
AUC       0.826       0.632           0.692           0.718           0.683           0.692           0.718           0.684
Simultaneous procedure
CV-LK                 −378.95         −325.85         −317.16         −343.80         −325.97         −319.11         −338.19
AUC       0.826       0.654           0.693           0.727           0.696           0.693           0.727           0.698
95% CI                (0.617, 0.690)  (0.656, 0.729)  (0.692, 0.760)  (0.660, 0.731)  (0.656, 0.729)  (0.692, 0.760)  (0.662, 0.733)

Simulation Setup III
Two-stage procedure
CV-LK                 −508.80         −249.46         −228.61         −233.27         −249.46         −228.59         −233.26
AUC       0.815       0.622           0.676           0.719           0.708           0.676           0.719           0.708
Simultaneous procedure
CV-LK                 −595.04         −527.20         −502.45         −506.77         −527.20         −501.74         −506.74
AUC       0.815       0.638           0.677           0.723           0.714           0.677           0.723           0.714

Note: CV-LK, cross-validated conditional log-likelihood (for two-stage procedure) or cross-validated log-likelihood (for simultaneous procedure); AUC, area under the ROC curve for independent validation dataset. For simultaneous procedure under Setup II, we also give the 95% averaged bootstrap percentile confidence interval for AUC based on 3000 bootstrap samples.

Finally, we assessed the accuracy of estimation of inline image in the context of the simultaneous procedure. We investigated the impact of window size on the estimation of inline image by choosing window sizes that lead to at least 5, or 10, or 15 subjects in a window. We also examined the impact of this estimation on the independent sample AUC. Detailed results are summarized in Web Appendix D.

Table 3. Simulation Setup II: Summary of variable selection

Methods                           Stepwise  Pen1      Pen2      Pen3      EN1       EN2       EN3

Two-stage procedure
Main effects
Average # of effects found        29.63     19.17     12.27     19.48     19.18     12.27     19.78
Average # of true effects found   5.78      5.71      6.10      5.77      5.71      6.10      5.76
TPR                               0.578     0.571     0.610     0.577     0.571     0.610     0.576
FPR                               0.265     0.150     0.069     0.152     0.150     0.069     0.156
Interaction effects
Average # of effects found        -         -         14.09     9.85      -         14.09     9.80
Average # of true effects found   -         -         4.10      2.23      -         4.10      2.20
TPR                               -         -         0.342     0.186     -         0.342     0.183
FPR                               -         -         0.002     0.002     -         0.002     0.002

Simultaneous procedure
Main effects
Average # of effects found        24.29     20.23     12.27     20.23     20.24     12.27     20.27
Average # of true effects found   5.95      5.93      6.10      5.93      5.93      6.10      5.93
TPR                               0.595     0.593     0.610     0.593     0.593     0.610     0.593
FPR                               0.204     0.159     0.069     0.159     0.159     0.069     0.159
Interaction effects
Average # of effects found        -         -         17.09     18.93     -         17.12     18.94
Average # of true effects found   -         -         4.60      3.40      -         4.60      3.40
TPR                               -         -         0.384     0.284     -         0.384     0.283
FPR                               -         -         0.003     0.003     -         0.003     0.003

Note: TPR, true positive rate; FPR, false positive rate. The total number of main effects is 100, whereas the number of true main effects is 10. The total number of two-way interaction effects is 4950, whereas the number of true interactions is 12.
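
The rates in Table 3 appear to follow from the averaged counts and the totals given in the table note. As a small check, assuming the standard definitions (TPR is the number of true effects found divided by the number of true effects; FPR is the number of null effects found divided by the number of null effects), the sketch below reproduces the two-stage Pen2 entries. The function name and argument names are illustrative only.

```python
def selection_rates(n_selected, n_true_selected, n_true, n_candidates):
    """Return (TPR, FPR) for a variable selection result."""
    tpr = n_true_selected / n_true
    # Null effects selected, divided by the number of null candidates.
    fpr = (n_selected - n_true_selected) / (n_candidates - n_true)
    return tpr, fpr

# Two-stage procedure, Pen2, main effects: 100 candidates, 10 true effects.
main_tpr, main_fpr = selection_rates(12.27, 6.10, n_true=10, n_candidates=100)

# Two-stage procedure, Pen2, interactions: 4950 candidates, 12 true effects.
int_tpr, int_fpr = selection_rates(14.09, 4.10, n_true=12, n_candidates=4950)
```

Rounded to three decimals, these give 0.610 and 0.069 for main effects and 0.342 and 0.002 for interactions, matching the table. Because the denominators are fixed, applying the formulas to averaged counts is equivalent to averaging the per-replication rates.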

5.3. Simulation III: Large Number of Continuous Predictors

In this simulation, the number of main predictor candidates is larger than that of Setup II, and they are continuous rather than dichotomous. In these regards, Setup III demonstrates the applicability of our method to common biomarker study settings, such as a matched case-control study conducted by the High Risk Plaque Initiative to discover proteomic and metabolic biomarkers for cardiovascular events (Balasubramanian et al., in press).

We considered 150 main predictor candidates, of which 10 are non-null, 9 non-null interactions and 2 matching variables. A cohort of 4000 subjects was generated, and a nested, matched case-control study consisting of 400 pairs was selected. The prevalence rate is 25%. Matching variable inline image is multinomial on the integers [1,10], and inline image has Bernoulli distribution with success probability associated with inline image. We took the inline image to have marginal standard normal distributions, and we assumed a blockwise correlation structure for inline image. Variables associated with Y include inline image to inline image, which are correlated with each other, and inline image to inline image, which are independent of each other. Additional details are found in Web Appendix C.
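
To make the phrase "blockwise correlation structure" concrete, the following sketch draws predictors with standard normal margins that are correlated within blocks and independent across blocks. The exchangeable within-block correlation, the block size, and the value of rho are assumptions made for illustration; the structure actually used is specified in Web Appendix C and may differ.

```python
import numpy as np

def blockwise_normal(n, p, block_size, rho, seed=0):
    """Draw n rows of p standard normal predictors with blockwise correlation.

    Assumed structure: correlation rho within each block of `block_size`
    consecutive predictors, zero correlation between blocks.
    """
    rng = np.random.default_rng(seed)
    cov = np.eye(p)
    for start in range(0, p, block_size):
        stop = min(start + block_size, p)
        cov[start:stop, start:stop] = rho
    np.fill_diagonal(cov, 1.0)                 # unit variances: standard normal margins
    chol = np.linalg.cholesky(cov)             # requires 0 <= rho < 1
    x = rng.standard_normal((n, p)) @ chol.T   # rows have covariance `cov`
    return x, cov

# Cohort of 4000 subjects and 150 candidate predictors, as in Setup III;
# block_size=10 and rho=0.5 are placeholder values.
x, cov = blockwise_normal(4000, 150, block_size=10, rho=0.5)
```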

The lower section of Table 2 summarizes the results for the two-stage and simultaneous procedures. The strategy that penalizes interaction effects of pre-selected main effects (Pen2/EN2) exhibits the best performance in variable selection and prediction. In additional simulations (not shown) we varied the cohort size and nested case-control study size and EN2 and Pen2 maintained their optimal performance. As expected, the AUC's increased relative to the true values with increasing sample size.

6. Application to MGH HAP in Stroke Study

There is inconsistent evidence about the potential association between lesion location and post-stroke hospital acquired pneumonia (HAP) (Hilker et al., 2003; Upadya et al., 2004). The MGH HAP study was undertaken to identify specific brain regions of acute infarction that are linked to HAP in acute ischemic stroke patients and to formulate a prediction rule that could be used for early evaluation of the risk of HAP in newly admitted stroke patients.

6.1. Variable Selection

The potential predictors include both clinical and neuroimaging variables. Clinical variables include dyslipidemia, smoking history, coronary artery disease, diabetes mellitus, atrial fibrillation and hypertension. Raw images based on either non-contrast computed tomography (CT) scans or diffusion weighted magnetic resonance imaging (MRI-DWI) were obtained soon after onset of stroke symptoms. After co-registering raw images using specialized software, the infarct lesion maps were subsegmented into 69 pairs of mirrored cortical and subcortical regions based on “Harvard-Oxford cortical structural” and “JHU DTI-based white-matter” atlases. The regional percentage of infarcted tissue for each patient was then determined for each of the 138 standardized regions. We dichotomized the percentage of infarction at each brain region using its median. We categorized infarct volume according to the tertiles of its distribution. To stabilize the numerical calculations, we excluded the 16 brain regions for which fewer than 5% of cases or 5% of controls displayed positive infarction.

We compared the performance of the proposed lasso and elastic net based strategies with stepwise regression and logic regression. The elastic net is potentially advantageous over the lasso when some true predictors are highly correlated, as are some of the imaging variables, and it is of scientific interest to identify them all.

For variable selection using the simultaneous procedure, we need to estimate the offset inline image defined in Section 2. This estimate is also necessary to estimate the prediction model based on the two-stage procedure. We have two choices for estimation of this quantity: we may estimate it directly, or we may decompose it according to Bayes’ theorem and estimate its components separately. We illustrate the latter approach, as it is preferable when external estimates are available for any of the components, or if any of the components may be particularly well estimated using the prospective parent study. By Bayes’ theorem, inline image. The prevalence rate of HAP in acute stroke patients, inline image, is estimated to be 12.2%. Estimation of inline image is based on the available disease status and matching variable information for the 1851 patients, from among the 1915 in the cohort study, who are not missing any of these variables. Among the 1851 patients, age ranges from 11 years to 103 years, and NIHSS ranges from 0 to 36. To estimate inline image empirically, we create a nearest neighbor window for the ith observed subject in the cohort data as inline image, inline image and count the number of subjects with and without HAP falling within the window. Details of the algorithm are provided in Web Appendix D. This nonparametric approach to estimation of inline image may suffer from the “curse of dimensionality” when inline image is high dimensional. In this situation, we recommend direct estimation of inline image, that is, via inline image, but based on a model, such as logistic regression.
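
The window-count step can be illustrated with a short sketch. It is not the algorithm of Web Appendix D: the rectangular window, the rule of widening it in fixed steps until it holds a minimum number of subjects (echoing the window sizes of at least 5, 10, or 15 subjects examined in the simulations), and all names below are assumptions made for illustration.

```python
import numpy as np

def window_case_fraction(s, y, i, min_subjects=10, step=(1.0, 1.0), max_iter=1000):
    """Count cases and controls in a window around subject i's matching variables.

    s: (n, k) array of matching variables (e.g., age and NIHSS); y: (n,) 0/1 outcome.
    The half-widths grow as multiples of `step` until at least `min_subjects`
    subjects fall inside the window (hypothetical growth rule).
    Returns (fraction of cases, number of cases, number of controls).
    """
    s = np.asarray(s, dtype=float)
    y = np.asarray(y)
    step = np.asarray(step, dtype=float)
    for m in range(max_iter):
        half_width = m * step
        inside = np.all(np.abs(s - s[i]) <= half_width, axis=1)
        if inside.sum() >= min_subjects:
            break
    n_case = int(y[inside].sum())
    n_control = int(inside.sum()) - n_case
    return n_case / (n_case + n_control), n_case, n_control

# Toy cohort: columns are age and NIHSS; y is the HAP indicator.
s_toy = [[60, 5], [60, 5], [61, 5], [70, 20]]
y_toy = [1, 0, 0, 1]
frac, n_case, n_control = window_case_fraction(s_toy, y_toy, i=0, min_subjects=3)
```

In the toy example the window around the first subject widens once and then contains one case and two controls. With more than a few matching variables such windows become sparse, which is the dimensionality concern raised above.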

Table 4 summarizes 10-fold cross-validation results for the different variable selection methods, under both procedures. Stepwise conditional logistic regression has considerably worse performance than other methods. Pen3 exhibits the highest cross-validated log-likelihood among Pen1, Pen2 and Pen3. We additionally evaluated Pen3 with the elastic net penalty function replacing the lasso. We considered a sequence of different inline image penalties (inline image) along with the internal cross-validation procedure for the inline image component of the elastic net. We selected the minimum inline image that yielded a cross-validated log-likelihood that was within 3% of the maximum, to satisfy the accompanying goal of a parsimonious model. Among all applications of the elastic net, the use of the elastic net penalty function on the main effects and the lasso penalty on the interactions performed the best.
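
The selection rule for the elastic net tuning value can be sketched as below. The reading of "within 3% of the maximum" as a tolerance relative to the magnitude of the best (negative) cross-validated log-likelihood is an assumption, as are the function name and the toy numbers; the source does not spell out these details.

```python
def smallest_penalty_within_tolerance(penalties, cv_loglik, tol=0.03):
    """Return the smallest penalty whose CV log-likelihood is within `tol`
    (relative to the magnitude of the maximum) of the best value."""
    best = max(cv_loglik)
    threshold = best - tol * abs(best)
    eligible = [p for p, ll in zip(penalties, cv_loglik) if ll >= threshold]
    return min(eligible)

# Hypothetical grid of penalty values and cross-validated log-likelihoods.
grid = [0.01, 0.1, 1.0]
cv_ll = [-110.0, -102.0, -100.0]
chosen = smallest_penalty_within_tolerance(grid, cv_ll)
```

Here the best value is -100, so the threshold is -103 and the smallest eligible penalty is 0.1.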

Table 4. MGH stroke imaging data: 10-fold cross-validated (conditional) log-likelihood for different variable selection methods (excluded imaging variables with nonzero values less than 5% in either cases or controls)

                        Stepwise    Logic Reg   Pen1        Pen2        Pen3        Elastic Net
Two-stage procedure     −1230.30    −149.50     −145.83     −128.80     −120.07     −104.02
Simultaneous procedure  −383.87     −276.43     −302.24     −277.12     −268.36     −245.50

Based on our simulation results, it is reasonable to choose the variable selection method with the highest cross-validated (conditional) log-likelihood, as it typically leads to the best prediction accuracy. As seen in Table 5, among the 10 main effects chosen by the two-stage and simultaneous procedures with lasso penalty, eight are shared in common. Region j26 is selected by the two-stage procedure only and h72 is selected by the simultaneous procedure only. One possible reason that the two-stage procedure does not select h72 is that h72 is highly correlated with h32, which was selected by the two-stage procedure. Similarly, the simultaneous procedure does not select j26, which is highly correlated with j34, which was selected by the simultaneous procedure. We list the 20 interaction effects selected by Pen3 under the two-stage procedure and the 24 interaction effects selected under the simultaneous procedure in Web Appendix E. Fifteen of these terms are common to both procedures. As expected, the elastic net strategy selects more main effects than the lasso, and all of the main effects selected by the lasso are also selected by the elastic net. The elastic net strategy under the two-stage procedure and simultaneous procedure selected 15 and 31 interactions, respectively; these are listed in Web Appendix E.

Table 5. Names of selected main effects by two procedures with lasso or elastic net penalty

Code  Name                                                  Two-stage (lasso)  Two-stage (EN)     Simultaneous (lasso)  Simultaneous (EN)
CAD   Coronary artery disease (Y/N)                         -                  -                  -                     x
j10   Cerebral peduncle R                                   x                  x                  x                     x
j11   Anterior limb of internal capsule L                   -                  -                  -                     x
j19   Superior corona radiata L                             -                  -                  -                     x
j23   Posterior thalamic radiation L                        -                  x                  -                     -
j24   Posterior thalamic radiation R                        -                  x                  -                     -
j26   Sagittal stratum R                                    x                  x                  -                     x
j34   Fornix (cres)/Stria terminalis R                      x                  x                  x                     x
j40   Uncinate fasciculus R                                 -                  x                  -                     -
h5    Superior Frontal Gyrus L                              x                  x                  x                     x
h8    Middle Frontal Gyrus R                                x                  x                  x                     x
h21   Middle Temporal Gyrus, anterior division L            -                  -                  -                     x
h28   Inferior Temporal Gyrus, anterior division R          -                  -                  -                     x
h25   Middle Temporal Gyrus, temporooccipital part L        -                  x                  -                     -
h32   Inferior Temporal Gyrus, temporooccipital part R      x                  x                  x                     x
h43   Lateral Occipital Cortex, superior division L         -                  -                  -                     x
h45   Lateral Occipital Cortex, inferior division L         -                  x                  -                     -
h46   Lateral Occipital Cortex, inferior division R         -                  x                  -                     x
h51   Juxtapositional Lobule Cortex L                       x                  x                  x                     x
h72   Lingual Gyrus R                                       -                  x                  x                     x
h74   Temporal Fusiform Cortex, anterior division R         -                  x                  -                     x
vol1  Volume greater than or equal to its 33rd percentile   x                  x                  x                     x
vol2  Volume greater than or equal to its 67th percentile   x                  x                  x                     x

Note: EN, elastic net; L, left side of brain; R, right side of brain; x, selected; -, not selected.

Several of the brain regions selected have plausible associations with the risk of pneumonia. Because the cerebral peduncle (j10) is a brainstem motor control center, infarctions there would be expected to impair motor control of swallowing and increase the risk of pneumonia. Notably, functional MRI studies of healthy individuals have shown activation of the middle and inferior frontal gyri, the cingulate cortex (adjacent to the fornix and stria terminalis), the insular cortex, and the superior and transverse temporal gyri during different swallowing activation tasks (Martin et al., 2001).

Figure 2 displays a heatmap of the phi coefficient (a correlation coefficient for binary data) matrix plot for the imaging variables and infarct volume. Black represents perfect positive or negative correlation, and white represents no correlation. The darker the grey, the higher the association. A block correlation structure is apparent, likely reflecting high local spatial correlations in the brain. The bars on the left depict the univariate log odds ratios and associated p-values for each variable relative to the HAP outcome based on a conditional logistic regression model that accounts for the matching: black depicts the highest positive or negative association inline image inline image, white depicts no association inline image inline image; black depicts the smallest p-value=0.001 and white depicts the largest p-value = 1. The bold ticks on the axes of the correlation matrix mark the variables that were selected in the two-stage lasso version of Pen3. It is apparent that our proposed lasso procedure selects variables with high univariate association and does not select redundant variables. The bold ticks below the x-axis labels mark the variables that were selected by the elastic net method embedded in the two-stage procedure. As expected, the elastic net selects additional main effects, which are correlated with those selected by lasso. The heatmap plot for simultaneous procedure is similar and is not shown here.

Figure 2. Heatmap of the correlations among the imaging variables and infarct volume (black is perfect positive or negative correlation, white is no correlation; the darker the grey, the higher the association). The bars on the left depict the univariate log odds ratios and p-values for each imaging variable relative to the HAP outcome based on a conditional logistic regression model that accounts for the matching: black depicts the highest positive or negative association inline image inline image, white depicts no association inline image inline image; black depicts the smallest p-value = 0.001 and white depicts the largest p-value = 1. The bold ticks on the axes of the correlation matrix mark the variables that were selected in the two-stage lasso version of Pen3. The bold ticks below the x-axis labels mark the variables that were selected by the elastic net method embedded in the two-stage procedure. The last two ticks correspond to two indicator variables of infarct volume.

Finally, using the elastic net procedure, EN3, we fit an unconditional logistic regression model that includes inline image as the offset and we simultaneously estimated inline image, inline image and inline image. The estimated coefficients are listed in Web Appendix E and could be used for prediction of HAP for a future patient.
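
How fitted coefficients and an offset would combine into a predicted risk can be sketched as follows. This assumes the conventional role of an offset in logistic regression, namely a term added to the linear predictor with its coefficient fixed at one; the function name, the argument layout, and the toy inputs are hypothetical, and the actual offset and coefficients are those of Section 2 and Web Appendix E.

```python
import numpy as np

def predict_risk(offset, x, beta, intercept=0.0):
    """Predicted probability from a logistic model with an offset term.

    offset: value of the offset for the patient (enters with coefficient 1);
    x: feature vector (main effects and any interaction terms);
    beta: fitted coefficients aligned with x.
    """
    eta = offset + intercept + float(np.dot(x, beta))
    return 1.0 / (1.0 + np.exp(-eta))

# Toy illustration: with all coefficients zero, the predicted risk is
# determined by the offset alone.
baseline = predict_risk(offset=np.log(0.122 / 0.878), x=[1, 0], beta=[0.0, 0.0])
```

In this toy case the offset is set to the log odds of the 12.2% prevalence quoted earlier, so the baseline prediction equals 0.122; nonzero coefficients would shift the risk up or down from that level.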

7. Discussion

We have proposed two variable selection and prediction procedures for high dimensional data from a matched case-control study that is nested within a larger prospective study. In our context, the predictor variables of interest are available only for the nested study participants. To appropriately account for the matching, which is growing in popularity for high dimensional imaging and genomic studies, we take a likelihood-based approach. We further employ penalization to stabilize the estimation and to facilitate variable selection. Being likelihood based, our procedures can be extended to other clustered designs. Although we have not considered interactions between the matching variables and the predictor variables, these could be included as well. Our ability to go beyond variable selection to prediction hinges on the availability of the parent prospective study, from which we can estimate the proper adjustment to the log odds parameter to obtain the correct population outcome model.

We have compared the performances of our proposed two-stage and simultaneous procedures in simulation studies and for the MGH HAP study. While the variable selection and prediction performances are similar for the two procedures, the simultaneous procedure demonstrated marked computational efficiency over the two-stage procedure. In our simulation studies, the computational time of the two-stage procedure ranged from 13 to 35 times the computational time of the simultaneous procedure. The potential disadvantage of the simultaneous procedure is that it requires correct modeling of the predictor variables as well as the matching variables for variable selection, whereas the two-stage procedure requires only correct modeling of the predictor variables for variable selection. This is because the two-stage procedure removes the matching variables from the predictor variable selection step through its use of the conditional likelihood.

Wu et al. (2010) proposed a lasso penalization procedure called “screen and clean” for the purpose of identifying interactions in case-control genome-wide association studies. This procedure is similar to our Pen2 method; however, it additionally includes a final step to “clean” the selected interactions using an independent source of data. It will be useful to extend this procedure to accommodate matched study designs and for prediction when both an independent matched case-control study and a parent prospective study are available.

In conclusion, we have developed proposals for the dual goals of variable selection and prediction in the setting of high dimensional predictors that are too expensive to measure on all members of a prospective study, necessitating use of a nested, matched case-control study. This scenario arises commonly in the context of imaging and genomic variables. We have exhibited our approach on the MGH study of hospital acquired pneumonia among acute stroke patients. Based on this study, we have identified nine non-redundant, biologically plausible brain regions, together with volume, and a subset of their interactions. This may lead to selective prophylactic treatment and ultimately serve to minimize this detrimental outcome in this vulnerable population.

Acknowledgments

The authors are grateful to the editor, the associate editor, and two anonymous referees for their valuable comments. This research was supported in part by the Harvard NeuroDiscovery Center, the Harvard Catalyst (8UL1TR000170), and NIH grant R01-CA075971.

References

  • Adewale, A. J., Dinu, I., and Yasui, Y. (2010). Boosting for correlated binary classification. Journal of Computational and Graphical Statistics 19, 140–153.
  • Balasubramanian, R., Houseman, E., Coull, B., Lev, M., Schwamm, L., and Betensky, R. (in press). Variable importance in matched case-control studies in settings of high dimensional data. Journal of the Royal Statistical Society, Series C.
  • Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research: The Analysis of Case-control Studies, Vol. 1. Lyon, France: International Agency for Research on Cancer.
  • Cai, J., Fan, J., Li, R., and Zhou, H. (2005). Variable selection for multivariate failure time data. Biometrika 92, 303–316.
  • Davatzikos, C., Resnick, S., Wu, X., Parmpi, P., and Clark, C. (2008). Individual patient diagnosis of AD and FTD via high-dimensional pattern classification of MRI. NeuroImage 41, 1220–1227.
  • Davenport, R., Dennis, M., Wellwood, I., and Warlow, C. (1996). Complications after acute stroke. Stroke 27, 415–420.
  • Duchesne, S., Caroli, A., Geroldi, C., Collins, D. L., and Frisoni, G. B. (2009). Relating one-year cognitive change in mild cognitive impairment to baseline MRI features. NeuroImage 47, 1363–1370.
  • Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20, 101–148.
  • Hilker, R., Poetter, C., Findeisen, N., Sobesky, J., Jacobs, A., Neveling, M., et al. (2003). Nosocomial pneumonia after acute stroke: Implications for neurological intensive care medicine. Stroke 34, 975–981.
  • Katzan, I. L., Dawson, N. V., Thomas, C. L., Votruba, M. E., and Cebul, R. D. (1998). The cost of pneumonia after acute stroke. Neurology 68, 1938–1943.
  • Lin, D. Y. and Ying, Z. (1993). Cox regression with incomplete covariate measurements. Journal of the American Statistical Association 88, 1341–1349.
  • Mantel, N. (1973). Synthetic retrospective studies and related topics. Biometrics 29, 479–486.
  • Martin, R., Goodyear, B., Gati, J., and Menon, R. (2001). Cerebral cortical representation of automatic and volitional swallowing in humans. Journal of Neurophysiology 85, 938–950.
  • Meier, L., van de Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70, 53–71.
  • Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403–412.
  • Rothman, K. J., Greenland, S., and Lash, T. L. (2008). Modern Epidemiology, 3rd edition. Philadelphia, PA: Lippincott Williams & Wilkins.
  • Ruczinski, I., Kooperberg, C., and Leblanc, M. (2003). Logic regression. Journal of Computational and Graphical Statistics 12, 475–511.
  • Tan, Q., Thomassen, M., and Kruse, T. A. (2007). Feature selection for predicting tumor metastases in microarray experiments using paired design. Cancer Informatics 3, 213–218.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
  • Tworoger, S., Rosner, B., Willett, W., and Hankinson, S. (2011). The combined influence of multiple sex and growth hormones on risk of postmenopausal breast cancer: A nested case-control study. Breast Cancer Research 13, R99.
  • Upadya, A., Thorevska, N., Sena, K., Manthous, C., and Amoateng-Adjepong, Y. (2004). Predictors and consequences of pneumonia in critically ill patients with stroke. The Journal of Critical Care 19, 16–22.
  • Woodward, M. (2005). Epidemiology: Study Design and Data Analysis, 2nd edition. Boca Raton, FL: Chapman & Hall/CRC.
  • Wu, J., Devlin, B., Ringquist, S., Trucco, M., and Roeder, K. (2010). Screen and clean: A tool for identifying interactions in genome-wide association studies. Genetic Epidemiology 34, 275–285.
  • Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 714–721.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, 301–320.

Supporting Information

All Supplemental Data may be found in the online version of this article.

Filename                            Format  Size  Description
biom12113-sup-0001-SuppData-S1.pdf          248K  Supplementary Materials.
