The evidence on a potential association between lesion location and post-stroke hospital-acquired pneumonia (HAP) is inconsistent (Hilker et al., 2003; Upadya et al., 2004). The MGH HAP study was undertaken to identify specific brain regions of acute infarction that are linked to HAP in acute ischemic stroke patients and to formulate a prediction rule that could be used for early evaluation of the risk of HAP in newly admitted stroke patients.

#### 6.1. Variable Selection

The potential predictors include both clinical and neuroimaging variables. Clinical variables include dyslipidemia, smoking history, coronary artery disease, diabetes mellitus, atrial fibrillation, and hypertension. Raw images based on either non-contrast computed tomography (CT) scans or diffusion-weighted magnetic resonance imaging (MRI-DWI) were obtained soon after the onset of stroke symptoms. After the raw images were co-registered using specialized software, the infarct lesion maps were subsegmented into 69 pairs of mirrored cortical and subcortical regions based on the “Harvard-Oxford cortical structural” and “JHU DTI-based white-matter” atlases. The regional percentage of infarcted tissue was then determined for each patient in each of the 138 standardized regions. We dichotomized the percentage of infarction in each brain region at its median, and we categorized infarct volume according to the tertiles of its distribution. To stabilize the numerical calculations, we excluded the 16 brain regions for which fewer than 5% of cases or fewer than 5% of controls displayed positive infarction.
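The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's code: the function names are hypothetical, "dichotomized at the median" is read as "strictly above the median", and "positive infarction" is read as a nonzero regional percentage.

```python
from statistics import median, quantiles

def dichotomize_at_median(values):
    """Return 1 where a value lies strictly above the sample median, else 0 (assumed rule)."""
    m = median(values)
    return [1 if v > m else 0 for v in values]

def tertile_category(values):
    """Assign 0, 1, or 2 according to the tertiles of the sample distribution."""
    low_cut, high_cut = quantiles(values, n=3)  # two cut points
    return [0 if v <= low_cut else 1 if v <= high_cut else 2 for v in values]

def keep_region(pct_infarct, is_case, min_frac=0.05):
    """Keep a region only if at least min_frac of cases AND of controls show infarction.

    'Shows infarction' is assumed to mean a regional percentage above zero.
    """
    cases = [v > 0 for v, c in zip(pct_infarct, is_case) if c]
    controls = [v > 0 for v, c in zip(pct_infarct, is_case) if not c]
    return (sum(cases) / len(cases) >= min_frac
            and sum(controls) / len(controls) >= min_frac)

# Toy usage with made-up numbers (not study data).
region_pct = [0, 0, 10, 40, 80]
print(dichotomize_at_median(region_pct))
print(tertile_category([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Applying `keep_region` to each of the 138 regions and retaining those for which it returns `True` corresponds to the exclusion step; the 5% threshold is the only tuning value taken from the text.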

We compared the performance of the proposed lasso and elastic net based strategies with stepwise regression and logic regression. The elastic net is potentially advantageous over the lasso when some true predictors are highly correlated, as are some of the imaging variables, and it is of scientific interest to identify them all.
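The difference between the two penalties under strong correlation can be seen in a small toy example. The sketch below is an assumption-laden illustration, not the method used in the study: it uses a squared-error loss rather than a (conditional) logistic likelihood, a plain coordinate-descent solver written for this example, and two perfectly correlated made-up predictors. The names `soft_threshold` and `elastic_net_cd` are hypothetical.

```python
def soft_threshold(r, t):
    """Soft-thresholding operator used by lasso-type coordinate updates."""
    if r > t:
        return r - t
    if r < -t:
        return r + t
    return 0.0

def elastic_net_cd(X, y, lam, alpha, sweeps=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)
    by cyclic coordinate descent. alpha=1 gives the lasso."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of predictor j with the partial residual (all other terms removed).
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
    return beta

# Two identical (perfectly correlated) predictors; the response depends on their shared signal.
signal = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in signal]
y = [2.0 * v for v in signal]

print("lasso      :", elastic_net_cd(X, y, lam=1.0, alpha=1.0))
print("elastic net:", elastic_net_cd(X, y, lam=1.0, alpha=0.5))
```

In this toy setting the lasso fit places all of the weight on the first predictor and sets the second to zero, whereas the elastic net fit gives the two predictors equal coefficients, so both would be identified.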

For variable selection using the simultaneous procedure, we need to estimate the offset defined in Section 2. This estimate is also needed to fit the prediction model under the two-stage procedure. There are two ways to estimate this quantity: we may estimate it directly, or we may decompose it according to Bayes’ theorem and estimate its components separately. We illustrate the latter approach, as it is preferable when external estimates are available for any of the components, or when any of the components can be estimated particularly well from the prospective parent study. By Bayes’ theorem, the offset can be written in terms of these components. The first component, the prevalence rate of HAP in acute stroke patients, is estimated to be 12.2%. Estimation of the remaining component is based on the available disease status and matching variable information for the 1,851 of the 1,915 patients in the cohort study who are not missing any of these variables. Among the 1,851 patients, age ranges from 11 to 103 years, and NIHSS ranges from 0 to 36. To estimate this component empirically, we create a nearest-neighbor window around the *i*th observed subject in the cohort data and count the number of subjects with and without HAP falling within the window. Details of the algorithm are provided in Web Appendix D. This nonparametric approach may suffer from the “curse of dimensionality” when the matching variables are high dimensional. In this situation, we recommend direct estimation of the offset, based on a model such as logistic regression.
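The window-counting idea can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the exact definition of the window and of the offset is given in Section 2 and Web Appendix D and is not reproduced here, the matching variables are taken to be age and NIHSS, the window is taken to be a rectangle with hypothetical half-widths, and the function name is invented. The sketch only shows the generic Bayes step of combining a prevalence with window counts among subjects with and without HAP.

```python
import math

def log_odds_given_matching(cohort, center, prevalence, half_age=5.0, half_nihss=2.0):
    """Bayes-rule log odds of HAP for matching-variable values near `center`.

    cohort     : list of (age, nihss, hap) tuples with hap in {0, 1}
    center     : (age, nihss) of the subject the window is built around
    prevalence : marginal probability of HAP (possibly from an external source)
    The rectangular window and its half-widths are assumptions of this sketch.
    """
    n_hap = sum(d for _, _, d in cohort)
    n_no_hap = len(cohort) - n_hap
    in_window = [
        d for age, nihss, d in cohort
        if abs(age - center[0]) <= half_age and abs(nihss - center[1]) <= half_nihss
    ]
    k_hap = sum(in_window)
    k_no_hap = len(in_window) - k_hap
    if k_hap == 0 or k_no_hap == 0:
        raise ValueError("window too narrow: need subjects with and without HAP")
    # Likelihood ratio: P(window | HAP) / P(window | no HAP), estimated by proportions.
    likelihood_ratio = (k_hap / n_hap) / (k_no_hap / n_no_hap)
    prior_log_odds = math.log(prevalence / (1.0 - prevalence))
    return prior_log_odds + math.log(likelihood_ratio)

# Toy cohort with made-up values (not study data); 0.122 is the prevalence quoted in the text.
toy_cohort = [(70, 10, 1), (72, 11, 1), (71, 9, 0), (40, 2, 0), (45, 3, 0), (30, 1, 0)]
print(log_odds_given_matching(toy_cohort, center=(70, 10), prevalence=0.122))
```

Separating the prevalence from the window counts is what allows an external prevalence estimate to be plugged in. When the window is empty on one side, as it often would be with many matching variables, the sketch fails, which is the practical face of the curse of dimensionality mentioned above.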

Table 4. MGH stroke imaging data: 10-fold cross-validated (conditional) log-likelihood for different variable selection methods (excluded imaging variables with nonzero values less than 5% in either cases or controls)

| Procedure | Stepwise | Logic Reg | Pen1 | Pen2 | Pen3 | Elastic Net |
|---|---|---|---|---|---|---|
| Two-stage procedure | −1230.30 | −149.50 | −145.83 | −128.80 | −120.07 | −104.02 |
| Simultaneous procedure | −383.87 | −276.43 | −302.24 | −277.12 | −268.36 | −245.50 |

Based on our simulation results, it is reasonable to choose the variable selection method with the highest cross-validated (conditional) log-likelihood, as it typically leads to the best prediction accuracy. As seen in Table 5, of the 10 main effects chosen by either the two-stage or the simultaneous procedure with the lasso penalty, eight are common to both. Region j26 is selected by the two-stage procedure only, and h72 is selected by the simultaneous procedure only. One possible reason that the two-stage procedure does not select h72 is that h72 is highly correlated with h32, which the two-stage procedure did select. Similarly, the simultaneous procedure does not select j26, which is highly correlated with j34, a region that the simultaneous procedure did select. The 20 interaction effects selected by Pen3 under the two-stage procedure and the 24 interaction effects selected under the simultaneous procedure are listed in Web Appendix E. Fifteen of these terms are common to both procedures. As expected, the elastic net strategy selects more main effects than the lasso, and all of the main effects selected by the lasso are also selected by the elastic net. The elastic net strategy selected 15 interactions under the two-stage procedure and 31 under the simultaneous procedure; these are also listed in Web Appendix E.
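The rule of choosing the method with the highest cross-validated log-likelihood can be applied mechanically to the values reported in Table 4. The snippet below is a small sketch with hypothetical variable names. It compares methods within each procedure only, on the assumption that the two rows are not on a common likelihood scale.

```python
# Cross-validated (conditional) log-likelihoods copied from Table 4.
cv_loglik = {
    "Two-stage": {
        "Stepwise": -1230.30, "Logic Reg": -149.50, "Pen1": -145.83,
        "Pen2": -128.80, "Pen3": -120.07, "Elastic Net": -104.02,
    },
    "Simultaneous": {
        "Stepwise": -383.87, "Logic Reg": -276.43, "Pen1": -302.24,
        "Pen2": -277.12, "Pen3": -268.36, "Elastic Net": -245.50,
    },
}

# For each procedure, pick the method whose log-likelihood is largest (closest to zero).
best_method = {
    procedure: max(scores, key=scores.get)
    for procedure, scores in cv_loglik.items()
}
print(best_method)
```

With these values the elastic net has the highest cross-validated log-likelihood under both procedures, followed by Pen3 under the two-stage procedure.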

Several of the selected brain regions have plausible associations with the risk of pneumonia. Because the cerebral peduncle (j10) is a brainstem motor control center, infarctions there would be expected to impair motor control of swallowing and increase the risk of pneumonia. Notably, functional MRI studies of healthy individuals have shown activation of the middle and inferior frontal gyri, the cingulate cortex (adjacent to the fornix and stria terminalis), the insular cortex, and the superior and transverse temporal gyri during different swallowing activation tasks (Martin et al., 2001).