Feature selective temporal prediction of Alzheimer's disease progression using hippocampus surface morphometry

Abstract

Introduction: Prediction of Alzheimer's disease (AD) progression from baseline measures allows us to understand disease progression and has implications for decisions concerning treatment strategy. To this end, we combine a predictive multi-task machine learning method (cFSGL) with a novel MR-based multivariate morphometric surface map of the hippocampus (mTBM) to predict future cognitive scores of patients.

Methods: Previous work has shown that a multi-task learning framework that predicts all future time points simultaneously (cFSGL) can encode both sparsity and temporal smoothness. The authors showed that this method is able to predict cognitive outcomes of ADNI subjects using FreeSurfer-based baseline MRI features, MMSE score, demographic information and ApoE status. Whilst volumetric information may hold generalized information on brain status, we hypothesized that hippocampus-specific information may be more useful in predictive modeling of AD. To this end, we applied a multivariate tensor-based parametric surface analysis method (mTBM) to extract features from the hippocampal surfaces.

Results: We combined mTBM features with traditional surface features such as middle-axis distance, the Jacobian determinant and 2 of the Jacobian principal eigenvalues to yield 7 normalized hippocampal surface maps of 300 points each. By combining these 7 × 300 = 2100 features with the previous ~350 features, we illustrate how this type of sparsifying method can be applied to an entire surface map of the hippocampus, yielding a feature space 2 orders of magnitude larger than what was previously attempted.

Conclusions: By combining the power of the cFSGL multi-task machine learning framework with AD-sensitive mTBM feature maps of the hippocampus surface, we are able to improve the predictive performance of ADAS cognitive scores 6, 12, 24, 36 and 48 months from baseline.


| INTRODUCTION
Recent work in psychological testing (Caselli et al., 2013), genetic studies (Elias-Sonnenschein et al., 2013), magnetic resonance (MR) imaging (Teipel et al., 2013), positron emission tomography (PET) imaging (Becker et al., 2013), cerebral spinal fluid (CSF) measurements (Blennow & Zetterberg, 2013), cardiovascular status (Hajjar, Brown, Mack, & Chui, 2013) and others has yielded tremendous amounts of data for diagnosing and staging dementias, especially Alzheimer's disease (AD). Moreover, many of these studies now also include longitudinal information (Caselli et al., 2013; Mueller et al., 2005). This has led to a problem often referred to as the 'curse of dimensionality', where the size (number of dimensions) of the dataset makes it difficult to perform numerical analyses on the data, which in turn makes it increasingly difficult to draw consistent conclusions from the dataset. Traditional approaches to dimension reduction eliminate variables/dimensions based on clinical assumptions and allow us to test specific hypotheses about the disease model. However, they do not lend themselves to discovering new correlations or allow for all-inclusive models that are consistent across all dimensions. These problems become even more important when trying to improve predictions using machine learning techniques, mainly because beyond a certain point the predictive power of the model ceases to increase by simply adding more information or dimensions. The question is then how to select the "correct" features to maximize predictive power. Zhou, Liu, Narayan, and Ye (2013) outline a method that simultaneously enforces low dimensionality through sparsity of weights and temporal smoothness of the predicted behavioral scores at 6, 12, 24, 36 and 48 months.
This paper leverages this method, built specifically for progressive disease models such as AD, together with multivariate tensor-based morphometric (mTBM) features (Wang, Yuan, et al., 2010) of the hippocampus to predict AD progression up to 48 months from the baseline MRI measurement. The goal is to evaluate the predictive power of mTBM against that of cortical thickness and other FreeSurfer-based features, demographic information (sex and age) and genetic information (number of ApoE-ε4 copies).
At the same time, the machine learning community has recognized the utility of predicting disease progression as a means of characterizing AD, since it allows for an inclusive look at how the different diagnostic indicators account for observed changes.
However, combining cFSGL with more AD-specific and AD-sensitive features, such as surface deformation fields of the hippocampus, might improve the predictive power of the algorithm significantly. To this end, we augmented the generic FreeSurfer-based image features with novel mTBM features of the hippocampus and other features based on the surface deformation field (see Table 1 for features), which significantly increased the predictive power of the cFSGL technique.

| The ADNI dataset
ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects, but ADNI has since been followed by ADNI-GO and ADNI-2. ADNI-GO ("Grand Opportunities") and ADNI-2 supplement ADNI by trying to identify patients in the pre-dementia or early mildly cognitively impaired (eMCI) phase. To date these three protocols have recruited over 1500 adults, ages 55 to 90, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see adni.loni.usc.edu.

| FreeSurfer MRI features
The MRI image analysis software FreeSurfer (Fischl, 2012) was used to extract 305 MRI features based on cortical reconstruction and volumetric segmentation. The features can be grouped into 5 categories: average cortical thickness, standard deviation in cortical thickness, the volumes of cortical parcellations (based on regions of interest automatically segmented in the cortex), the volumes of specific white matter parcellations, and the total surface area of the cortex. This process was performed by the ADNI team at UCSF under the ADNI harmonized MRI processing protocols as outlined on their website (http://adni.loni.usc.edu/methods/mri-analysis/). See Table 1 for a more complete feature list and breakdown.

| Hippocampus surface computation
The details of the entire methodology for extracting mTBM features from surface-registered hippocampal maps are outlined in Shi et al.; here we outline the key steps of the method. A topology-correction step (Han, Xu, & Prince, 2003) was applied to ensure the segmentation was topologically correct before tessellation via a marching cubes algorithm (Lorensen & Cline, 1987). Table 1 lists the original features from Zhou et al. (2013) and the new surface features (downsized by a factor of 10) computed from the hippocampus that were used to predict outcomes at 6, 12, 24, 36 and 48 months.

| Conformal representation and surface registration of the hippocampus
In order for discretized imaging data to be used in group analysis and prediction tasks, they must be transformed into a common space that allows for one-to-one correspondence across subjects. Examples of the mean hippocampal common space can be seen in Figure 1. In our case, we use measurements on a discretized surface represented by vertices in ℝ^3 and edges between the vertices.
In this case, we first conformally mapped the hippocampal surface onto a rectangular planar surface using holomorphic 1-forms. The surface conformal representation is then computed using the local conformal factor as well as the mean curvature. The dynamic range of the conformal representation is then linearly scaled to form the feature image of the surface. The feature image is aligned with a template image via fluid registration in a curvilinear coordinate system that compensates for distortions due to the conformal parameterization. There are numerous advantages to using conformal representation with fluid registration to align the hippocampal surfaces: (1) the entire transform is diffeomorphic and therefore yields shape correspondences that are smooth and one-to-one.
(2) The transform is inverse consistent and therefore more robust than unidirectional transformations (Leow et al., 2005). (3) Because conformal parametrization induces a simple Riemannian metric, the Navier-Stokes equation in the fluid registration can be easily adjusted for area distortion (Wang, Chiang, & Thompson, 2005a,b).

| Multivariate tensor-based morphometry (mTBM)
After automatically segmenting the hippocampus with FSL (Jenkinson et al., 2012) from brain MR images, we built parametric meshes to model hippocampal shapes. High-order correspondences between hippocampal surfaces were enforced across subjects with a novel inverse-consistent surface fluid registration method. Multivariate statistics consisting of multivariate tensor-based morphometry (mTBM) and radial distance were computed for surface deformation analysis (Wang, Yuan, et al., 2010).
Multivariate tensor-based morphometric (mTBM) analysis has been used as a sensitive method of comparing deformation fields of different subjects with the aim of discovering group-wise differences (Wang, Zhang, et al., 2010). mTBM generates Riemannian manifold-valued measures from the full deformation fields that map each subject to the template space, and statistics are computed on these measures. Specifically, compared with univariate TBM, which uses the Jacobian of the transformation and thus mainly describes volumetric changes, mTBM uses the full deformation information by applying a manifold version of Hotelling's T² test in log-Euclidean space. The idea is to describe higher-order transformations with a single metric instead of using metrics derived from the Jacobian (see Figure 1 for examples of mTBM features). Prior work showed that a surface derived from a reasonable FSL segmentation is sensitive enough to detect group-wise differences in the mTBM features. Moreover, mTBM is also more statistically sensitive, with better power as measured by false discovery rates. In this work, we added these sensitive features to the existing MR-based surface area and volumetric features to boost AD prediction accuracy.

Zhou et al. (2013) proposed a powerful multi-task learning technique that incorporates sparsity as well as temporal smoothing for modeling a progressive disease. In their formulation, each task can be thought of as a single forward predictor from baseline measurements to a measurement at a certain future time point. In their case, they used the ADNI dataset and predicted ADAS cognitive scores 6 months after baseline (M06), 12 months after baseline (M12), 24 months after baseline (M24), 36 months after baseline (M36) and 48 months after baseline (M48).
In our study, we use the same ADNI dataset but also incorporate 7 hippocampus surface feature maps of 300 points each (2100 features in total) and compare the resulting predictive performance with that obtained using only the simple regional volumes and surface areas (305 features in total) of their study.
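For intuition, the per-vertex log-Euclidean quantity behind the mTBM maps can be sketched as follows. This is a simplified illustration, not the authors' implementation; `J` is assumed to be the 2 × 2 Jacobian of the surface-to-template map in local conformal coordinates, and the function name is ours.

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def mtbm_vertex_features(J):
    """Log-Euclidean mTBM features at one surface vertex.

    S = sqrt(J^T J) is the symmetric positive-definite deformation tensor;
    its matrix logarithm moves it into a vector space where standard
    multivariate (Hotelling-style) statistics apply.
    """
    S = sqrtm(J.T @ J)               # deformation tensor
    L = np.real(logm(S))             # log-Euclidean representation
    # S is symmetric, so three unique components summarize the 2x2 tensor
    return np.array([L[0, 0], L[1, 1], np.sqrt(2.0) * L[0, 1]])

# An identity Jacobian (no deformation) yields the zero feature vector;
# a uniform 2x expansion yields log(2) on the two diagonal components.
```

Three mTBM components per vertex would be consistent with the 7 maps described above: 3 mTBM features plus radial distance, the Jacobian determinant and 2 principal eigenvalues.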

| Convex fused sparse group lasso
The cFSGL method that we use can be considered a multi-task regression problem with t time points and n subjects, each with d features, where x_1, x_2, …, x_n represent each subject's d input features at baseline (i.e. x_i ∈ ℝ^d). Similarly, y_1, y_2, …, y_n represent the target cognitive scores for each subject at the t time points (i.e. y_i ∈ ℝ^t). For a single subject, each task can be seen as a projection of the MR/demographic/genetic baseline measurements x_i to a future cognitive score at a single time point (e.g. at 48 months), given by the corresponding entry of y_i. We extend this formulation to a multi-task one by performing the projections for all time points simultaneously: each set of baseline measurements x_i ∈ ℝ^d is projected to the vector y_i ∈ ℝ^t covering all t time points. The entire population-based mapping can be summarized as a linear operation using matrices X and Y, formed by stacking the input and output feature vectors row-wise, i.e. X = [x_1, x_2, …, x_n]^T ∈ ℝ^(n×d) and Y = [y_1, y_2, …, y_n]^T ∈ ℝ^(n×t). Since this is a linear model, a weight matrix W = [w_1, w_2, …, w_t] ∈ ℝ^(d×t) is trained to map X to Y.
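As a concrete illustration of the shapes involved (the feature dimension follows this study: 305 FreeSurfer features plus 2100 hippocampal surface features; the random data, subject count and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 50, 2405, 5              # subjects, baseline features, time points

X = rng.standard_normal((n, d))    # row i: subject i's baseline features (x_i in R^d)
W = rng.standard_normal((d, t))    # column j: weights for time point j (M06..M48)
Y_hat = X @ W                      # predicted cognitive scores, one row per subject

assert Y_hat.shape == (n, t)       # each subject gets t = 5 predicted scores
```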
To achieve a set of weights that encodes both sparsity and temporal smoothness, the following cost function is minimized during training:

min_W ‖XW − Y‖_F^2 + λ₁‖W‖₁ + λ₂‖RWᵀ‖₁ + λ₃‖W‖₂,₁

where ‖W‖₁ is the L1-norm (lasso) penalty that encodes sparsity; ‖W‖₂,₁, the sum of the L2-norms of the rows of W, is the group lasso penalty that encodes temporal grouping of features, selecting or discarding each feature jointly across all time points; and ‖RWᵀ‖₁ is the fused lasso penalty, with R = Hᵀ ∈ ℝ^((t−1)×t) defined by R_ii = 1, R_(i,i+1) = −1 and 0 otherwise. This last term penalizes differences between the weights of consecutive time points and thus encodes temporal smoothness (Zhou et al., 2013).
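A minimal sketch of evaluating this cost function for a candidate weight matrix (the function name and the explicit construction of R are ours; the actual optimizer in Zhou et al., 2013 is an accelerated proximal gradient method, which we omit):

```python
import numpy as np

def cfsgl_cost(X, Y, W, lam1, lam2, lam3):
    """Value of the cFSGL objective for a candidate weight matrix W (d x t)."""
    t = W.shape[1]
    # R differences weights at consecutive time points: row i of R @ W.T
    # equals w_i - w_{i+1}, where w_i is the weight vector for time point i.
    R = np.eye(t - 1, t) - np.eye(t - 1, t, k=1)
    fit = np.linalg.norm(X @ W - Y, "fro") ** 2      # data-fit term
    lasso = np.abs(W).sum()                          # L1 sparsity penalty
    fused = np.abs(R @ W.T).sum()                    # temporal-smoothness penalty
    group = np.linalg.norm(W, axis=1).sum()          # group lasso over time points
    return fit + lam1 * lasso + lam2 * fused + lam3 * group
```

With all λ set to zero this reduces to the ordinary least-squares loss; the three penalties respectively zero out unused features, tie each feature's selection together across time points, and discourage abrupt changes between consecutive time points.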

| RESULTS
Predictions using hippocampus-based feature maps outperform predictions made without them, as shown by quantitative measures such as nMSE, wR and rMSE. This was true across the board, at all time points (see Table 2 and Figure 2). Our results show that incorporating large feature maps into sparsifying prediction tasks is not only possible but may improve the resulting predictions.
The results shown are from 2 simulation experiments in which data from ADNI were used to both train and test the cFSGL model. Table 2 shows how predictive performance improved by incorporating hippocampus surface features into our dataset; there were improvements in predicting behavioral outcomes at every time point.
Moreover, by examining the weights used in predicting the behavioral outcomes, we may be able to see which parts of the hippocampus feature maps are most often used in predicting behavior. Figures 3 and 4 show that the raw prediction results from our multiple cross-validation runs are reasonably distributed. These results were then used to calculate the different predictive performance measures, such as mean square error.

| DISCUSSION AND CONCLUSIONS
By merging fused multi-task learning that encodes temporal smoothing (Zhou et al., 2013) with AD-sensitive mTBM maps of the parametric hippocampus surface, we were able to achieve significant gains in future ADAS cognitive score prediction. These results are among the highest-performing predictions based on baseline data only, which is consistent with our survey of other comparable studies (Zhou et al., 2013). There are two main findings in our work. First, we demonstrate that surface mTBM, when combined with other features, may significantly boost statistical power.
This finding is in line with many of our prior studies (Wang et al., 2013; Shi et al., 2014). The newly combined surface statistics encode a great deal of neighboring intrinsic geometry information that would otherwise be discarded. Second, with proper tuning of the parameters to match the feature size, the sparsity constraint was also able to prevent overfitting, which tends to occur when using a large number of features. Our work sheds some light on future efforts to predict longitudinal neuropsychological changes and may help solve this challenging research problem.
One factor not addressed in this work is the effect of the percentage of data used for training and testing. Previous work (Zhou et al., 2013) has shown that although performance decreases with a smaller training set, the trends and relative performance remain comparable. One avenue we intend to explore is the use of stability selection to seed the initial weights for the algorithm in a hierarchical approach to learning. We believe this is a reasonable way of leveraging prior information whilst still allowing the algorithm to enforce temporal smoothness and sparsity.
As this is a model of an epidemiological system, we cannot ignore the investigator's selection of reasonable features. Moreover, the performance of the system is as interesting as the weights that yield the predictions.

| Future Work
Our future work includes understanding the behavior of the weights across the parametric surface space as well as over time. Previous work has shown that stability selection may be a good fit for analyzing these weights.

CONFLICT OF INTEREST
None declared.