Keywords:

  • BLUPs;
  • Kernel function;
  • Model/variable selection;
  • Nonparametric regression;
  • Penalized likelihood;
  • REML;
  • Score test;
  • Smoothing parameter;
  • Support vector machines

Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Semiparametric Model for Multidimensional Data
  5. 3. LSKM Estimation in the Semiparametric Model
  6. 4. LSKMs and Linear Mixed Models
  7. 5. Model Selection within the Kernel Machine Framework
  8. 6. Application to the Prostate Cancer Genetic Pathway Data
  9. 7. Simulation Studies
  10. 8. Discussion
  11. 9. Supplementary Materials
  12. Acknowledgements
  13. References
  14. Supporting Information

Summary We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.


1. Introduction

Analysis of microarray data has mainly focused on detecting individually significantly expressed genes (Efron et al., 2001; Tusher, Tibshirani, and Chu, 2001). This approach has some major limitations: (1) a long list of individually significant genes without a single encompassing theme is difficult to interpret; (2) cellular processes often affect sets of genes, and individually highly ranked genes are often downstream genes, so moderate changes in many genes may give more insight into biological mechanisms than a dramatic change in a single gene (Mootha et al., 2003); (3) individually highly ranked genes can be poorly annotated and are often not reproducible across studies (Fortunel et al., 2003). Researchers have now become more interested in knowledge-based studies on gene sets, for example, genetic pathways that are more biologically interpretable and reproducible (Goeman et al., 2005; Subramanian et al., 2005).

A data example motivating the proposed research is the data from the Michigan prostate cancer study (Dhanasekaran et al., 2001). Prostate-specific antigen (PSA) has been routinely used as a biomarker for screening prostate cancer. Recently there have been significant breakthroughs in the effort of finding candidate genes related to prostate cancer. The early results of Dhanasekaran et al. (2001) indicate that certain functional genetic pathways seemed dysregulated in prostate cancer relative to noncancerous tissues. One is interested in studying the genetic pathway effects on PSA after adjusting for effects of clinical and demographic covariates. Due to the complicated unknown relationships between genes and PSA, we propose a flexible framework to model the genetic pathway effect parametrically or nonparametrically.

There is a vast literature on multidimensional nonparametric modeling. Methods such as multivariate kernel smoothing (Wand and Jones, 1995), projection pursuit regression (Friedman and Stuetzle, 1981), and multivariate adaptive regression splines (MARS) (Friedman, 1991), are usually computationally expensive. Popular spline-based methods include generalized additive models (GAMs) (Hastie and Tibshirani, 1990), thin-plate splines (Wahba, 1990; Green and Silverman, 1994), penalized regression splines (Ruppert, Wand, and Carroll, 2004), and smoothing spline ANOVA (Gu, 2002). These methods require the specification of the smoothness condition of an unknown function using differentiability conditions, which is much more involved and awkward in multidimensional settings.

In the past decade, the kernel machine method has been developed in machine learning as a powerful learning technique for multidimensional data (Vapnik, 1998; Schölkopf and Smola, 2002; Suykens et al., 2002; Rasmussen and Williams, 2006). Popular examples of kernel machine methods include support vector machine (SVM) (Vapnik, 1998) and Bayesian Gaussian process (Rasmussen and Williams, 2006). In the context of function approximation, kernel machine methods and spline-based methods share a similar theoretical foundation, but their model-fitting philosophies are different. Kernel machine methods start with a kernel function that implicitly determines the smoothness property of the unknown function. By contrast, spline-based methods start with the smoothness conditions of the unknown function and a corresponding kernel function can usually be derived from these conditions (Wahba, 1990). Kernel machine methods hence greatly simplify specification of a nonparametric model, especially for multidimensional data.

In this article, we propose a semiparametric model for covariate and genetic pathway effects on a continuous outcome (e.g., PSA), where the covariate effects are modeled parametrically and the genetic pathway effect is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). We establish a connection between LSKMs and linear mixed models, and show that the LSKM estimator of the regression coefficients and the pathway effect can be obtained by fitting a linear mixed model. This connection provides a unified framework for inference of parameters in models with multidimensional covariates, including the regression coefficients, the nonparametric function, and smoothing parameters. Our work extends the connection between univariate smoothing splines and linear mixed models (Speed, 1991; Wang, 1998; Zhang et al., 1998) to multivariate smoothing with an arbitrary kernel function. We also propose a score test to test for the nonparametric genetic pathway effect, and a model/variable selection method within the LSKM framework.

The rest of the article is organized as follows. In Section 2, we present the semiparametric model for Gaussian outcomes. In Section 3, we describe the LSKM method. In Section 4, we establish a connection between LSKMs and linear mixed models and propose a score test for the genetic pathway effect. We discuss the variable selection problem in LSKM in Section 5. The proposed method is illustrated using the prostate cancer microarray data in Section 6, and its performance is evaluated by simulations in Section 7. The article ends with a discussion in Section 8.

2. Semiparametric Model for Multidimensional Data

2.1 The Model

Suppose the data consist of $n$ subjects. For subject $i$ ($i = 1, \ldots, n$), $y_i$ is a normally distributed continuous outcome, $x_i$ is a $q \times 1$ vector of clinical covariates, and $z_i$ is a $p \times 1$ vector of gene expressions within a pathway. We assume an intercept is included in $x_i$. The outcome $y_i$ depends on $x_i$ and $z_i$ through the following partial linear model

  $$y_i = x_i^{T}\beta + h(z_i) + e_i, \tag{1}$$

where β is a q× 1 vector of regression coefficients, h(zi) is an unknown centered smooth function, and the errors ei are assumed to be independent and follow N(0, σ2).

Model (1) treats the covariate effects parametrically and the pathway effect parametrically or nonparametrically. When h(·) = 0, (1) reduces to the standard linear regression model. When xi = 1, it reduces to LSKM regression (Suykens et al., 2002).

2.2 Specifications of a Function Space of h(z) Using a Kernel

We assume the nonparametric function $h(z)$ lies in a function space $\mathcal{H}_K$ generated by a positive definite kernel function $K(\cdot, \cdot)$. From Mercer's theorem (Cristianini and Shawe-Taylor, 2000), under some regularity conditions, a kernel function $K(\cdot, \cdot)$ implicitly specifies a unique function space spanned by a particular set of orthogonal basis functions (features) $\{\phi_j(z)\}_{j=1}^{J}$. In other words, any $h(z) \in \mathcal{H}_K$ can be represented using a set of bases as $h(z) = \sum_{j=1}^{J}\omega_j\phi_j(z) = \phi(z)^{T}\omega$ (the primal representation), where $\omega$ is a vector of coefficients. Equivalently, $h(z)$ can also be represented using a kernel function $K(\cdot, \cdot)$ as $h(z) = \sum_{l=1}^{L}\alpha_l K(z_l^{*}, z)$ (the dual representation), for some integer $L$, some constants $\alpha_l$, and some $\{z_1^{*}, \ldots, z_L^{*}\} \in R^{p}$. For a multidimensional $z$, it is more convenient to specify $h(z)$ using the dual representation, because explicit basis functions or features might be complicated to specify, and the number of features might be high or even infinite.

Two popular kernel functions and the corresponding function spaces are as follows. (1) The dth polynomial kernel: $K(z_1, z_2) = (z_1^{T}z_2 + \rho)^{d}$, where $\rho$ and $d$ are tuning parameters. The dth polynomial kernel generates the function space $\mathcal{H}_K$ spanned by all possible dth-order monomials of the components of $z$. For example, if $d = 1$, the first polynomial kernel generates the linear function space with basis functions $\{\phi_j(z)\} = \{z_1, \ldots, z_p\}$. If $d = 2$, the second polynomial kernel corresponds to the quadratic function space with basis functions $\{\phi_j(z)\} = \{z_k, z_k z_{k'}\}$ ($k, k' = 1, \ldots, p$), that is, the main effects, all two-way interactions, and quadratic main effects of the $z_k$'s. (2) The Gaussian kernel: $K(z_1, z_2) = \exp\{-\|z_1 - z_2\|^{2}/\rho\}$, where $\|z_1 - z_2\|^{2} = \sum_{k=1}^{p}(z_{1k} - z_{2k})^{2}$. The Gaussian kernel generates the function space spanned by radial basis functions. See Buhmann (2003) for their mathematical properties and desirable features. Examples of other choices of kernel functions include the sigmoid and neural network kernels, and the B-spline kernel (Schölkopf and Smola, 2002). The choice of a kernel function determines which function space one would like to use to approximate h(z).
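
As a minimal illustration of the two kernels just described, the following Python sketch builds the corresponding kernel matrices with NumPy. The function names, the toy data, and the parameter values are assumptions made for this example and do not come from the article.

```python
import numpy as np

def polynomial_kernel(Z1, Z2, rho=1.0, d=2):
    """d-th polynomial kernel: K(z1, z2) = (z1^T z2 + rho)^d for all row pairs."""
    return (Z1 @ Z2.T + rho) ** d

def gaussian_kernel(Z1, Z2, rho=1.0):
    """Gaussian kernel: K(z1, z2) = exp(-||z1 - z2||^2 / rho) for all row pairs."""
    sq_dist = (
        np.sum(Z1**2, axis=1)[:, None]
        + np.sum(Z2**2, axis=1)[None, :]
        - 2.0 * Z1 @ Z2.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / rho)

# Made-up data: 6 subjects, 3 gene expressions each.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(6, 3))

K_poly = polynomial_kernel(Z, Z, rho=1.0, d=2)   # 6 x 6 kernel matrix
K_gauss = gaussian_kernel(Z, Z, rho=2.0)         # 6 x 6 kernel matrix
```

Both matrices are symmetric, and the Gaussian kernel matrix has ones on its diagonal because each point has zero distance to itself.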

3. LSKM Estimation in the Semiparametric Model

Assume $h(\cdot) \in \mathcal{H}_K$, the function space generated by a kernel function K(· , ·). Estimation of $\beta$ and h(·) in (1) proceeds by maximizing the scaled penalized likelihood function

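  $$J(h, \beta) = -\frac{1}{2}\sum_{i=1}^{n}\{y_i - x_i^{T}\beta - h(z_i)\}^{2} - \frac{1}{2}\lambda\|h\|^{2}_{\mathcal{H}_K}, \tag{2}$$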

where λ is a tuning parameter which controls the tradeoff between goodness of fit and complexity of the model. When λ= 0, the model interpolates the gene expression data, whereas when λ=∞, the model reduces to a simple linear model without h(·).

By the Representer theorem (Kimeldorf and Wahba, 1970), the general solution for the nonparametric function h(·) in (2) can be expressed as

  $$h(\cdot) = \sum_{i=1}^{n}\alpha_i K(\cdot, z_i), \tag{3}$$

where α= (α1, … , αn)T are unknown parameters. Substituting (3) back into (2) we have

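  $$J(\beta, \alpha) = -\frac{1}{2}\sum_{i=1}^{n}\Big\{y_i - x_i^{T}\beta - \sum_{j=1}^{n}\alpha_j K(z_i, z_j)\Big\}^{2} - \frac{1}{2}\lambda\alpha^{T}K\alpha, \tag{4}$$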

where K is an n × n matrix whose (i, j)th element is K(zi, zj). Differentiating $J(\beta, \alpha)$ with respect to β and α, some calculations give

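  $$\hat\beta = \{X^{T}(I + \lambda^{-1}K)^{-1}X\}^{-1}X^{T}(I + \lambda^{-1}K)^{-1}y, \tag{5}$$
  $$\hat\alpha = \lambda^{-1}(I + \lambda^{-1}K)^{-1}(y - X\hat\beta), \tag{6}$$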

where X= (xT1, … , xTn)T and y= (y1, … , yn)T. Plugging (6) into (3), we have that the function h(·) evaluated at the design points (z1, … , zn)T is estimated as

  $$\hat h = K\hat\alpha = \lambda^{-1}K(I + \lambda^{-1}K)^{-1}(y - X\hat\beta). \tag{7}$$

Using (3) and (6), the estimate $\hat h(z)$ at an arbitrary z is

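  $$\hat h(z) = \sum_{i=1}^{n}\hat\alpha_i K(z, z_i) = \lambda^{-1}\{K(z, z_1), \ldots, K(z, z_n)\}(I + \lambda^{-1}K)^{-1}(y - X\hat\beta). \tag{8}$$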

Equivalently, if $h(z) = \phi(z)^{T}\omega$, where $\{\phi_j(z)\}$ are orthogonal basis functions, the corresponding LSKM regression coefficients $\hat\omega$ are

  • image(9)
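
The closed-form LSKM solution for a fixed λ can be sketched in a few lines of NumPy. The sketch below assumes the penalized criterion has residual sum of squares weighted by 1/2 and penalty weighted by λ/2, so that the first-order conditions are $(K + \lambda I)\alpha = y - X\beta$ and $X^{T}(y - X\beta - K\alpha) = 0$. The function names `gaussian_kernel` and `lskm_fit`, the toy data, and the values of λ and ρ are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(Z1, Z2, rho):
    """K(z1, z2) = exp(-||z1 - z2||^2 / rho), evaluated for all row pairs."""
    diff = Z1[:, None, :] - Z2[None, :, :]
    return np.exp(-np.sum(diff**2, axis=2) / rho)

def lskm_fit(y, X, K, lam):
    """Closed-form LSKM estimates of beta, alpha, and h at the design points."""
    n = len(y)
    M = np.linalg.inv(np.eye(n) + K / lam)            # (I + lam^{-1} K)^{-1}
    beta = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)  # weighted least squares for beta
    alpha = M @ (y - X @ beta) / lam                  # dual coefficients
    h = K @ alpha                                     # h evaluated at z_1, ..., z_n
    return beta, alpha, h

# Toy data (all values are made up for illustration).
rng = np.random.default_rng(1)
n, p = 40, 3
Z = rng.uniform(size=(n, p))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = (X @ np.array([0.5, 1.0]) + np.sin(3 * Z[:, 0]) * Z[:, 1]
     + 0.3 * rng.standard_normal(n))

K = gaussian_kernel(Z, Z, rho=1.0)
beta_hat, alpha_hat, h_hat = lskm_fit(y, X, K, lam=0.1)

# Prediction of h at new points uses the kernel between new and design points.
Z_new = rng.uniform(size=(5, p))
h_new = gaussian_kernel(Z_new, Z, rho=1.0) @ alpha_hat
```

Forming the explicit inverse is acceptable for small n such as this; for larger problems a Cholesky-based solve would be the more usual choice.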

The kernel function K(· , ·) usually depends on an unknown parameter ρ, such as the scale parameter in the Gaussian kernel. Inference on $\hat\beta$ and $\hat h(\cdot)$ depends on λ, ρ, and the residual variance σ2, which need to be estimated. Cross-validation can be used to estimate λ; however, it is often computationally intensive. Little literature is available on the systematic estimation of ρ and σ2. In the machine learning literature, ρ is often preset at some fixed value. Further, estimation of σ2 needs to properly account for the loss of degrees of freedom from estimating β and h(·). Hence it is desirable to develop a systematic method to estimate these parameters simultaneously. We accomplish this by establishing a connection between LSKM and linear mixed models.

4. LSKMs and Linear Mixed Models

4.1 Connection Between LSKMs and Linear Mixed Models

Linear mixed models have commonly been used for analyzing longitudinal and hierarchical data (Harville, 1977; Laird and Ware, 1982). A connection between smoothing splines and linear mixed models has been established (Speed, 1991; Wang, 1998; Zhang et al., 1998). We show here that the LSKM estimator in model (1) corresponds to the best linear unbiased predictor (BLUP) estimator from a linear mixed model, and the regularization parameters (τ, ρ) and the residual variance σ2 can be treated as variance components and estimated simultaneously using restricted maximum likelihood (REML).

To see this connection, simple calculations show that $\hat\beta$ and $\hat h$ from equations (5) and (7) can be equivalently obtained from the equations

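  $$\begin{bmatrix} X^{T}R^{-1}X & X^{T}R^{-1} \\ R^{-1}X & R^{-1} + (\tau K)^{-1} \end{bmatrix}\begin{bmatrix} \beta \\ h \end{bmatrix} = \begin{bmatrix} X^{T}R^{-1}y \\ R^{-1}y \end{bmatrix}, \tag{10}$$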

where $R = \sigma^{2}I$ and $\tau = \lambda^{-1}\sigma^{2}$. Equation (10) corresponds exactly to the normal equation of the linear mixed model

  $$y = X\beta + h + e, \tag{11}$$

where $\beta$ is a $q \times 1$ vector of regression coefficients, $h$ is an $n \times 1$ vector of random effects with distribution $N(0, \tau K)$, and $e \sim N(0, \sigma^{2}I)$. A comparison of (11) with model (1) indicates that they have exactly the same form except that $h$ is now treated as random effects. It follows that the BLUPs of the regression coefficients $\hat\beta$ and the random effects $\hat h$ under the linear mixed model (11) correspond to the LSKM estimator given in Section 3. In fact, one can easily see that the regression coefficient estimator $\hat\beta$ in (5) is the weighted least-squares estimator under the linear mixed model representation (11), using the marginal covariance of $y$ under (11), $V = \sigma^{2}I + \tau K$; that is, $\hat\beta = (X^{T}V^{-1}X)^{-1}X^{T}V^{-1}y$.
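
The correspondence can be checked numerically. The sketch below computes the weighted least-squares estimate of β and the BLUP of h under the mixed model with $V = \sigma^{2}I + \tau K$, and compares them with the LSKM estimates at $\lambda = \sigma^{2}/\tau$. The data, the Gaussian kernel with ρ = 1, and the values of τ and σ2 are made up for illustration.

```python
import numpy as np

# Toy data (assumed values, not from the article).
rng = np.random.default_rng(2)
n = 30
Z = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, -0.5]) + np.cos(4 * Z[:, 0]) + 0.5 * rng.standard_normal(n)

# Gaussian kernel matrix with rho = 1.
diff = Z[:, None, :] - Z[None, :, :]
K = np.exp(-np.sum(diff**2, axis=2) / 1.0)

tau, sigma2 = 2.0, 0.5
lam = sigma2 / tau            # tau = sigma^2 / lambda

# Mixed-model route: GLS for beta and BLUP for h with V = sigma^2 I + tau K.
V = sigma2 * np.eye(n) + tau * K
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
h_blup = tau * K @ Vinv @ (y - X @ beta_gls)

# LSKM route: closed-form estimates for the same lambda.
M = np.linalg.inv(np.eye(n) + K / lam)
beta_lskm = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
h_lskm = K @ M @ (y - X @ beta_lskm) / lam

max_beta_gap = np.max(np.abs(beta_gls - beta_lskm))
max_h_gap = np.max(np.abs(h_blup - h_lskm))
```

The two routes agree up to floating-point error because $(I + \lambda^{-1}K)^{-1} = \sigma^{2}V^{-1}$ when $\lambda = \sigma^{2}/\tau$.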

The linear mixed model representation of the LSKM in the semiparametric model (1) can also be considered as a Bayesian Gaussian process regression (Schölkopf and Smola, 2002). Note that this Bayesian correspondence is finite-dimensional (Wahba, 1990; Green and Silverman, 1994). It is not strictly equivalent to a continuous Bayesian Gaussian process (Rasmussen and Williams, 2006), because the finite-dimensional representation of h(·) does not lead to a coherent Bayesian model (Green and Silverman, 1994; Tipping, 2001; Sollich, 2002). To see the Bayesian representation, we can treat {h(z)} as a random vector with a Gaussian process (GP) prior, with mean 0 and covariance cov{h(z1), h(z2)} = τK(z1, z2). Note that the positive definiteness of the kernel function K(· , ·) ensures it is a proper covariance function. Now we assume

  • image

One can easily see that under this Bayesian model, the semiparametric model (1) becomes the linear mixed model representation (11). This connection extends the connection between scalar smoothing splines and mixed models and their Bayesian formulations (Wang, 1998; Zhang et al., 1998) to multidimensional regression problems under the kernel machine framework.

The covariances of $\hat\beta$ and $\hat h$ can be calculated in two ways. The first approach is to treat the true h(·) as a fixed unknown function and the variance of $y_i$ as $\sigma^{2}$. Using (5) and (7), the covariances of $\hat\beta$ and $\hat h$ are

  • image(12)
  • image(13)

where $P = V^{-1} - V^{-1}X(X^{T}V^{-1}X)^{-1}X^{T}V^{-1}$ and $K_z = \{K(z, z_1), \ldots, K(z, z_n)\}^{T}$ for an arbitrary $z$. We refer to these as frequentist covariances.

The second approach is to use the linear mixed model representation (11) and treat the true h(·) as a random function following the mean zero Gaussian process with covariance τK(· , ·). The covariances of $\hat\beta$ and $\hat h$ can then be calculated as a byproduct of the covariance of the fixed and random effects of the linear mixed model (11) and are

  • image(14)
  • image(15)

We term these covariances as Bayesian covariances.

4.2 Estimation of the Regularization Parameters and the Residual Variance

We discuss in this section estimation of the regularization parameter τ, the residual variance σ2 and the scale parameter ρ in K(· , ·). Using the mixed model representation of LSKM, we propose to estimate (τ, ρ, σ2) simultaneously by treating them as variance components in the linear mixed model (11) and estimating them using REML.

Specifically, the REML under the linear mixed model (11) can be written as

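  $$\ell_R(\theta) = -\frac{1}{2}\log|V(\theta)| - \frac{1}{2}\log|X^{T}V^{-1}(\theta)X| - \frac{1}{2}(y - X\hat\beta)^{T}V^{-1}(\theta)(y - X\hat\beta), \tag{16}$$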

where θ= (τ, ρ, σ2)T. The score equations of (τ, ρ, σ2) are

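  $$\frac{\partial \ell_R}{\partial \theta_k} = -\frac{1}{2}\mathrm{tr}\Big(P\frac{\partial V}{\partial \theta_k}\Big) + \frac{1}{2}(y - X\hat\beta)^{T}V^{-1}\frac{\partial V}{\partial \theta_k}V^{-1}(y - X\hat\beta) = 0, \quad k = 1, 2, 3, \tag{17}$$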

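where $P = V^{-1} - V^{-1}X(X^{T}V^{-1}X)^{-1}X^{T}V^{-1}$. Let $A$ denote the hat matrix so that $\hat y = X\hat\beta + \hat h = Ay$. Using the identities $V^{-1}(y - X\hat\beta) = Py$ and $P = \{\sigma^{2}\}^{-1}(I - A)$ (Harville, 1977), one can show using equation (17) that $\hat\sigma^{2} = (y - X\hat\beta - \hat h)^{T}(y - X\hat\beta - \hat h)/\{n - \mathrm{tr}(A)\}$. Hence tr(A) represents the loss of degrees of freedom from estimating β and h(·) when estimating $\sigma^{2}$. The covariance of $\hat\theta$ can be estimated using the information matrix of the REML likelihood, whose $(k, l)$th element is $\frac{1}{2}\mathrm{tr}\{P(\partial V/\partial\theta_k)P(\partial V/\partial\theta_l)\}$.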
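
A REML criterion of this kind can be evaluated directly for a Gaussian kernel, as in the sketch below. It uses the standard restricted log-likelihood of a linear mixed model with $V = \sigma^{2}I + \tau K(\rho)$, up to an additive constant. The helper names, the toy data, and the coarse grid search (a crude stand-in for a proper numerical optimizer or mixed model software) are assumptions made for this example.

```python
import itertools
import numpy as np

def gaussian_kernel(Z, rho):
    """Gaussian kernel matrix exp(-||z_i - z_j||^2 / rho) on the design points."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.exp(-np.sum(diff**2, axis=2) / rho)

def reml_loglik(theta, y, X, Z):
    """Restricted log-likelihood (up to a constant) for theta = (tau, rho, sigma2)."""
    tau, rho, sigma2 = theta
    n = len(y)
    V = sigma2 * np.eye(n) + tau * gaussian_kernel(Z, rho)
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    beta = np.linalg.solve(XtVinvX, X.T @ Vinv @ y)   # GLS estimate given theta
    resid = y - X @ beta
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_X = np.linalg.slogdet(XtVinvX)
    return -0.5 * (logdet_V + logdet_X + resid @ Vinv @ resid)

# Toy data (assumed values).
rng = np.random.default_rng(3)
n = 40
Z = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 0.5]) + 2.0 * np.sin(5 * Z[:, 0]) + 0.5 * rng.standard_normal(n)

# Coarse grid search over (tau, rho, sigma2); values are arbitrary.
grid = itertools.product([0.1, 1.0, 10.0], [0.5, 2.0, 8.0], [0.05, 0.25, 1.0])
theta_best = max(grid, key=lambda th: reml_loglik(th, y, X, Z))
```

A useful sanity check is that the criterion is unchanged when `X @ b` is added to `y` for any vector `b`, since REML depends on the data only through contrasts that are free of β.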

4.3 Test for the Nonparametric Function

Because we are interested in the effect of a whole genetic pathway rather than individual genes, it is of significant practical interest to test $H_0 : h(z) = 0$. In the PSA microarray example, this tests for a genetic pathway effect on PSA controlling for the effects of covariates. Assuming $h(\cdot) \in \mathcal{H}_K$, one can easily see from the linear mixed model representation (11) that $H_0 : h(z) = 0$ is equivalent to testing the variance component $\tau$ as $H_0 : \tau = 0$ versus $H_1 : \tau > 0$. Note that the null hypothesis places $\tau$ on the boundary of the parameter space. Because the kernel matrix K is not block diagonal, unlike the standard case considered by Self and Liang (1987), the likelihood ratio statistic for $H_0 : \tau = 0$ does not follow a mixture of $\chi^{2}_{0}$ and $\chi^{2}_{1}$. We consider a score test in this article.

Zhang and Lin (2002) proposed a score test for H0 : τ = 0 to compare a polynomial model with a smoothing spline. For smoothing splines, K(· , ·) does not depend on any unknown parameter. By contrast, a general kernel function K(· , ·) in LSKM might depend on an unknown scale parameter ρ. One can easily see from the linear mixed model (11) that under H0 : τ = 0, the kernel matrix K disappears, and hence the scale parameter ρ disappears and becomes inestimable.

Davies (1987) studied the problem of a parameter disappearing under H0 and proposed a score test by treating the score statistic as a Gaussian process indexed by the nuisance parameter and then obtaining an upper bound to approximate the p-value of the score test. This approach, however, does not work for our setting due to the unboundedness of the parameter space.

We here propose to test $H_0 : \tau = 0$ using the score test by fixing $\rho$, varying its value, and examining the sensitivity of the score test to $\rho$. The REML version of the score statistic of $\tau$ under $H_0 : \tau = 0$ can be written as inline image, where $\hat\beta$ and $\hat\sigma^{2}$ are the MLEs of $\beta$ and $\sigma^{2}$ under the linear model $y_i = x_i^{T}\beta + e_i$, the model under $H_0$, $P_0 = I - X(X^{T}X)^{-1}X^{T}$, and

  • image

which is a quadratic function of y and follows a mixture of chi-squares under H0.

Following Zhang and Lin (2002), for each fixed $\rho$, we use the Satterthwaite method to approximate the distribution of $Q_\tau(\cdot; \rho)$ by a scaled chi-square distribution $\kappa\chi^{2}_{\nu}$, where the scale parameter $\kappa$ and the degrees of freedom $\nu$ are calculated by equating the mean and variance of $Q_\tau(\cdot; \rho)$ and those of $\kappa\chi^{2}_{\nu}$. Specifically, one can show that inline image and inline image, where inline image, and inline image. Computation of the proposed score test is quite simple, because one only needs to fit the simple linear model $y_i = x_i^{T}\beta + e_i$. We evaluate the performance of the score test using simulations.
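
The moment-matching step can be sketched as follows. This is an illustration under stated assumptions rather than the article's exact formulas: it takes the quadratic form to be $Q = (y - X\hat\beta)^{T}K(y - X\hat\beta)/(2\hat\sigma^{2})$ and computes its null mean and variance as if $\sigma^{2}$ were known, so any adjustment for estimating $\sigma^{2}$ is omitted. The function names and the toy data are hypothetical.

```python
import numpy as np

def satterthwaite_moments(X, K):
    """Match mean and variance of Q to kappa * chi2_nu, treating sigma^2 as known."""
    n = X.shape[0]
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # projection under H0
    P0K = P0 @ K
    mean_q = 0.5 * np.trace(P0K)                # E(Q) under H0
    var_q = 0.5 * np.trace(P0K @ P0K)           # var(Q) under H0
    kappa = var_q / (2.0 * mean_q)              # from kappa * nu = mean
    nu = 2.0 * mean_q**2 / var_q                # and 2 * kappa^2 * nu = variance
    return kappa, nu

def score_quadratic_form(y, X, K):
    """Q computed from the residuals of the null linear model (assumed form)."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta0
    sigma2_0 = resid @ resid / len(y)           # MLE of sigma^2 under H0
    return resid @ K @ resid / (2.0 * sigma2_0)

# Toy data (assumed values).
rng = np.random.default_rng(4)
n = 50
Z = rng.uniform(size=(n, 3))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)

diff = Z[:, None, :] - Z[None, :, :]
K = np.exp(-np.sum(diff**2, axis=2) / 3.0)      # Gaussian kernel, rho fixed at 3

kappa, nu = satterthwaite_moments(X, K)
Q = score_quadratic_form(y, X, K)
scaled_stat = Q / kappa    # compared with a chi-square with nu degrees of freedom
```

The p-value would then be the upper tail probability of a chi-square distribution with ν degrees of freedom at `scaled_stat`; repeating the computation over several fixed values of ρ gives the sensitivity analysis described above.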

5. Model Selection within the Kernel Machine Framework

The kernel machine method requires a kernel function to be explicitly specified. Section 2.2 provides wide choices of kernel functions. A question of substantial interest is which kernel function to choose. This kernel selection problem has much broader implications. We consider two types of kernel selection problems. The first is to choose between different parametric and nonparametric models with different smoothness properties. The second problem involves variable selection.

As stated in Section 2.2, a kernel function fully specifies a function space $\mathcal{H}_K$ where the unknown function h(·) resides. Hence this function space determines the type of models used to fit h(·). For example, a dth-degree polynomial kernel specifies a parametric model with dth-order monomials; the kernel $K(z_1, z_2) = \int_0^1 (z_1 - u)_{+}(z_2 - u)_{+}\,du$ specifies a cubic smoothing spline model (Wahba, 1990); and the Gaussian kernel assumes an infinitely smooth function. It is therefore clear that model selection within the kernel machine framework is in fact a special case of kernel selection.

Variable selection can also be treated as a kernel selection problem within the kernel machine framework. For example, let $z_p$ be a $p$-dimensional vector and $z_{p'}$ a $p'$-dimensional sub-vector of $z_p$ with $p' < p$. Then two kinds of kernel functions can be specified: one based on $z_p$ and another based on $z_{p'}$. The unknown function can then be fitted separately based on each kernel. If the fitted curves are not “far away” from each other, then the model using $z_{p'}$ provides an equally good but more parsimonious fit than that using $z_p$. This demonstrates that variable selection is also a special case of kernel selection.

These discussions show that model selection is an important topic within the kernel machine framework. However, little work has been done in this area. We propose AIC and BIC as kernel selection criteria within the kernel machine framework. Equations (5) and (7) show that the estimated response $\hat y$ can be expressed as $\hat y = X\hat\beta + \hat h = Ay$, where $A = (I + \lambda^{-1}K)^{-1}[\lambda^{-1}K + X\{X^{T}(I + \lambda^{-1}K)^{-1}X\}^{-1}X^{T}(I + \lambda^{-1}K)^{-1}]$ is the LSKM smoothing matrix. Let $r = \mathrm{trace}(A)$ be the degrees of freedom of the kernel machine smoother $A$. We define the least-squares kernel machine (KM) AIC and BIC as

  • image

where inline image. Models with smaller KM_AIC/KM_BIC values are preferred.
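
The smoothing matrix and its trace can be computed directly, as in the sketch below. It covers only the matrix A and the degrees of freedom r that enter the criteria; the AIC and BIC formulas themselves are not reproduced. The function name `lskm_smoother`, the toy data, and the values of λ and ρ are assumptions.

```python
import numpy as np

def lskm_smoother(X, K, lam):
    """LSKM smoothing matrix A with y_hat = A y, for a fixed lambda."""
    n = X.shape[0]
    M = np.linalg.inv(np.eye(n) + K / lam)              # (I + lam^{-1} K)^{-1}
    B = X @ np.linalg.solve(X.T @ M @ X, X.T @ M)       # X {X^T M X}^{-1} X^T M
    return M @ (K / lam + B)

# Toy data (assumed values).
rng = np.random.default_rng(5)
n = 35
Z = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 0.5]) + np.sin(4 * Z[:, 1]) + 0.4 * rng.standard_normal(n)

diff = Z[:, None, :] - Z[None, :, :]
K = np.exp(-np.sum(diff**2, axis=2) / 1.0)              # Gaussian kernel, rho = 1

lam = 0.5
A = lskm_smoother(X, K, lam)
y_hat = A @ y
r = np.trace(A)        # degrees of freedom of the kernel machine smoother
rss = np.sum((y - y_hat) ** 2)
```

Because A reproduces the columns of X exactly (AX = X), r is at least q, the number of parametric covariates, and it grows toward n as λ decreases.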

6. Application to the Prostate Cancer Genetic Pathway Data

We applied the proposed semiparametric model to the analysis of the prostate cancer genetic pathway data described in Section 1. The data set contained 59 patients who were clinically diagnosed with local or advanced prostate cancer. The objective of the study was to evaluate whether a genetic pathway has an overall effect on PSA after adjusting for covariates. We focus in this article on the cell growth pathway, which contains five genes. The outcome was pre-surgery PSA level. A log transformation was performed to make the normality assumption plausible. The two covariates were age and Gleason score, a well-established histological grading system for prostate cancer.

The semiparametric model (1) provides a convenient framework to evaluate the effect of the cell growth pathway on PSA by allowing for complicated interactions among the genes within the pathway. Specifically, we consider the model

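  $$\log(\mathrm{PSA}) = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{gleason} + h(\mathrm{gene}_1, \ldots, \mathrm{gene}_5) + e, \tag{18}$$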

where h(·) is a nonparametric function and $e \sim N(0, \sigma^{2})$. We fit this model using the LSKM method via the linear mixed model representation (11), using the Gaussian kernel in estimating h(·). Under the linear mixed model representation, we estimated $(\beta_0, \beta_1, \beta_2)$ and h(·) using BLUPs, and estimated the smoothing parameter $\tau$, the kernel parameter $\rho$, and the residual variance $\sigma^{2}$ simultaneously using REML. The results are presented in Table 1 and indicate that Gleason score was highly significant, while age was not.

Table 1. Parameter estimates of the semiparametric model and the score test for the genetic pathway effect for the PSA data using the LSKM via the linear mixed model representation

Parameter estimates
Covariate    Estimate    SE         p-value
Intercept    −1.7722      1.1915    0.1425
Age           0.0177      0.0114    0.1259
Gleason       0.4461      0.1055    0.0001
τ             2.8182      3.7720    ·
ρ             6.3635     13.5708    ·
σ2            0.3712      0.0816    0.001

Score test for the genetic pathway effect H0 : h(z) = 0
ρ     Test statistic    ν         p-value
3     31.010            14.924    0.0085
5     28.750            11.223    0.0028
10    26.598             8.295    0.0010
30    23.264             5.970    0.0007

We tested for the cell growth pathway effect on PSA, $H_0 : h(z) = 0$ versus $H_1 : h(z) \in \mathcal{H}_K$, using the score test described in Section 4.3. Table 1 gives the score test statistics and p-values for a range of ρ values. The p-values are not sensitive to the choice of ρ and range from 0.0007 to 0.0085, suggesting a strong cell growth pathway effect on PSA.

Even though the five genes are believed to function together biologically, it is of interest to investigate whether there are a small number of relatively important genes in the cell growth pathway that most affect PSA. We investigated this problem using the proposed variable selection method. An all-possible-subset selection procedure of genes was performed using the Gaussian kernel. The kernel machine AIC and BIC proposed in Section 5 were used as the model selection criteria. The result shows that the model with the lowest AIC and BIC values is the one containing genes FGF2 and IGFBP1. The detailed results are given in Web Table 1 in the Supplementary Materials. These two genes can be studied further in laboratory settings to explore their detailed relationship with PSA.

7. Simulation Studies

7.1 Simulation Study for the Parameter Estimates

We conducted a simulation study to evaluate the performance of the proposed LSKM estimation method for the semiparametric model (1) by fitting the linear mixed model (11). We considered the following model

  $$y_i = x_i\beta + h(z_{i1}, \ldots, z_{ip}) + e_i, \tag{19}$$

where $e_i \sim N(0, 1)$. To allow $x_i$ and $(z_{i1}, \ldots, z_{ip})$ to be correlated, $x_i$ was generated as $x_i = 3\cos(z_{i1}) + 2u_i$, with $u_i$ independent of $z_{i1}$ and following $N(0, 1)$; the $z_{ij}$ ($j = 1, \ldots, p$) were generated from Uniform(0, 1). The nonparametric function h(·) was allowed to have a complex form with nonlinear functions of the z's and interactions among the z's. In our simulations, we first fit the model using the same set of z's as that in the true model. In practice, without advance knowledge, the true set of z's is often unknown, and the set of z's that is used might be larger than the true set and contain some noisy z's that are irrelevant to the outcome y. To mimic such a scenario, in the second set of simulations, we added some noisy z's to the set of z's and fit (19).

We considered four configurations by varying n (the sample size) and p (the number of covariates z's). For each setting, only the Gaussian kernel is used and 300 simulations were run.

Setting 1: $n = 60$, $p = 5$, true $h(z) = 10\cos(z_1) - 15z_2^{2} + 10\exp(-z_3)z_4 - 8\sin(z_5)\cos(z_3) + 20z_1z_5$. Fit the model with the five true z's. This setting mimics the PSA data.

Setting 2: n= 100, p= 8, h(·) is the same as setting 1. Fit the model (19) by including 3 additional irrelevant z6, z7, z8 besides the true z1, … , z5.

Setting 3: $n = 200$, $p = 10$, true $h(z_1, \ldots, z_{10}) = 10\cos(z_1) - 15z_2^{2} + 10\exp(-z_3)z_4 - 8\sin(z_5)\cos(z_3) + 20z_1z_5 + 9z_6\sin(z_7) - 8\cos(z_6)z_7 + 20z_8\sin(z_9)\sin(z_{10}) - 15z_8^{3} - 10z_8z_9 - \exp(z_{10})\cos(z_{10})$. Fit the model assuming these 10 true z's are used.

Setting 4: n= 300, p= 15, h(·) is the same as that in setting 3. Fit the model with additional 5 irrelevant noisy predictors z11, … , z15 besides the true z1, … , z10.
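
A data-generating sketch for Setting 1 is given below. It follows the description above (uniform z's, a covariate x correlated with z1, true β = 1, unit error variance); the function names and the seed are assumptions, and h is used exactly as written in Setting 1 without any centering.

```python
import numpy as np

def true_h_setting1(Z):
    """True h(z) of Setting 1, evaluated row-wise on an (n, 5) array."""
    z1, z2, z3, z4, z5 = Z.T
    return (10 * np.cos(z1) - 15 * z2**2 + 10 * np.exp(-z3) * z4
            - 8 * np.sin(z5) * np.cos(z3) + 20 * z1 * z5)

def simulate_setting1(n=60, beta=1.0, seed=0):
    """Generate one data set: y_i = x_i * beta + h(z_i) + e_i with e_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0.0, 1.0, size=(n, 5))      # z_ij ~ Uniform(0, 1)
    u = rng.standard_normal(n)                  # u_i ~ N(0, 1), independent of z
    x = 3 * np.cos(Z[:, 0]) + 2 * u             # covariate correlated with z_1
    e = rng.standard_normal(n)
    h = true_h_setting1(Z)
    y = x * beta + h + e
    return y, x, Z, h

y, x, Z, h = simulate_setting1()
```

Appending columns of independent Uniform(0, 1) draws to `Z` before fitting would mimic the noisy-predictor scenarios of Settings 2 and 4.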

The point estimate results are presented in Table 2. Because it is difficult to graphically display the fitted value of h(·) as a function of z, we summarized the goodness of fit of h(·) in the following way. For each simulated data set, we regressed the true $h$ on the fitted $\hat h$, both evaluated at the design points. We then empirically summarized the goodness of fit of $\hat h$ by reporting the average intercepts, slopes, and $R^{2}$'s obtained from these regressions over the 300 simulations. If the intercept from this regression is close to zero, the slope is close to one, and $R^{2}$ is close to one, this provides empirical evidence that the estimated multidimensional function $\hat h(\cdot)$ is close to the true function.
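
The goodness-of-fit summary described above amounts to a simple linear regression of the true h on the fitted values. A minimal sketch follows; the function name and the synthetic inputs are assumptions used only to exercise it.

```python
import numpy as np

def regression_summary(h_true, h_hat):
    """Regress true h on fitted h_hat; return intercept, slope, and R^2."""
    slope, intercept = np.polyfit(h_hat, h_true, 1)   # degree-1 least squares
    fitted = intercept + slope * h_hat
    ss_res = np.sum((h_true - fitted) ** 2)
    ss_tot = np.sum((h_true - np.mean(h_true)) ** 2)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Synthetic check: a noisy version of the truth stands in for a fitted h_hat.
rng = np.random.default_rng(6)
h_true = np.linspace(-5.0, 5.0, 60)
h_hat = h_true + 0.2 * rng.standard_normal(60)

intercept, slope, r2 = regression_summary(h_true, h_hat)
```

Averaging the three returned values over the simulated data sets gives the kind of summary reported in the last three columns of Table 2.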

Table 2. Simulation results of estimated regression coefficients β and the nonparametric function h(·) in model y = xβ + h(z) + e based on 300 runs. True β = 1 and true σ2 = 1

                                  Model parameter estimates            Reg of h on fitted h
Setting  True #z  Used #z  n      β      σ2     ρ                      Intercept  Slope  R2
1        5        5        60     1.00   0.96     5.34a (estimated)    −0.04      1.00   0.99
                           100    1.01   0.96     7.24 (estimated)     −0.01      1.00   0.99
                           100    1.00   0.92     1.00 (fixed)         −0.01      1.00   0.99
                           100    1.00   1.01   100.00 (fixed)         −0.02      1.00   0.99
2        5        8        100    1.05   0.89     6.74 (estimated)      0.16      1.00   0.98
                           100    1.06   0.30     1.00 (fixed)          0.36      0.98   0.97
                           100    1.12   2.15   100.00 (fixed)          0.23      1.01   0.96
3        10       10       200    0.98   0.93    12.83 (estimated)     −0.07      1.00   0.99
                           200    0.92   0.30     1.00 (fixed)         −0.18      0.99   0.98
                           200    0.98   1.15   100.00 (fixed)         −0.04      1.00   0.99
4        10       15       300    1.01   0.82    14.02 (estimated)      0.03      1.00   0.99
                           300    1.01   0.75    10.00 (fixed)          0.02      1.00   0.99
                           300    1.01   1.17   100.00 (fixed)          0.02      1.00   0.99

a Average of the estimated ρ from 300 simulations.

The results in Table 2 show that, when the true set of z's was included in fitting h(·) and all the model parameters {β, h(·), τ, ρ, σ2} were estimated simultaneously, the LSKM method via the mixed model framework performed well in estimating β, h(·) and σ2. However, if the scale parameter ρ in the Gaussian kernel was fixed, which is often done in traditional machine learning, the model estimators could be subject to considerable bias, especially for the estimate of σ2. When ρ was fixed at values close to the estimated one, the bias was small. Because in practice, ρ is unknown, our results suggest it is useful to estimate the scale parameter ρ using the data. When extra irrelevant covariates z's besides the true set of z's were used in fitting h(·), the proposed method still performed well if all model parameters were estimated.
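To make the role of the kernel scale ρ concrete, the following sketch computes the mixed-model estimates of β and h(·) when ρ and the variance components are held fixed. It assumes the usual linear mixed model form in which h has covariance τK and the errors have variance σ2, and it uses the standard BLUP formulas; it does not perform the REML estimation of (τ, ρ, σ2) used in the simulations. The function names and the argument layout are hypothetical.

```python
import numpy as np

def gaussian_kernel(Z, rho):
    """n x n Gaussian kernel matrix with entries exp(-||z_i - z_j||^2 / rho)."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / rho)  # clip tiny negative round-off

def lskm_blup(y, X, Z, rho, tau, sigma2):
    """BLUP-type estimates of beta and h for fixed (rho, tau, sigma2).

    Assumed working model: y = X beta + h + e, h ~ N(0, tau K), e ~ N(0, sigma2 I).
    """
    n = len(y)
    K = gaussian_kernel(Z, rho)
    V = tau * K + sigma2 * np.eye(n)          # marginal covariance of y
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)   # generalized least squares
    h_hat = tau * K @ np.linalg.solve(V, y - X @ beta)   # predictor of h at the design points
    return beta, h_hat
```

Calling `lskm_blup` with several fixed values of `rho` on the same data gives a direct way to see how the fit changes with the kernel scale, which is the comparison summarized in the "(fixed)" rows of Table 2.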

Table 3 compares the estimated standard errors of $\hat{\beta}$ using the frequentist method (12) and the Bayesian method (14) with the empirical ones. The results show that both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. Table 3 also compares the estimated standard errors of $\hat{h}$ (including intercept) using the frequentist method (13) and the Bayesian method (15) with the empirical standard errors. For ease of presentation, for each setting, we averaged the SE estimates across all the grid points and present these averages. The results show that when the scale parameter ρ was estimated, both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. When the scale parameter was fixed, the Bayesian and frequentist SEs were still close to each other but could be quite different from the empirical SEs. These results further indicate that it is useful to estimate the scale parameter ρ in practice.

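Table 3. Simulation study results of standard error estimates of $\hat{\beta}$ and $\hat{h}$ in model y = xβ + h(z) + e based on 300 simulations.

SEs of $\hat{\beta}$

| Setting | True #z | Used #z | n | Empirical SE | Bayesian SE | Frequentist SE | ρ |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 5 | 60 | 0.088 | 0.088 | 0.083 | 5.34 (estimated) |
| | | | 100 | 0.054 | 0.057 | 0.055 | 7.24 (estimated) |
| | | | 100 | 0.062 | 0.066 | 0.058 | 1.00 (fixed) |
| | | | 100 | 0.055 | 0.056 | 0.055 | 100.00 (fixed) |
| 2 | 5 | 8 | 100 | 0.066 | 0.065 | 0.058 | 6.74 (estimated) |
| | | | 100 | 0.070 | 0.078 | 0.034 | 1.00 (fixed) |
| | | | 100 | 0.082 | 0.081 | 0.078 | 100.00 (fixed) |
| 3 | 10 | 10 | 200 | 0.044 | 0.047 | 0.042 | 12.83 (estimated) |
| | | | 200 | 0.050 | 0.077 | 0.024 | 1.00 (fixed) |
| | | | 200 | 0.041 | 0.047 | 0.045 | 100.00 (fixed) |
| 4 | 10 | 15 | 300 | 0.039 | 0.042 | 0.033 | 14.02 (estimated) |
| | | | 300 | 0.039 | 0.044 | 0.032 | 10.00 (fixed) |
| | | | 300 | 0.037 | 0.041 | 0.039 | 100.00 (fixed) |

SEs of $\hat{h}$

| Setting | True #z | Used #z | n | Empirical SE | Bayesian SE | Frequentist SE | ρ |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 5 | 60 | 0.635 | 0.662 | 0.601 | 5.34 (estimated) |
| | | | 100 | 0.482 | 0.515 | 0.464 | 7.24 (estimated) |
| | | | 100 | 0.614 | 0.664 | 0.576 | 1.00 (fixed) |
| | | | 100 | 0.458 | 0.470 | 0.456 | 100.00 (fixed) |
| 2 | 5 | 8 | 100 | 0.662 | 0.683 | 0.604 | 6.74 (estimated) |
| | | | 100 | 0.933 | 0.540 | 0.449 | 1.00 (fixed) |
| | | | 100 | 0.741 | 0.731 | 0.645 | 100.00 (fixed) |
| 3 | 10 | 10 | 200 | 0.606 | 0.667 | 0.583 | 12.83 (estimated) |
| | | | 200 | 0.954 | 0.541 | 0.450 | 1.00 (fixed) |
| | | | 200 | 0.559 | 0.630 | 0.596 | 100.00 (fixed) |
| 4 | 10 | 15 | 300 | 0.712 | 0.721 | 0.636 | 14.02 (estimated) |
| | | | 300 | 0.737 | 0.717 | 0.634 | 10.00 (fixed) |
| | | | 300 | 0.632 | 0.732 | 0.684 | 100.00 (fixed) |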

7.2 The Simulation Study for the Score Test

We next conducted a simulation study to evaluate the performance of the proposed variance component score test for H0 : h(·) = 0 versus inline image. The true model is the same as (19), where x and z's were generated in the same way as that in Section 6.1 and inline image and a= 0, 0.2, 0.4, 0.6, 0.8, 1. We studied the size of the test by generating data under a= 0, and studied the power by increasing a. The kernel parameter ρ was fixed at a wide range of values: 0.5, 1, 5, 10, 25, 50, 100, 200. The sample size was 60, mimicking the PSA data example. For the size calculations, the number of simulations was 2000, whereas for the power calculations, the number of runs was 1000.

Table 4 reports the empirical size (a = 0) and power (a > 0) of the variance component score test for H0. The results show that the size of the test was very close to the nominal value 0.05 and was not sensitive to the choice of the scale parameter ρ. As a increased, the power quickly approached 1. The power was not much affected by the value of ρ if a moderate ρ was specified, but was more affected if a large value of ρ was specified.

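Table 4. Simulation results for the score test for H0 : h(z) = 0. The a = 0 column gives the empirical size; the remaining columns give the empirical power.

| Scale ρ | a = 0 | a = 0.2 | a = 0.4 | a = 0.6 | a = 0.8 | a = 1.0 |
|---|---|---|---|---|---|---|
| 0.5 | 0.050 | 0.158 | 0.487 | 0.865 | 0.989 | 1.000 |
| 1 | 0.047 | 0.137 | 0.509 | 0.869 | 0.991 | 1.000 |
| 5 | 0.050 | 0.127 | 0.482 | 0.865 | 0.987 | 1.000 |
| 25 | 0.051 | 0.139 | 0.484 | 0.886 | 0.990 | 1.000 |
| 50 | 0.046 | 0.138 | 0.508 | 0.863 | 0.990 | 1.000 |
| 100 | 0.048 | 0.134 | 0.497 | 0.867 | 0.988 | 1.000 |
| 200 | 0.054 | 0.148 | 0.494 | 0.874 | 0.991 | 1.000 |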
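When reading the size column, it helps to know how much Monte Carlo noise to expect from a finite number of runs. The sketch below computes an empirical rejection rate from a list of p-values and the binomial standard error of such a rate. The function names are hypothetical, and the p-values themselves would come from whatever implementation of the score test is used.

```python
import math

def empirical_rate(p_values, alpha=0.05):
    """Fraction of simulated p-values below the nominal level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

def mc_standard_error(rate, n_runs):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_runs)

# Size simulations: nominal level 0.05 with 2000 runs.
se_size = mc_standard_error(0.05, 2000)
print(round(se_size, 4))
```

With 2000 runs and a true size of 0.05 the standard error is about 0.005, so each reported size in Table 4 (0.046 to 0.054) lies within two standard errors of the nominal level.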

7.3 The Simulation Study for Kernel Selection

A simulation study was also conducted to assess the performance of kernel selection using the kernel machine AIC and BIC criteria. The true model we considered is

  • image

where $e \sim N(0, 1)$ and x was generated as x = 3 cos(z1) + 2u, with u independent of z1. All u and zj (j = 1, … , 5) were generated from N(0, 1). The sample size was 50, and the number of runs was 300. Three types of kernel functions were used in the simulation: the Gaussian kernel $K(u, v) = \exp(-\|u - v\|^2/\rho)$, the second-degree polynomial kernel $K(u, v) = (u^T v + 1)^2$, and the first-degree polynomial kernel $K(u, v) = u^T v$, which corresponds to ridge regression. For each simulated data set, the AIC and the BIC were calculated for the model under each of the three kernels.
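The three kernels compared here are straightforward to compute as matrices. A minimal sketch, assuming the rows of the input arrays are observations; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(U, V, rho):
    """Gaussian kernel: K(u, v) = exp(-||u - v||^2 / rho)."""
    d2 = (np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :]
          - 2.0 * U @ V.T)
    return np.exp(-np.maximum(d2, 0.0) / rho)  # clip tiny negative round-off

def poly2_kernel(U, V):
    """Second-degree polynomial kernel: K(u, v) = (u'v + 1)^2."""
    return (U @ V.T + 1.0) ** 2

def linear_kernel(U, V):
    """First-degree polynomial kernel: K(u, v) = u'v (ridge regression)."""
    return U @ V.T
```

Each function returns the matrix of kernel values between the rows of `U` and the rows of `V`; passing the same array twice gives the n × n kernel matrix at the design points.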

The mean AIC and BIC across 300 simulations for the Gaussian kernel are 190.79 (51.31) and 284.21 (50.21), respectively (the numbers within parentheses are standard deviations); those for the second-degree polynomial kernel are 269.07 (10.00) and 308.91 (9.58), respectively; and those for the ridge regression are 363.67 (2.63) and 371.61 (2.51), respectively. The AIC and BIC values from each simulated data set are plotted in Figures 1 and 2. These results show that the kernel machine AIC and BIC of the model with Gaussian kernel are the smallest, whereas those of ridge regression are the largest. Hence the Gaussian kernel is preferred to both the second-degree polynomial kernel and the ridge regression kernel, which is desired in light of the complicated functional forms of the x's.

Figure 1. Simulation result of model selection using KMAIC.

Figure 2. Simulation result of model selection using KMBIC.

8. Discussion

In this article, we have developed the LSKM method for semiparametric regression with Gaussian outcomes, where we model the covariate effects parametrically and the genetic pathway effect parametrically or nonparametrically. The kernel machine method does not require an explicit analytical specification of the smoothness conditions on the nonparametric function and unifies the model building procedure in both one- and multiple-dimensional settings. Therefore, it is a more general and flexible method for multidimensional smoothing.

A key contribution of this article is that we have established a close connection between kernel machine methods and linear mixed models and all the model parameters can be estimated within the unified linear mixed model framework. This mixed model connection greatly facilitates the estimation and inference for multidimensional nonparametric regressions and can be easily implemented using familiar statistical software such as SAS PROC MIXED or Splus NLME.

We proposed a score test for the genetic pathway effect. This can be easily implemented using existing software. Although it requires fixing the scale parameter ρ, our results show that the test is not sensitive to the choice of ρ and has good performance. Alternatively, a Bayesian approach, such as the one proposed by Chen and Dunson (2003), might be used. This method has the advantage that, with proper prior specifications, there is no need to fix the scale parameter. However, its theoretical properties are unknown. It is of further research interest to study the performance of this Bayesian method and to develop better frequentist methods of testing τ in the kernel machine setting.

Kernel selection within the kernel machine framework is an important and complicated problem. It includes model selection and variable selection as special cases. In this article we propose to use kernel machine AIC/BIC as kernel selection criteria. Our simulation results show that AIC and BIC perform well. Further research is still needed to examine their theoretical properties in detail before they can be adopted as universal criteria.

We have considered in this article a single nonparametric function of multi-dimensional covariates. One could generalize the proposed semiparametric model to incorporate multiple multi-dimensional nonparametric functions. For example, if one is interested in modeling multiple genetic pathway effects, one could consider a semiparametric additive model

  $y = x\beta + h_1(z_1) + \cdots + h_m(z_m) + e,$

where zj(j= 1, …, m) denotes a pj× 1 vector of genes in the jth pathway and hj(·) denotes the nonparametric function associated with the jth genetic pathway.

Machine learning is an emerging area of research in statistics. The field has developed rapidly in the past decade, driven mainly by computer scientists dealing with multi-dimensional data. It has shown increasing promise and wide applications in biomedical research, especially in bioinformatics. These techniques, however, are somewhat disconnected from well-established biostatistical methods. Our effort to establish a close connection between LSKMs and linear mixed models is an attempt to build a bridge between kernel machines, which are familiar to computer scientists but less familiar to biostatisticians, and mixed models. This connection opens a door for adopting other well-established statistical techniques used in mixed models, such as Bayesian approaches, to handle multidimensional data via the machine learning framework. It also opens a new research direction for model/variable selection methods within the kernel machine framework. Such an interface is still in its infancy and has a lot of room for further development.

9. Supplementary Materials

The kernel machine AIC and BIC estimates of models containing all the subsets of genes in the cell growth pathway for the analysis of the prostate cancer data are given in Web Table 1 at the Biometrics website http://www.tibs.org/biometrics.

Acknowledgements

DL and XL's research was supported by a grant from the National Cancer Institute (CA–76404). DG's research was supported by a grant from the National Institutes of Health (GM072007). We thank the associate editor and three reviewers for their helpful comments that have improved the article.

References

  • Buhmann, M. D. (2003). Radial Basis Functions. Cambridge, U.K.: Cambridge University Press.
  • Chen, Z. and Dunson, D. B. (2003). Random effects selection in linear mixed models. Biometrics 59, 762–769.
  • Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge University Press.
  • Davies, R. B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74, 33–43.
  • Dhanasekaran, S. M., Barrette, T. R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K. J., Rubin, M. A., and Chinnaiyan, A. M. (2001). Delineation of prognostic biomarkers in prostate cancer. Nature 412, 822–826.
  • Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151–1160.
  • Fortunel, N. O., Otu, H. H., Ng, H. H., Chen, J., Mu, X., Chevassut, T., Li, X., Joseph, M., et al. (2003). Comment on “ ‘Stemness’: Transcriptional Profiling of Embryonic and Adult Stem Cells” and “A Stem Cell Molecular Signature.” Science 302, 393.
  • Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics 19, 1–141.
  • Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.
  • Goeman, J. J., Oosting, J., Cleton-Jansen, A.-M., Anninga, J. K., and van Houwelingen, H. C. (2005). Testing association of a pathway with survival using gene expression data. Bioinformatics 21, 1950–1957.
  • Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
  • Gu, C. (2002). Smoothing Spline ANOVA Models. New York: Springer.
  • Harville, D. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72, 320–340.
  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. London: Chapman and Hall.
  • Kimeldorf, G. S. and Wahba, G. (1970). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33, 82–95.
  • Laird, N. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38, 963–974.
  • Mootha, V. K., Lindgren, C. M., Eriksson, K. F., et al. (2003). PGC-1alpha responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34, 267–273.
  • Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, Massachusetts: MIT Press.
  • Ruppert, D., Wand, M. P., and Carroll, R. J. (2004). Semiparametric Regression. Cambridge, U.K.: Cambridge University Press.
  • Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. Cambridge, Massachusetts: MIT Press.
  • Self, S. G. and Liang, K. Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions. Journal of the American Statistical Association 82, 605–610.
  • Sollich, P. (2002). Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning 46, 21–52.
  • Speed, T. (1991). Discussion to “BLUP is a good thing: The estimation of random effects” by Robinson, G. K. Statistical Science 6, 15–51.
  • Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., and Mesirov, J. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550.
  • Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002). Least Squares Support Vector Machines. Singapore: World Scientific.
  • Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244.
  • Tusher, V., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5124.
  • Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.
  • Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM Press.
  • Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. London: Chapman and Hall.
  • Wang, Y. (1998). Smoothing spline models with correlated random errors. Journal of the American Statistical Association 93, 341–348.
  • Zhang, D. and Lin, X. (2002). Hypothesis testing in semiparametric additive mixed models. Biostatistics 4, 57–74.
  • Zhang, D., Lin, X., Raz, J., and Sowers, M. (1998). Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association 93, 710–719.