Keywords:

  • Functional data;
  • Functional logistic regression;
  • Gene expression profile;
  • Local wavelet-vaguelette decomposition;
  • Yeast cell-cycle gene expression data

Abstract

  1. Top of page
  2. Abstract
  3. 1 Introduction
  4. 2 Methods
  5. 3 Analysis of yeast cell-cycle gene expression profiles
  6. 4 Discussion and concluding remarks
  7. Acknowledgements
  8. Appendix
  9. References
  10. Supporting Information

This paper focuses on the problem of functional statistical classification of gene expression curves. A local-wavelet-vaguelette-based functional logistic regression approach is presented. This approach is especially suitable for the classification of non-stationary singular (non-differentiable) curves. The performance of the proposed methodology is illustrated by applying it to the classification of yeast cell-cycle temporal gene expression profiles. A simulation study is also carried out for comparison with other functional classification methodologies.

1 Introduction

Considerable research has been devoted to detecting and quantifying gene expression levels under different scenarios (biological cells). In the recent literature, microarray techniques have come to play an important role in molecular biology research (see, for example, Draghici, 2003; Speed, 2003; Ewens and Grant, 2005). Local variation is an important feature in gene expression microarray data, since under fixed experimental conditions, gene expression levels are given by different quantification values, induced by the external factors acting at the different steps of the biological and statistical procedures involved (see, for instance, Schuchhardt et al., 2000). The design of statistical classification procedures, involving an optimal approximation of gene expression local variability, then becomes crucial. This fact motivates the formulation of the local wavelet-vaguelette-based (LWV-based) functional statistical classification procedure proposed in this paper. The local fitting of the LWV sample projections of gene expression curves leads to a more accurate reproduction of their non-stationary local variation. In the approach presented in this paper, gene expression profiles are interpreted as the trajectories of a continuous Gaussian process. The LWV transform provides a non-redundant description of the second-order (i.e. distributional) properties of a Gaussian stochastic process, in terms of independent random coefficients. The proposed functional logistic discrimination procedure is implemented in terms of the LWV sample projections of gene expression curves. Dimension reduction is achieved by truncation of the infinite-dimensional series defining the LWV expansion.

Alternative functional statistical classification methodologies can be found in Rice and Wu (2000); Aach and Church (2001); Hall et al. (2001); James and Hastie (2001); Liu and Müller (2003); Yao et al. (2003); Zhao et al. (2004); Cardot and Sarda (2005); Müller (2005); Ferraty and Vieu (2006); Leng and Müller (2006a, b); Berlinet et al. (2008); Müller et al. (2008), among others. In particular, wavelet bases for dimension reduction have been considered in Berlinet et al. (2008), where data-driven thresholding techniques are applied in a general setting. Here, the selection problem associated with the choice of an optimal and consistent discrimination procedure, from a given family of classifiers, is addressed in the context of infinite-dimensional explanatory variables and binary responses. The other references cited study alternative bases of functions. For example, Leng and Müller (2006a) apply functional logistic regression in terms of the eigenfunctions of the covariance operator of the explanatory variables, while dimension reduction is achieved in terms of normalized B-splines in Cardot and Sarda (2005) considering the exponential setting. Recently, classical Multidimensional Scaling has been applied for dimension reduction in Wu and Müller (2010), with the aim of finding a low-dimensional configuration of the observed high-dimensional subjects by retaining a given pairwise distance (dissimilarity).

Gene expression analysis plays, in particular, a crucial role in the study and control of the cell-cycle (see Cho et al., 1998; Spellman et al., 1998; Laub et al., 2000; Breyne and Zabeau, 2001; Cho et al., 2001; Rustici et al., 2004; Peng et al., 2005; Kim et al., 2008, among others). In this context, different levels of local structural variability are detected (see Klevecz, 2000, on wavelet analysis of yeast cell-cycle gene expression data and mRNA levels; Holter et al., 2001, on singular-value-based gene expression dynamic models associated with the cell-cycle; Briones and Bosco, 2009, on genome-wide expression noise and decoherence in global gene expression profiles, among others).

Frequently, misclassification problems arise when genes related to different phases display similar expression patterns, e.g. genes associated with phase transitions. Optimal projection methods must then be selected to obtain a reliable reproduction of the non-homogeneous local variation properties of gene profiles. For example, in Leng and Müller (2006a), gene expression curves with high local variation levels are identified as outliers, when Functional Principal Component Analysis (FPCA) is applied. The distributional characteristics of the corresponding random projections do not depend on time (global model fitting). Therefore, the local structural variability induced by the non-homogeneous variance of the underlying stochastic model is not reflected in the approximation provided by FPCA. In contrast, the local fitting performed with our approach, where sample projections depend on time, leads to a closer reproduction of local variation properties. In particular, high local singularity in gene expression curves can be suitably processed in terms of the transformed wavelet bases involved in the LWV decomposition. These two key features of our approach improve on the misclassification rate of G1-phase genes obtained in Leng and Müller (2006a) by applying FPCA to the data set provided by Spellman et al. (1998) (see Section 3). Note that G1 phase constitutes a promising target for the research and treatment of cancer. Additionally, important genes associated with G1 regulation have been shown to play a key role in proliferation, differentiation, oncogenic transformation and programmed cell death (apoptosis). The classification results derived from the application of our approach to the data set obtained by Spellman et al. (1998) are compared with the discrimination results established from functional nonparametric supervised classification (see Section 3.1), in terms of different distances and kernels as given in Ferraty and Vieu (2006). A simulation study is also developed, providing an illustration of different model scenarios where the proposed LWV functional classification can or cannot outperform the nonparametric functional discrimination methodology (see Section 3.2).

1.1 Approach

As noted above, the aim of this paper is to provide a functional statistical classification procedure for discrimination between genes related to G1 phase and genes associated with other regulation phases (non-G1 genes). From now on, genes related to G1 phase will be referred to as genes in the G1 group, and genes associated with the remaining regulation phases as genes in the G0 group. The main steps involved in the proposed LWV-based functional logistic classification methodology are the following:

  • Step 1. Estimate the mean and covariance functions from the observed gene expression profiles.

  • Step 2. Compute the empirical LWV decomposition of gene expression curves.

  • Step 3. Estimate, by iterated weighted least squares, the LWV coefficients of the parameter function associated with the infinite-dimensional generalized linear model considered.

  • Step 4. Compute the parametric estimate of the conditional probability of belonging to G1 phase for each gene expression curve in the sample. Its classification is then established, according to prior probabilities given for each group.

2 Methods

Gene expression profiles are interpreted as independent realizations of a mean-square integrable Gaussian process X(t) on [0, S]. A functional sample of size M of gene expression curves is considered. Specifically, the observation of the i-th sample function at time th is denoted by Xi(th), for h = 1,…,n, and i = 1,…,M. The mean function μX(t) is approximated by a local polynomial kernel estimator equation image, computed from the longitudinal data (see Appendix, and Muller, 1987; Wu and Zhang, 2006). The Epanechnikov kernel is chosen for the estimation of the mean function, and the spherical Epanechnikov kernel for the estimation of the covariance function. In particular, a multivariate locally weighted least squares kernel estimator equation image of the covariance function CX(s, t), based on the spherical Epanechnikov kernel (see Fan and Gijbels, 1996, and Appendix), is computed from the empirical covariance function equation image given by

  • equation image

for h ≠ m, h, m = 1,…,n. Such an estimator, equation image, is evaluated on a grid with N = 2^p, equation image, equally spaced points in [0, S]. The diagonal elements σ^2(t) = CX(t, t), t ∈ [0, S], are approximated by interpolation in terms of equation image, t ∈ [0, S].
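As an illustration of the estimation step described above, the following Python sketch computes a local linear estimate of the mean function with the Epanechnikov kernel, together with a raw covariance matrix at the observation times. It is only a sketch under stated assumptions: the function names, the pooled weighted least-squares formulation, the divisor M in `raw_covariance` and any bandwidth value are illustrative choices, not the estimators given in the Appendix, and the spherical-kernel smoothing of the covariance surface is not reproduced here.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear_mean(times, X, grid, bandwidth):
    """Local linear estimate of the mean function on `grid` (hypothetical helper).

    times: (n,) observation times; X: (M, n) curves; grid: (N,) evaluation points.
    """
    M, n = X.shape
    t_all = np.tile(times, M)   # pooled design points, one copy of `times` per curve
    y_all = X.ravel()           # pooled responses, same row-major order as t_all
    mu_hat = np.empty(len(grid))
    for g, t0 in enumerate(grid):
        w = epanechnikov((t_all - t0) / bandwidth)
        design = np.column_stack([np.ones_like(t_all), t_all - t0])
        WD = design * w[:, None]
        # Weighted least squares: solve (D' W D) b = D' W y; the intercept is the fit at t0.
        coef = np.linalg.lstsq(design.T @ WD, WD.T @ y_all, rcond=None)[0]
        mu_hat[g] = coef[0]
    return mu_hat

def raw_covariance(X, mu_at_times):
    """Raw covariance of the centred curves at the observation times (divisor M assumed)."""
    Z = X - mu_at_times
    return Z.T @ Z / X.shape[0]
```

In this sketch the off-diagonal entries of the raw covariance would be the ones passed on to a bivariate smoother, in line with the restriction h ≠ m and the separate treatment of the diagonal described in the text.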

Remark 2.1 Parameter p is selected according to the local regularity properties (respectively, local singularity features) of the gene expression profiles. Thus, large values of p are needed when function X displays high local singularity, i.e. high local variability that must be collected at the microscale level. The regularity properties of the wavelet functions, as well as the number of their vanishing moments also affect the selection of p. Without prior information on the local variability features of X, p can be chosen by cross-validation. On the other hand, in the kernel-based nonparametric estimation of the covariance function of the functional explanatory variables (gene expression profiles), the bandwidth parameter must also reflect the local regularity (respectively, local singularity features) of the data. That is, small values of the bandwidth parameter are selected when high local variability is displayed by the gene expression curves. Note that, in the Gaussian case considered, there exists a direct connection between local regularity properties of the sample curves and local regularity of the covariance function of the explanatory variables (see Adler, 1981). Parameter p and the bandwidth are then closely related, since when p increases the bandwidth must decrease. In the particular case where the mother wavelet displays similar regularity properties, and vanishing moment conditions to the Epanechnikov kernel, an explicit inverse relationship can be obtained between p and the bandwidth parameter. When prior information on the local regularity of the values of the functional explanatory variables is not available, cross-validation methodology is applied for testing the bandwidth parameter fitting.

2.1 Multiresolution-like analysis

The LWV decomposition is based on the factorization of the covariance function. The empirical eigenvalues equation image, and the corresponding empirical eigenvectors equation image, of the estimated covariance matrix equation image, allow us to construct the empirical kernel equation image factorizing the covariance function equation image as follows (see Appendix where the description of the theoretical approach is given):

  • equation image

for h, m = 1,…,N. The kernel lX defining the inverse equation image of operator equation image, with kernel tX, can then be formally approximated as

  • equation image

for h,m = 1,…,N. The empirical LWV functions are then computed in terms of kernels equation image, equation image, and a given orthonormal wavelet basis. We have chosen the Haar system with the father wavelet, ϕ(x) = I[0,1)(x), and the mother wavelet, ψ(x) = I[0,1/2)(x)−I[1/2,1)(x) (see Vidakovic, 2006). In the following, for a given S>0, and for each x ∈ [0,S], denote, for equation image, by equation image the translated Haar father wavelet ϕ at resolution level j0, and, for k = 1,…,N(j), j ≥ j0, denote by equation image the translated mother wavelet ψ at resolution level j (see Appendix A.2).
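The covariance factorization described above can be sketched numerically as follows. The exact empirical kernels are given by equations that are not reproduced in this text, so the sketch assumes the symmetric square-root factorization built from the eigenvalues and eigenvectors of the estimated covariance matrix; the function name `factorize_covariance` and the truncation tolerance `tol` are hypothetical.

```python
import numpy as np

def factorize_covariance(C, tol=1e-10):
    """Return (T, L) with T @ T.T ~= C and L the (pseudo-)inverse of T.

    Assumption: T = V diag(sqrt(lambda)) V' and L = V diag(1/sqrt(lambda)) V',
    where (lambda, V) are the eigenvalues/eigenvectors of the covariance matrix C.
    """
    C = 0.5 * (C + C.T)                 # enforce symmetry against rounding errors
    lam, V = np.linalg.eigh(C)
    keep = lam > tol * lam.max()        # drop numerically null directions
    lam, V = lam[keep], V[:, keep]
    T = (V * np.sqrt(lam)) @ V.T        # kernel factorizing the covariance
    L = (V / np.sqrt(lam)) @ V.T        # kernel of the inverse operator
    return T, L
```

Discarding near-zero eigenvalues keeps the inverse kernel stable when the estimated covariance matrix is close to singular, at the cost of working on a reduced subspace.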

Remark 2.2. The wavelet basis must be selected according to the local regularity and temporal correlation properties characterizing gene expression profiles. We consider the Daubechies wavelet family because of its desirable properties in relation to our approach. Specifically, these wavelets are orthogonal and compactly supported. The first feature ensures that the LWV coefficients are uncorrelated (see Appendix), providing a non-redundant description of gene expression curves. Moreover, the compact support of these wavelet functions avoids border effects in our analysis. Finally, we highlight the easy identification of their local regularity properties and moment conditions (support length) in terms of a unique parameter, which constitutes a crucial feature in our approach for the appropriate selection of a wavelet basis. Specifically, we consider the empirical Hölder spectra of gene expression profiles for identification of the suitable order V of vanishing moments, characterizing the basis to be selected from the Daubechies wavelet family. Throughout Section 3, in the statistical analysis of the data set obtained by Spellman et al. (1998), and in the simulation study developed in Section 3.2, the Haar wavelet basis (the Daubechies wavelet with V = 1 vanishing moment) is considered. Therefore, we will refer to this basis in the subsequent development.
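For concreteness, the Haar father and mother wavelets used in the sequel can be written in a few lines of Python. This is a minimal sketch: the normalization 2^(j/2)/sqrt(S) and the indexing of translations from k = 0 in the hypothetical helper `dilate_translate` are assumptions, since the precise convention is stated in Appendix A.2, which is not reproduced here.

```python
import numpy as np

def haar_father(x):
    """Haar father wavelet: indicator of [0, 1)."""
    x = np.asarray(x, dtype=float)
    return ((x >= 0.0) & (x < 1.0)).astype(float)

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), zero elsewhere."""
    x = np.asarray(x, dtype=float)
    return (((x >= 0.0) & (x < 0.5)).astype(float)
            - ((x >= 0.5) & (x < 1.0)).astype(float))

def dilate_translate(wavelet, j, k, x, S=1.0):
    """Level-j, shift-k version on [0, S): 2^(j/2) / sqrt(S) * wavelet(2^j x / S - k)."""
    x = np.asarray(x, dtype=float)
    return 2.0 ** (j / 2.0) / np.sqrt(S) * wavelet(2.0 ** j * x / S - k)

# Numerical check of orthonormality on a fine dyadic grid of midpoints in [0, 1).
x = (np.arange(1024) + 0.5) / 1024
dx = 1.0 / 1024
psi_2_1 = dilate_translate(haar_mother, 2, 1, x)
psi_3_2 = dilate_translate(haar_mother, 3, 2, x)
print(np.sum(psi_2_1 * psi_2_1) * dx, np.sum(psi_2_1 * psi_3_2) * dx)
```

Because Haar functions are piecewise constant on dyadic intervals, the midpoint sums above reproduce the corresponding inner products up to floating-point rounding.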

For each t ∈ [0, S], and h = 1,…,N, denoting, as before, by j0, the resolution level selected as coarsest scale, we have

  • equation image
  • equation image

In matrix form, for each t ∈ [0, S], we denote, for equation image, by equation image the vector with entries equation image equation image, given by the product of the matrix equation image, with equation image, for h,m = 1,…,N, and the vector equation image, with equation image, for m = 1,…,N. Similarly, for j = j0,…,p−1, denote by equation image the matrix with entries equation image, for h = 1,…,N, k = 0,…,2^j−1, where equation image, equation image, and where, for each equation image, equation image is the vector with entries lm,k = ψj,k(tm), for m=1,…,N. Additionally, for each equation image, the vector equation image has entries equation image, equation image, and, for each j = j0,…,p−1, the matrix equation image has entries equation image, for m = 1,…,N, and k=0,…,2^j−1, with equation image. The following local empirical coefficients are computed at time t ∈ [0, S], for each sample curve Xi,

  • equation image(1)

where equation image denotes the locally re-scaled empirical dual Riesz basis of equation image(see Appendix). The M sample curves can then be approximated in terms of the following empirical LWV decomposition: For i = 1,…,M, and for each t ∈ [0, S]

  • equation image(2)

This decomposition will be considered in the implementation of the functional logistic regression in the following section.

2.2 Functional logistic regression

Generalized Linear Models (GLM) (see, for example, McCullagh and Nelder, 1989; Dobson and Barnett, 2008) constitute a flexible extension of classical linear models where the response variables, Y1,…,YM, are independent and identically distributed (i.i.d.), with probability distribution in the exponential family. The logistic regression model is defined in terms of a set of parameters β1, …,βH, the explanatory variables equation image, the logit link function g(x) = log(x/(1−x)), such that E(Yi | xi1,…,xiH) = μiY = g^−1(ηi), where ηi = α + ∑j xijβj, i = 1,…,M, with α being a constant. The response variable Yi is then conditionally distributed as a Bernoulli with mean μiY and variance μiY(1−μiY), for i = 1,…,M.
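The finite-dimensional logistic model above is typically fitted by iteratively reweighted least squares, the algorithm referred to in Step 3 of Section 1.1. The sketch below is a generic textbook version under assumptions: the function name `logistic_irls` is hypothetical, the stopping rule monitors the change in the parameters (the article reports monitoring the deviance instead), and the weights are clipped away from zero for numerical safety.

```python
import numpy as np

def logistic_irls(X, y, max_iter=100, tol=1e-8):
    """Fit eta_i = alpha + sum_j x_ij beta_j by IRLS; returns (alpha, beta_1, ..., beta_H)."""
    M = X.shape[0]
    D = np.column_stack([np.ones(M), X])             # design matrix with intercept
    theta = np.zeros(D.shape[1])
    for _ in range(max_iter):
        eta = D @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))              # inverse logit link
        w = np.clip(mu * (1.0 - mu), 1e-10, None)    # Bernoulli variance function
        z = eta + (y - mu) / w                       # working response
        theta_new = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * z))
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    return theta
```

At convergence the fitted means satisfy the score equation D'(y − μ) = 0, which is the finite-dimensional analogue of the score equation solved locally in the functional model.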

In the functional formulation of the above-described logistic regression model (see, for example, Müller and Stadtmüller, 2005), the parameter β is a square integrable function, and the explanatory variables Xi, i = 1,…,M, are functional variables with values in a separable Hilbert space H, e.g. H = L2([0, S]), with [0, S], as before, being a real interval, and

  • equation image(3)

where α is a constant. The linear predictors ηi, i = 1,…,M, define the conditional mean and variance, E(Yi | Xi(t)) = μiY = g^−1(ηi) and Var(Yi | Xi(t)) = σ^2(μiY) = μiY(1−μiY), for each binary response variable Yi, i = 1, …,M, through the logit function g. Thus,

  • equation image

where the errors ei, i = 1,…,M, are considered to be independent random variables with zero-mean and finite variance.

Due to the square integrability of β, this parameter function admits the local decomposition:

  • equation image(4)

in terms of the locally scaled dual Riesz bases equation image and equation image. In the development below, the local Fourier coefficients of parameter function β, with respect to the empirical locally scaled basis equation image, will be denoted as equation image, for equation image, and equation image, for equation image, equation image, and for each t ∈ [0, S]. The previously derived approximations of equation image from (2), and of β(t) from (4), in terms of the empirical LWV dual Riesz bases, lead to the following local approximation equation image of ηi, for i = 1,…,M,

  • equation image

The functional model is then locally reduced to a generalized linear model (see McCullagh and Nelder, 1989; Dobson and Barnett, 2008), for each t ∈ [0, S], where the parameters

  • equation image

are estimated by solving the score equation:

  • equation image(5)

where σ^2(μiY) = μiY(1−μiY), and equation image denotes the transpose of the vector of Fourier coefficients of Zi(t) with respect to the empirical LWV basis equation image. Define now the vector

  • equation image(6)

A prior probability p0 is considered for G0 memberships, and similarly, a prior probability p1 is considered for G1 memberships. Thus, if for the i-th observation Xi(t), equation image, the sample curve Xi(t) is a member of G1. Otherwise, it belongs to G0. Here,

  • equation image

where

  • equation image(7)

Remark 2.3 The approximation of p1 and p0 can be achieved from the respective mean proportions of sample genes clustered into each group, G1 and G0, after applying supervised or unsupervised clustering over a set of learning samples. The main drawback in the application of supervised clustering is that two target or reference gene expression groups must be previously established. The gene target subset selection problem has been studied from different perspectives. For example, Van Der Laan and Bryan (2001) propose a deterministic rule applied to the parameters of the gene expression distribution, previously estimated by parametric bootstrap, under the Gaussian assumption. More sophisticated methods related to sample-to-group regulation probability estimation can also be considered (see Wang and Huang, 2005, who apply the maximum likelihood criterion for estimation) in the supervised organization of gene expression profiles into two groups.

When no prior information is available on the number of members of each group, G1 and G0, unsupervised clustering can be applied (see Kaufman and Rousseeuw, 1990; Eisen et al., 1998, among others). In Section 3.2, k-means clustering, with k = 2, based on the standard correlation coefficient, will be implemented to organize into G1 and G0 groups the genes with similar expression patterns. Specifically, denoting by N0 the number of genes clustered into the G0 group, and by N1 the number of genes clustered into the G1 group, the respective probabilities p0 and p1 of belonging to each group are approximated as follows:

  • equation image(8)

Note that unsupervised clustering, as an exploratory data analysis tool, provides a preliminary organization of data for subsequent statistical analysis (or discrimination). It is worth noting that any unsupervised clustering based on shape similarity can be applied for the approximation of the group probabilities, p0 and p1, from Eq. (8) in Section 3.2 (see, for example, Hestilow and Huang, 2009).
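The approximation of the group probabilities by unsupervised clustering can be sketched as follows. This is an illustration under assumptions, not the implementation used in the paper: rows are standardized so that squared Euclidean distance equals 2(1 − correlation), a plain Lloyd-type 2-means iteration with a deterministic initialization is used, and the priors are taken as the cluster proportions N0/(N0 + N1) and N1/(N0 + N1), which is how Eq. (8) is read here. The function names are hypothetical, and which cluster label corresponds to G1 must be decided separately.

```python
import numpy as np

def standardize_rows(X):
    """Centre each profile and scale it to unit norm (profiles must not be constant)."""
    Z = X - X.mean(axis=1, keepdims=True)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def two_means_priors(X, n_iter=100):
    """Correlation-based 2-means; returns (labels, p0, p1) with p = cluster proportions."""
    Z = standardize_rows(np.asarray(X, dtype=float))
    far = np.argmin(Z @ Z[0])                   # profile least correlated with the first one
    centres = np.stack([Z[0], Z[far]])
    labels = np.zeros(len(Z), dtype=int)
    for it in range(n_iter):
        d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                               # assignments are stable
        labels = new_labels
        for c in (0, 1):
            if np.any(labels == c):             # keep the old centre if a cluster empties
                centres[c] = Z[labels == c].mean(axis=0)
    n0, n1 = np.sum(labels == 0), np.sum(labels == 1)
    return labels, n0 / (n0 + n1), n1 / (n0 + n1)
```

Averaging the returned proportions over several learning samples gives the mean proportions mentioned in Remark 2.3.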

3 Analysis of yeast cell-cycle gene expression profiles

We consider the data set obtained by Spellman et al. (1998) consisting of M = 90 gene expression profiles (α factor synchronized) involved in the yeast cell-cycle regulation. The gene expression is measured every 7 min between 0 and S = 119 min (both time instants included). Thus, n = 18 observations are available for each gene. Since it is known that 44 of these genes are related to G1 phase regulation, and 46 to the S, S/G2, G2/M and M/G1 phases, Eq. (8) is applied with N0 = 46, and N1 = 44.
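The counts quoted above can be checked with a few lines of Python. The sampling grid and group sizes are taken from the text; reading Eq. (8) as the group proportions N0/M and N1/M is an assumption.

```python
import numpy as np

S, step = 119, 7
times = np.arange(0, S + step, step)   # 0, 7, ..., 119 (both end points included)
n = times.size                         # number of observations per gene

N0, N1 = 46, 44                        # non-G1 and G1 genes
M = N0 + N1
p0, p1 = N0 / M, N1 / M                # assumed form of Eq. (8): group proportions

print(n, M, round(p0, 3), round(p1, 3))   # 18 90 0.511 0.489
```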

Figure 1 displays the original data with their approximation in terms of the LWV decomposition, considering a grid with N = 64 = 2^6 equally spaced time points. Convergence of the iterated weighted least squares algorithm is achieved for every point t on the grid, after 100 iterations controlled by the deviance. Estimate equation image for N=64 equally spaced time points, and the mean vector over time, equation image, are displayed in Fig. 1.

Figure 1. Left panel: Temporal gene expression profiles of yeast cell cycle. Right panel: Reconstruction of the temporal gene expression profiles in left panel using an orthogonal expansion from wavelet-vaguelette transform. Dashed lines: genes peaking in G1 phase; gray solid lines: genes peaking in non-G1 phase; black solid line: estimated mean curve.

In order to measure the accuracy of the model, the Leave-One-Out Cross-Validation (LOOCV) error rate is obtained (see Picard and Cook, 1984). When the i-th gene is left out, the estimates of the mean, covariance function, and the parameter vector β are computed from a functional sample of 89 gene expression curves. These estimates are tested considering the approximation ηi of η, based on the training sample formed by the 89 gene expression curves that remain after removing the i-th gene. For each t ∈ [0, S], the LWV coefficients of the i-th gene, which acts as validation data, are computed as follows:

  • equation image

where equation image, equation image and equation image, equation image, equation image, are constructed from the estimated mean function equation image, and the empirical covariance function computed from the training set constituted by 89 genes (see Eq. (1)). This procedure is repeated with every gene, i.e. for equation image if equation image, the i-th gene is a member of G1, otherwise it is considered as a member of G0. Here, equation image is obtained from the training sample, equation image is defined as in (6), considering the score equation (5) based on the training set, and equation image is given as in Eq. (7) from the i-th gene, for i = 1,…,M. The LOOCV error rate is computed as the quotient between the total number of misclassified genes under LOOCV, and the total number of genes. The accuracy of the model is reflected in an LOOCV error rate CVE = 9/90 = 0.1, corresponding to three misclassified G1 genes plus six misclassified G0 genes.
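The leave-one-out procedure described above follows a generic pattern that can be sketched independently of the classifier. In the sketch below, `fit` and `predict` are hypothetical callables standing in for the training and classification steps; the nearest-centroid pair is only a stand-in used to make the example runnable and is not the LWV-based classifier.

```python
import numpy as np

def loocv_error_rate(X, y, fit, predict):
    """Fraction of curves misclassified when each one is left out in turn."""
    M = len(y)
    errors = 0
    for i in range(M):
        train = np.arange(M) != i                  # all curves except the i-th
        model = fit(X[train], y[train])            # re-estimate everything on M - 1 curves
        errors += int(predict(model, X[i:i + 1])[0] != y[i])
    return errors / M

# Stand-in classifier (assumption, for illustration only): nearest group centroid.
def fit_centroids(X, y):
    return {g: X[y == g].mean(axis=0) for g in np.unique(y)}

def predict_centroids(model, X):
    groups = np.array(list(model))
    d = np.stack([np.linalg.norm(X - model[g], axis=1) for g in groups], axis=1)
    return groups[d.argmin(axis=1)]
```

With M = 90 curves and nine misclassifications, the function would return 9/90 = 0.1, the error rate reported in the text.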

The three G1 genes misclassified by our approach are YCL055W, YDL227C and YJL092W, whereas the approach presented in Leng and Müller (2006a) misclassifies five: YCL055W, YDR113C, YDR356W, YJL092W and YDL055C. Regarding the two genes misclassified by both approaches, YCL055W (KAR4) and YJL092W (SRS2), Spellman et al. (1998) comment that the first one should be “induced by α factor and so it is very strongly expressed at the beginning of the α factor experiment”. Leng and Müller (2006a) suggest that SRS2 is close to the trajectories of S genes, and KAR4 is close to M/G1 phase regulation, as can be seen in Fig. 3. On the other hand, the gene misclassified only by our approach, YDL227C, is upregulated in late G1 phase. Its association with a phase transition of the cell cycle makes its discrimination more difficult, since its expression pattern is similar to that of S genes. Moreover, a missing expression value of this profile at time t = 7 min increases the difficulty of the discrimination. This affects our classification results more strongly, since our (time-dependent) local fitting of the wavelet-vaguelette sample projections of gene profiles is less robust against missing data. (Note that in Leng and Müller (2006a) gene sample projections are independent of time.) This is the price paid for improving the approximation of the local variation of gene trajectories, in order to obtain a better discrimination based on expression patterns.

3.1 Comparison with a functional nonparametric supervised classification

In this section, the nonparametric functional statistical classification methodology introduced in Ferraty and Vieu (2006) is applied to the data set obtained by Spellman et al. (1998). This classification procedure is based on the Bayes rule. Given a functional object equation image in E, a Euclidean space, the purpose is to estimate the posterior probabilities of belonging to each group Gj, for j = 1,…,g, i.e. to estimate the posterior probabilities of belonging to each element of a given set of groups equation image. Such probabilities are defined by equation image, for j = 1,…,g. Once these posterior probabilities are estimated, equation image, the classification rule consists of assigning an incoming functional observation equation image to the class with the highest estimated posterior probability:

  • equation image

In Ferraty and Vieu (2006), the following nonparametric kernel estimators pGjLCV(·), equation image of the probabilities of belonging to each group are introduced:

  • equation image

where, for i = 1,…,M, xi is considered as the gene-expression time-course from the i-th gene consisting of 18 observations, as we pointed out before. Here, K is an asymmetrical kernel, d is a semi-metric, i0 = arg min over i = 1,…,M of d(x, xi), and hLCV(xi0) is the bandwidth corresponding to the optimal number of neighbors at xi0, obtained by the cross-validation procedure described in Ferraty and Vieu (2006). For j = 1,…,g, the estimator of equation image is a functional version of a local polynomial kernel smoother considering a zero-degree polynomial, i.e. the Nadaraya–Watson estimator (see Appendix in relation to the mean and covariance estimation from local polynomial modeling). Since classification is performed with the Bayes rule, considering equation image we have that equation image>equation image means that the m-th observation comes from G0, otherwise it comes from G1. Table 1 shows the LOOCV error rate using quadratic, triangle and box asymmetrical kernels. Two empirical versions of the semi-metrics given by the classical L2-metric, and the principal component analysis (PCA) based semi-metric, dqPCA, with

  • equation image

are considered. Here, [vk]j is the j-th component of the k-th eigenvector of the covariance matrix equation image, corresponding to the k-th eigenvalue λk, with equation image. The results displayed in Table 1 show that the best performance corresponds to the semi-metric defined from PCA considering four eigenvectors, denoted as PCA4. Note that the results obtained from the application of LWV-based functional logistic classification lead to an LOOCV error rate, CVE = 0.1, which outperforms the nonparametric functional classification in all the cases studied (Fig. 2).

Table 1. Performance of the estimator equation image, equation image, tested in the yeast cell-cycle database considering LOOCV. The error rates obtained with each kernel and semi-metric are shown.

Kernel      L2 metric   PCA4    PCA5    PCA6    PCA7    PCA8
Quadratic   26/90       18/90   20/90   28/90   23/90   26/90
Triangle    27/90       18/90   20/90   27/90   24/90   26/90
Box         24/90       18/90   20/90   28/90   25/90   28/90
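The kernel estimator of the posterior probabilities used in this comparison can be illustrated by the following sketch, which implements a Nadaraya–Watson-type classifier with a nearest-neighbour bandwidth. Several simplifications are assumptions: the number of neighbours `k` is supplied by the user instead of being selected by local cross-validation, only the discrete L2 distance is used as semi-metric, the quadratic kernel is left unnormalized (the constant cancels in the ratio), and the function names are hypothetical.

```python
import numpy as np

def quadratic_kernel(u):
    """Asymmetrical quadratic kernel supported on [0, 1] (unnormalized)."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0.0) & (u <= 1.0), 1.0 - u**2, 0.0)

def l2_semimetric(x, X):
    """Discrete L2 distance between the curve x and every row of X."""
    return np.sqrt(((X - x) ** 2).sum(axis=1))

def posterior_estimates(x, X_train, y_train, k, n_groups=2):
    """Kernel estimates of P(group = g | x), g = 0, ..., n_groups - 1.

    The bandwidth is the distance to the (k+1)-th nearest training curve,
    so that the k nearest curves receive positive weight (requires k < M).
    """
    d = l2_semimetric(x, X_train)
    h = np.sort(d)[k]
    w = quadratic_kernel(d / h)
    return np.array([w[y_train == g].sum() for g in range(n_groups)]) / w.sum()
```

The Bayes rule then assigns x to the group with the largest estimated posterior probability, e.g. `posterior_estimates(x, X_train, y_train, k=5).argmax()`.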

Figure 2. Components of equation image for 64 equally spaced points t ∈ [0, 119]; the coarsest line is the mean vector equation image. Note that the first coefficients have the greatest influence on the response variable.

Figure 3. Profiles of misclassified G1 genes. Left panel: Gray solid lines: G1 genes; dashed lines: S genes; black solid line: misclassified gene SRS2 (YJL092W). Right panel: Gray solid lines: G1 genes; dashed lines: M/G1 genes; black solid line: misclassified gene KAR4 (YCL055W).

Figure 4. Functional estimate equation image (red), and the original function β (blue) are displayed: Top panel: From 1000 sample curves; Middle panel: From 500 sample curves; Bottom panel: From 100 sample curves.

3.2 Simulation study

To investigate the general performance of the proposed methodology, we have carried out a simulation study in terms of some well-known families of stochastic processes. The proposed LWV-based functional logistic discrimination is applied. The results obtained are compared with those derived from the application of nonparametric supervised classification.

First, let us consider fractional Brownian motion as the basic model for generating gene expression profiles. Specifically, M gene expression profiles of length N = 2^6, from the sample paths of fractional Brownian motion with Hurst parameter 0.3, are generated. The Haar wavelet transform is then applied. The selected coarsest scale j0 (i.e. lowest resolution level) is j0 = 6. Thus, 2^6 translations of the Haar father wavelet ϕ are considered for generating the draft of each gene expression profile in the space V0 (see, for instance, James et al. 2009). The parameter function is given by:

  • equation image

Then, η sample values are generated from the inner product of the simulated gene expression profiles and the selected β function, for a fixed value of parameter α in (3). The probabilities

  • equation image

are computed. The response vector components Yi, i = 1,…,M, are obtained from Bernoulli distributions with respective probabilities πi, i = 1,…,M.
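The data-generating mechanism just described can be sketched in Python as follows. Several ingredients are assumptions made only for illustration: the time domain is taken as (0, 1], fractional Brownian motion is simulated by a Cholesky factorization of its covariance, the inner product is approximated by a Riemann sum, and the parameter function `beta` passed by the user is hypothetical, since the β used in the paper is given by an equation not reproduced in this text.

```python
import numpy as np

def fbm_covariance(t, hurst):
    """Covariance of fractional Brownian motion: 0.5 (s^2H + u^2H - |s - u|^2H)."""
    s, u = np.meshgrid(t, t, indexing="ij")
    return 0.5 * (s ** (2 * hurst) + u ** (2 * hurst) - np.abs(s - u) ** (2 * hurst))

def simulate_fbm(M, N, hurst, rng, S=1.0):
    """M sample paths observed at N equally spaced times in (0, S]."""
    t = np.arange(1, N + 1) * S / N          # t = 0 is skipped: the variance there is zero
    chol = np.linalg.cholesky(fbm_covariance(t, hurst))
    return t, rng.standard_normal((M, N)) @ chol.T

def simulate_responses(X, t, beta, alpha, rng):
    """Bernoulli responses with logit-linear success probabilities."""
    dt = t[1] - t[0]
    eta = alpha + (X * beta(t)).sum(axis=1) * dt     # Riemann sum for the inner product
    pi = 1.0 / (1.0 + np.exp(-eta))                  # inverse logit link
    return pi, rng.binomial(1, pi)

rng = np.random.default_rng(0)
t, X = simulate_fbm(M=100, N=64, hurst=0.3, rng=rng)
pi, y = simulate_responses(X, t, beta=lambda s: np.sin(2 * np.pi * s), alpha=0.0, rng=rng)
```

The Cholesky approach is exact for the chosen grid and is adequate for N = 64; for much longer paths a circulant-embedding method would be a more economical choice.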

The LWV functional logistic regression and the functional nonparametric supervised classification are applied, considering functional samples of size M = 100, 500, 1000 gene expression profiles, which are generated at n = 64 time points. Specifically, the mean LOOCV error rates obtained after applying LWV functional logistic regression and nonparametric supervised classification are displayed in Table 2 for functional samples of size M = 100, and in Table 3 for functional samples of sizes M = 500 and M = 1000. In the application of the proposed LWV methodology, the group probabilities are previously estimated as the mean proportion of G1 and G0 memberships over ten generated learning samples of size 1000, after applying k-means clustering, with k = 2. The corresponding estimated values obtained are equation image and equation image.

Table 2. Fractional Brownian motion: LWV functional logistic regression and functional nonparametric supervised classification results. The mean LOOCV error rates are shown, from a functional sample of size M=100 curves measured at n=64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 100       0.28           Q        0.39        0.48   0.46   0.47   0.47   0.48
                             T        0.42        0.47   0.45   0.47   0.46   0.48
                             B        0.4         0.47   0.46   0.48   0.47   0.48

Table 3. Fractional Brownian motion: LWV functional logistic regression and functional nonparametric supervised classification results. The mean LOOCV error rates are shown, from functional samples of respective sizes M=500 curves measured at n=64 times, and M=1000 curves measured at n=64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 500       0.13           Q        0.38        0.46   0.47   0.47   0.45   0.46
                             T        0.4         0.45   0.45   0.46   0.44   0.45
                             B        0.38        0.46   0.46   0.47   0.45   0.46
M = 1000      0.1            Q        0.37        0.47   0.47   0.47   0.47   0.47
                             T        0.4         0.45   0.45   0.45   0.45   0.46
                             B        0.39        0.47   0.47   0.46   0.46   0.47

It can be observed that LWV functional logistic regression provides better results than nonparametric supervised classification when fractal expression patterns, displaying high local variability, are studied. Note that fractional Brownian motion is an example of a Gaussian process displaying fractal features and long-range dependence, due to its self-similar behavior. The performance of the functional LWV classification procedure against functional nonparametric supervised discrimination is now tested for Gaussian models with fractal and long-range dependence features in the non-self-similar case, where micro-scale and macro-scale properties do not coincide. Since LWV provides a local fitting in terms of wavelet-like functions that suitably reproduce the local self-similar behavior of fractal processes, LWV functional logistic regression also outperforms nonparametric supervised classification in this case. This fact is illustrated with the following example, given in terms of the centered Gaussian Linnik process with fractal covariance function:

  • equation image

We have considered the parameter values α = 1/3 and σ = 25. The estimated probabilities of the G1 and G0 groups, after applying k-means clustering, with k = 2, over ten generated learning samples of size 1000, are equation image and equation image.

The results obtained after applying the LWV and functional nonparametric classification methodologies are displayed in Tables 4 and 5. It can be seen that LWV outperforms functional nonparametric supervised classification in all the cases considered, for functional samples of respective sizes M = 100, 500, 1000.

Table 4. Gaussian Linnik process: LWV functional logistic regression and functional nonparametric supervised classification results.
Note: The mean LOOCV error rates are shown, from a functional sample of size M = 100 curves measured at n = 64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 100       0.42           Q        0.43        0.45   0.46   0.46   0.46   0.45
                             T        0.43        0.45   0.45   0.46   0.46   0.45
                             B        0.43        0.45   0.46   0.47   0.46   0.45

Table 5. Gaussian Linnik process: LWV functional logistic regression and functional nonparametric supervised classification results.
Note: The mean LOOCV error rates are shown, from functional samples of respective sizes M = 500 and M = 1000 curves, each measured at n = 64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 500       0.35           Q        0.43        0.41   0.42   0.41   0.41   0.41
                             T        0.43        0.41   0.41   0.41   0.41   0.41
                             B        0.42        0.41   0.42   0.41   0.41   0.41
M = 1000      0.32           Q        0.42        0.41   0.41   0.41   0.41   0.41
                             T        0.42        0.41   0.41   0.41   0.41   0.40
                             B        0.42        0.41   0.41   0.41   0.41   0.41

Alternatively, to illustrate the most favorable cases for the nonparametric methodology, we consider the centered Gaussian Exponential model with covariance function:

  • equation image

The parameter values considered are σ² = 1 and a = 3. The estimated probabilities of the G1 and G0 groups, after applying k-means clustering, with k = 2, over ten generated learning samples of size 1000, are equation image and equation image. The results obtained with this model are displayed in Tables 6 and 7, where functional nonparametric supervised classification performs better than LWV functional discrimination. Note that, as the functional sample size increases, the mean misclassification error rate stabilizes earlier with the nonparametric supervised classification methodology.

Table 6. Gaussian Exponential process: LWV functional logistic regression and functional nonparametric supervised classification results.
Note: The mean LOOCV error rates are shown, from a functional sample of size M = 100 curves measured at n = 64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 100       0.44           Q        0.35        0.38   0.40   0.41   0.38   0.39
                             T        0.40        0.37   0.39   0.40   0.36   0.37
                             B        0.38        0.39   0.39   0.40   0.37   0.38

Table 7. Gaussian Exponential process: LWV functional logistic regression and functional nonparametric supervised classification results.
Note: The mean LOOCV error rates are shown, from functional samples of respective sizes M = 500 and M = 1000 curves, each measured at n = 64 times (Q, T and B correspond, respectively, to the asymmetrical quadratic, triangular and box kernels).

Sample size   LWV approach   Nonparametric approach
                             Kernel   L2 metric   PCA4   PCA5   PCA6   PCA7   PCA8
M = 500       0.46           Q        0.42        0.37   0.37   0.37   0.37   0.37
                             T        0.42        0.37   0.37   0.37   0.37   0.37
                             B        0.40        0.37   0.37   0.37   0.37   0.37
M = 1000      0.45           Q        0.40        0.41   0.41   0.41   0.41   0.41
                             T        0.40        0.42   0.41   0.41   0.41   0.41
                             B        0.41        0.41   0.41   0.41   0.40   0.40

4 Discussion and concluding remarks

Gene expression levels play a crucial role in the regulation of biological processes such as the cell cycle, and in the control of disease. In particular, the stochastic variation of gene expression levels in different cells of the same population under identical growth conditions must be carefully studied, since noise variability can be amplified over time, as a consequence of decoherence in global gene expression profiles (see, for example, Briones and Bosco, 2009). That is, a slowdown is observed in the decay of the autocorrelation function of yeast genome-wide expression profiles, due to the stochastic component of transcription, leading to fluctuations that tend to be amplified as time progresses. This feature requires the introduction of more singular models than the classical differentiable models associated with smoothing gene expression data. In this paper, a local wavelet-vaguelette-based approach is considered in the representation of gene expression profiles, since it holds for a larger class of stochastic processes, not necessarily stationary, satisfying weaker regularity and covariance moment conditions. Such a class includes processes with slowly decaying autocorrelation functions and high local singularity. In particular, generalized random fields are also included (see, for instance, Kelbert et al., 2005).

The main improvements achieved with our approach are related to the optimal processing (dimension reduction technique based on uncorrelated random Fourier coefficients) of gene expression profiles in the nonstationary case. In comparison with alternative functional classification procedures, such as functional nonparametric supervised classification, our classification methodology performs best when fractal patterns are present in gene expression profiles. We highlight the suitability of our approach when discrimination between gene phase regulation is based on the differences observed in the non-stationary local variability of the gene expression curves.

Finally, the MATLAB codes implementing the LWV functional discrimination procedure and the functional nonparametric supervised classification techniques, as applied in the statistical analysis of the real data example and the simulation study, are provided as Supporting Information.

Acknowledgements

This work has been partially supported by projects MTM2009-13393 of the DGI, MEC, and P09-FQM-5052 of the Andalusian CICE, Spain, and by the Fundación para el futuro de Colombia, Colfuturo; and Fundación Carolina.

Conflict of interest

The authors have declared no conflict of interest.

Appendix

A.1 Local polynomial modelling to estimate the mean function and the covariance surface

We consider M sample curves as independent realizations of a mean-square integrable stochastic process X(t) on [0, S]. Let y_{ij} be the observation of the i-th sample function at time t_j. For i = 1,…,M and j = 1,…,n, we define the vectors with entries x_{ij} = [1, t_j − t]^T. Also, for i = 1,…,M, we consider X_i = [x_{i1},…,x_{in}]^T and equation image where equation image is the Epanechnikov kernel, and h is the bandwidth chosen by cross-validation. The estimator equation image of the mean function μ_X, based on the M sample curves (Müller, 1987; Wu and Zhang, 2006), is the first component of the vector equation image that minimizes

  • equation image

for equation image, Kh the diagonal block matrix equation image and equation image, with equation image. Thus,

  • equation image
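A local linear estimate of the mean function of this kind can be sketched in Python as follows. This is a minimal illustration, assuming that all curves are observed at common time points, that the kernel is the univariate Epanechnikov kernel K(u) = 0.75(1 − u²) on |u| ≤ 1, and that the bandwidth `h` is supplied by the user rather than chosen by cross-validation; the function names are hypothetical. The estimate at t is the intercept of the kernel-weighted least-squares line fitted to the pooled observations.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_mean(t, times, curves, h):
    """Local linear estimate of the mean function at t from pooled curves.

    Minimizes sum_i sum_j K_h(t_j - t) * (y_ij - b0 - b1 * (t_j - t))**2
    and returns b0, the first component of the minimizer.
    """
    s0 = s1 = s2 = r0 = r1 = 0.0
    for y in curves:
        for tj, yij in zip(times, y):
            d = tj - t
            w = epanechnikov(d / h) / h
            s0 += w
            s1 += w * d
            s2 += w * d * d
            r0 += w * yij
            r1 += w * d * yij
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("bandwidth too small: fewer than two effective points")
    # Closed-form solution of the 2x2 weighted normal equations for b0.
    return (s2 * r0 - s1 * r1) / det


# Toy check with noise-free linear curves (hypothetical data).
grid = [j / 10.0 for j in range(11)]
linear_curves = [[2.0 + 3.0 * tj for tj in grid] for _ in range(3)]
estimate_mid = local_linear_mean(0.5, grid, linear_curves, 0.35)
estimate_edge = local_linear_mean(0.0, grid, linear_curves, 0.35)
```

A local linear fit reproduces linear mean functions exactly, including at the boundary of the interval, which is one common reason to prefer it over a local constant (kernel average) fit.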

Similarly, let x_{j,k} = [1, t_j − s, t_k − t]^T for j, k = 1,…,n, j ≠ k, and K_B(x) = K(B^{−1}x)/det(B), with K being the spherical Epanechnikov kernel, and B being a diagonal bandwidth matrix chosen properly by cross-validation. The estimated covariance surface equation image (Fan and Gijbels, 1996) is given by the first component of the vector equation image, specifically

  • equation image

where

  • equation image

with, as before, equation image denoting the empirical covariance function not considering the diagonal.

A.2 Discrete wavelet transform

A multiresolution analysis of equation image is defined as an increasing sequence equation image of closed linear subspaces of equation image with the following properties:

  • (i)
    equation image, equation image,
  • (ii)
    f(2^j x) ∈ V_j iff f(x) ∈ V_0, for all equation image and equation image;

In particular, V_0 ⊂ V_1, so ϕ must satisfy the following equation, known as the scaling equation:

  • equation image

The mother wavelet ψ also satisfies the equation

  • equation image

where N is an arbitrary odd integer.

From now on, we consider the wavelet decomposition given in terms of a coarsest scale space V0 generated by translations of the father wavelet ϕ, i.e. j0 = 0. The discrete wavelet transform of equation image leads to a vector of wavelet coefficients [ck, dj,k] such that

  • equation image

where equation image with, as before, ϕ being the so-called father wavelet, whose dilations and translations generate the bases of the closed subspaces equation image, equation image, involved in the multiresolution analysis of equation image, and equation image are the translations at resolution level j of the mother wavelet ψ, generating a basis of the closed subspace W_j, for each equation image. Note that, for each equation image, equation image in a multiresolution analysis of equation image.

It is also convenient to point out that a Riesz basis equation image in equation image has a unique dual Riesz basis equation image such that equation image, and any function equation image can be represented as

  • equation image

The set equation image is called a biorthogonal sequence.

Further information about discrete wavelet transform and wavelets theory can be found, for example, in Cohen et al. (1992) and Vidakovic (2006), and the references therein.
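For the Haar basis used in the simulation study, the discrete wavelet transform reduces to repeated pairwise sums and differences. The Python sketch below is an illustration under stated assumptions: it uses the orthonormal Haar filters with scaling factor 1/√2, requires a signal length divisible by 2^levels, and returns the scaling coefficients together with the detail coefficients ordered from the coarsest to the finest level. The function names and the toy signal are hypothetical.

```python
import math


def haar_step(values):
    """One analysis step: pairwise scaled sums (approx) and differences (detail)."""
    r = 1.0 / math.sqrt(2.0)
    half = len(values) // 2
    approx = [(values[2 * k] + values[2 * k + 1]) * r for k in range(half)]
    detail = [(values[2 * k] - values[2 * k + 1]) * r for k in range(half)]
    return approx, detail


def haar_dwt(signal, levels):
    """Orthonormal Haar transform: (scaling coefficients, details coarse-to-fine)."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)  # keep the coarsest level first
    return approx, details


def haar_idwt(approx, details):
    """Inverse transform: rebuild the signal from coarse to fine levels."""
    r = 1.0 / math.sqrt(2.0)
    current = list(approx)
    for detail in details:
        finer = []
        for a, d in zip(current, detail):
            finer.append((a + d) * r)
            finer.append((a - d) * r)
        current = finer
    return current


toy_signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
scaling, wavelet_details = haar_dwt(toy_signal, 3)
rebuilt = haar_idwt(scaling, wavelet_details)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared signal values, and truncating the finest detail levels gives the dimension reduction referred to in the paper.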

A.3 Multiresolution-like-analysis

Let X(t), t ∈ [0, S], be a real zero-mean second-order stochastic process. Under weak regularity conditions, the covariance function CX(s, t) = E[X(t)X(s)]−E[X(t)]E[X(s)] can be factorized as follows (see Ruiz-Medina et al. 2003)

  • equation image

where tX denotes the kernel of the integral operator equation image such that the associated covariance operator RX satisfies

  • equation image

The wavelet transform of X(t) leads to a sequence of correlated random wavelet coefficients. To avoid redundancy in such coefficients, the random wavelet-vaguelette decomposition of a random signal is considered (see Angulo and Ruiz-Medina, 1999). The transformed scaling and wavelet bases are constructed as

  • equation image

where equation image is an orthogonal wavelet basis of L^2([0, S]). Here, Γ_0 and equation image, equation image, denote the sets of translations needed for covering the interval [0, S] at each resolution level. The dual, biorthogonal, basis of (A1) is then defined as

  • equation image

For an orthogonal wavelet basis, such as the Haar basis, the following biorthogonality condition for the associated wavelet-vaguelette functions then holds:

  • equation image

The projection of X on the above biorthogonal bases leads to the wavelet-vaguelette (i.e. wavelet-like) decomposition

  • equation image

where the random projections equation image, and equation image are uncorrelated and, thus, independent in the Gaussian case. Note that the above wavelet-vaguelette decomposition holds, in particular, for fractal processes with high structural local variation, since they are continuous but not differentiable (see Ruiz-Medina et al., 2003).
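The reason why the projections are uncorrelated can be sketched in a few lines. The derivation below is an illustration under assumptions that are not all recoverable from the text: the covariance operator is taken to factorize as R_X = T_X T_X^* with T_X invertible, the vaguelette functions are taken to be γ_a = T_X ψ_a for an orthonormal wavelet basis {ψ_a}, and the dual functions are taken to be γ̃_a = (T_X^*)^{-1} ψ_a. The index a stands for any scaling or wavelet index.

```latex
% Assumed setting: R_X = T_X T_X^{*}, \gamma_a = T_X \psi_a,
% \tilde{\gamma}_a = (T_X^{*})^{-1} \psi_a, with \{\psi_a\} orthonormal.
\begin{align*}
\langle \gamma_a, \tilde{\gamma}_b \rangle
  &= \langle T_X \psi_a, (T_X^{*})^{-1} \psi_b \rangle
   = \langle \psi_a, T_X^{*} (T_X^{*})^{-1} \psi_b \rangle
   = \langle \psi_a, \psi_b \rangle = \delta_{a,b}, \\
E\left[ \langle X, \tilde{\gamma}_a \rangle \langle X, \tilde{\gamma}_b \rangle \right]
  &= \langle R_X \tilde{\gamma}_a, \tilde{\gamma}_b \rangle
   = \langle T_X T_X^{*} (T_X^{*})^{-1} \psi_a, (T_X^{*})^{-1} \psi_b \rangle \\
  &= \langle T_X \psi_a, (T_X^{*})^{-1} \psi_b \rangle
   = \langle \psi_a, \psi_b \rangle = \delta_{a,b}.
\end{align*}
```

The first line is the biorthogonality condition, and the second shows that, under these assumptions, the random coefficients of a zero-mean process have unit variance and zero cross-covariance.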

References

  • Aach, J. and Church, G. M. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics 17, 495–508.
  • Adler, R. J. (1981). The Geometry of Random Fields. Wiley, Chichester.
  • Angulo, J. M. and Ruiz-Medina, M. D. (1999). Multiresolution approximation to the stochastic inverse problem. Advances in Applied Probability 31, 1039–1057.
  • Berlinet, A., Biau, G. and Rouvière, L. (2008). Functional supervised classification with wavelets. Annales de l'ISUP 52, 61–80.
  • Breyne, P. and Zabeau, M. (2001). Genome-wide expression analysis of plant cell cycle modulated genes. Current Opinion in Plant Biology 4, 136–142.
  • Briones, M. R. S. and Bosco, F. (2009). Decoherence in yeast cell populations and its implications for genome-wide expression noise. Genetics and Molecular Research 8, 47–51.
  • Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis 92, 24–41.
  • Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J. and Davis, R. W. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2, 65–73.
  • Cho, R. J., Huang, M., Campbell, M. J., Dong, H., Steinmetz, L., Sapinoso, L., Hampton, G., Elledge, S. J., Davis, R. W. and Lockhart, D. J. (2001). Transcriptional regulation and function during the human cell cycle. Nature Genetics 27, 48–54.
  • Cohen, A., Daubechies, I. and Feauveau, J. C. (1992). Biorthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics 45, 485–560.
  • Dobson, A. J. and Barnett, A. G. (2008). An Introduction to Generalized Linear Models. Chapman & Hall, New York.
  • Draghici, S. (2003). Data Analysis Tools for DNA Microarrays. Chapman & Hall, New York.
  • Eisen, M., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95, 14863–14868.
  • Ewens, W. J. and Grant, G. R. (2005). Statistical Methods in Bioinformatics: An Introduction (Statistics for Biology and Health). Springer, New York.
  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. Chapman & Hall, London.
  • Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.
  • Hall, P., Poskitt, D. and Presnell, B. (2001). A functional data-analytic approach to signal discrimination. Technometrics 43, 1–9.
  • Hestilow, T. J. and Huang, Y. (2009). Clustering of gene expression data based on shape similarity. Journal on Bioinformatics and Systems Biology. DOI: 10.1155/2009/195712.
  • Holter, N. S., Maritan, A., Cieplak, M., Fedoroff, N. V. and Banavar, J. R. (2001). Dynamic modeling of gene expression data. Proceedings of the National Academy of Sciences 98, 1693.
  • James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregular sampled curves. Journal of the Royal Statistical Society: Series B 63, 533–550.
  • James, G. M., Wang, J. and Zhu, J. (2009). Functional linear regression, that's interpretable. Annals of Statistics 37, 2083–2108.
  • Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
  • Kelbert, M., Leonenko, N. N. and Ruiz-Medina, M. D. (2005). Fractional random fields associated with stochastic fractional heat equations. Advances in Applied Probability 37, 108–133.
  • Kim, S., Kim, J. K. and Choi, S. (2008). Independent arrays or independent time courses for gene expression time series data analysis. Neurocomputing 71, 2377.
  • Klevecz, R. R. (2000). Dynamic architecture of the yeast cell cycle uncovered by wavelet decomposition of expression microarray data. Functional and Integrative Genomics 1, 186–192.
  • Laub, M. T., McAdams, H. H., Feldblyum, T., Fraser, C. M. and Shapiro, L. (2000). Global analysis of the genetic network controlling a bacterial cell cycle. Science 290, 2144–2148.
  • Leng, X. and Müller, H. G. (2006a). Classification using functional data analysis for temporal gene expression data. Bioinformatics 22, 68–76.
  • Leng, X. and Müller, H. G. (2006b). Time ordering of gene co-expression. Biostatistics 7, 569–584.
  • Liu, X. L. and Müller, H. G. (2003). Modes and clustering for time-warped gene expression profile data. Bioinformatics 19, 1937–1944.
  • McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, London.
  • Müller, H. (1987). Weighted local regression and kernel methods for nonparametric curve fitting. Journal of the American Statistical Association 82, 231–238.
  • Müller, H. G. (2005). Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics 32, 223–240.
  • Müller, H. G., Chiou, J. M. and Leng, X. (2008). Inferring gene expression dynamics via functional regression analysis. BMC Bioinformatics, 60.
  • Müller, H. G. and Stadtmüller, U. (2005). Generalized functional linear models. Annals of Statistics 33, 774–805.
  • Peng, X., Karuturi, R. K., Miller, L. D., Lin, K., Jia, Y., Kondu, P., Wang, L., Wong, L. S., Liu, E. T., Balasubramanian, M. K. and Liu, J. (2005). Identification of cell cycle-regulated genes in fission yeast. Molecular Biology of the Cell 16, 1026–1042.
  • Picard, R. and Cook, D. (1984). Cross-validation of regression models. Journal of the American Statistical Association 79, 575–583.
  • Rice, J. and Wu, C. (2000). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57, 253–259.
  • Ruiz-Medina, M. D., Angulo, J. M. and Anh, V. V. (2003). Fractional generalized random fields on bounded domains. Stochastic Analysis and Applications 21, 465–492.
  • Rustici, G., Mata, J., Kivinen, K., Lió, P., Penkett, C. J., Burns, G., Hayles, J., Brazma, A., Nurse, P. and Bähler, J. (2004). Identification of cell cycle-regulated genes in fission yeast. Nature Genetics 36, 809–817.
  • Schuchhardt, J., Beule, D., Wolski, E. and Eickhoff, H. (2000). Normalization strategies for cDNA microarrays. Nucleic Acids Research 28, E47.
  • Speed, T. (2003). Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall, New York.
  • Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell-cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297.
  • Van Der Laan, M. J. and Bryan, J. (2001). Gene expression analysis with parametric bootstrap. Biostatistics 2, 445–461.
  • Vidakovic, B. (2006). Statistical Modelling by Wavelets. Wiley, New York.
  • Wang, H. Q. and Huang, D. S. (2005). A gene selection algorithm based on the gene regulation probability using maximal likelihood estimation. Biotechnology Letters 27, 597–603.
  • Wu, P. S. and Müller, H. G. (2010). Functional embedding for the classification of gene expression profiles. Bioinformatics 26, 509–517.
  • Wu, H. and Zhang, J. T. (2006). Nonparametric Regression Methods for Longitudinal Data Analysis. Wiley, New Jersey.
  • Yao, F., Müller, H. G., Clifford, A. J., Dueker, S. R., Follett, J., Lin, Y., Buchholz, B. A. and Vogel, J. S. (2003). Shrinkage estimation for functional principal component scores, with application to the population kinetics of plasma folate. Biometrics 59, 676–685.
  • Zhao, X., Marron, J. S. and Wells, M. T. (2004). The functional data analysis view of longitudinal data. Statistica Sinica 14, 789–808.

Supporting Information

Detailed facts of importance to specialist readers are published as "Supporting Information". Such documents are peer-reviewed, but not copy-edited or typeset. They are made available as submitted by the authors.

Filename                          Format   Size   Description
bimj_201000135_sm_SupplInfo.zip            678K   SupplInfo
