The error rates obtained with each kernel and semi-metric are shown.
Research Article
Local wavelet-vaguelette-based functional classification of gene expression data
Article first published online: 23 DEC 2011
DOI: 10.1002/bimj.201000135
Copyright © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Biometrical Journal
Special Issue: Survival and Event History Analysis
Volume 54, Issue 1, pages 75–93, January 2012
Additional Information
How to Cite
Rincón Hidalgo, M. M. and Ruiz-Medina, M. D. (2012), Local wavelet-vaguelette-based functional classification of gene expression data. Biom. J., 54: 75–93. doi: 10.1002/bimj.201000135
Publication History
- Issue published online: 5 JAN 2012
- Article first published online: 23 DEC 2011
- Manuscript Accepted: 8 SEP 2011
- Manuscript Revised: 11 MAR 2011
- Manuscript Received: 2 JUL 2010
Keywords:
- Functional data;
- Functional logistic regression;
- Gene expression profile;
- Local wavelet-vaguelette decomposition;
- Yeast cell-cycle gene expression data
Abstract
This paper focuses on the problem of functional statistical classification of gene expression curves. A local-wavelet-vaguelette-based functional logistic regression approach is presented. This approach is especially suitable for the classification of non-stationary singular (non-differentiable) curves. The performance of the proposed methodology is illustrated by applying it to the classification of yeast cell-cycle temporal gene expression profiles. A simulation study is also carried out for comparison with other functional classification methodologies.
1 Introduction
Considerable research effort has been devoted to detecting and quantifying gene expression levels under different scenarios (biological cells). In the recent literature, microarray techniques have come to play an important role in molecular biology research (see, for example, Draghici, 2003; Speed, 2003; Ewens and Grant, 2005). Local variation is an important feature of gene expression microarray data, since under fixed experimental conditions, gene expression levels are given by different quantification values, induced by the external factors acting at the different steps of the biological and statistical procedures involved (see, for instance, Schuchhardt et al., 2000). The design of statistical classification procedures involving an optimal approximation of gene expression local variability then becomes crucial. This fact motivates the formulation of the local wavelet-vaguelette-based (LWV-based) functional statistical classification procedure proposed in this paper. The local fitting of the LWV sample projections of gene expression curves leads to a more accurate reproduction of their non-stationary local variation. In the approach presented in this paper, gene expression profiles are interpreted as the trajectories of a continuous Gaussian process. The LWV transform provides a non-redundant description of the second-order (i.e. distributional) properties of a Gaussian stochastic process, in terms of independent random coefficients. The proposed functional logistic discrimination procedure is implemented in terms of the LWV sample projections of gene expression curves. Dimension reduction is achieved by truncation of the infinite-dimensional series defining the LWV expansion.
Alternative functional statistical classification methodologies can be found in Rice and Wu (2000); Aach and Church (2001); Hall et al. (2001); James and Hastie (2001); Liu and Müller (2003); Yao et al. (2003); Zhao et al. (2004); Cardot and Sarda (2005); Müller (2005); Ferraty and Vieu (2006); Leng and Müller (2006a, b); Berlinet et al. (2008); Müller et al. (2008), among others. In particular, wavelet bases for dimension reduction have been considered in Berlinet et al. (2008), where data-driven thresholding techniques are applied in a general setting. There, the selection problem associated with the choice of an optimal and consistent discrimination procedure, from a given family of classifiers, is addressed in the context of infinite-dimensional explanatory variables and binary responses. The other references cited study alternative bases of functions. For example, Leng and Müller (2006a) apply functional logistic regression in terms of the eigenfunctions of the covariance operator of the explanatory variables, while dimension reduction is achieved in terms of normalized B-splines in Cardot and Sarda (2005), considering the exponential setting. Recently, classical Multidimensional Scaling has been applied for dimension reduction in Wu and Müller (2010), with the aim of finding a low-dimensional configuration of the observed high-dimensional subjects by retaining a given pairwise distance (dissimilarity).
Gene expression analysis plays, in particular, a crucial role in the study and control of the cell-cycle (see Cho et al., 1998; Spellman et al., 1998; Laub et al., 2000; Breyne and Zabeau, 2001; Cho et al., 2001; Rustici et al., 2004; Peng et al., 2005; Kim et al., 2008, among others). In this context, different levels of local structural variability are detected (see Klevecz, 2000, on wavelet analysis of yeast cell-cycle gene expression data and mRNA levels; Holter et al., 2001, on singular-value-based gene expression dynamic models associated with the cell-cycle; Briones and Bosco, 2009, on genome-wide expression noise and decoherence in global gene expression profiles, among others).
Frequently, misclassification problems arise when genes related to different phases display similar expression patterns, e.g. genes associated with phase transitions. Optimal projection methods must then be selected to obtain a reliable reproduction of the non-homogeneous local variation properties of gene profiles. For example, in Leng and Müller (2006a), gene expression curves with high local variation levels are identified as outliers when Functional Principal Component Analysis (FPCA) is applied. The distributional characteristics of the corresponding random projections do not depend on time (global model fitting). Therefore, the local structural variability induced by the non-homogeneous variance of the underlying stochastic model is not reflected in the approximation provided by FPCA. In contrast, the local fitting performed with our approach, where sample projections depend on time, leads to a closer reproduction of local variation properties. In particular, high local singularity in gene expression curves can be suitably processed in terms of the transformed wavelet bases involved in the LWV decomposition. These two key features of our approach lead to an improvement in the misclassification rate of G1-phase genes obtained in Leng and Müller (2006a) by applying FPCA to the data set provided by Spellman et al. (1998) (see Section 3). Note that the G1 phase constitutes a promising target for the research and treatment of cancer. Additionally, important genes associated with G1 regulation have been shown to play a key role in proliferation, differentiation, oncogenic transformation and programmed cell death (apoptosis). The classification results derived from the application of our approach to the data set obtained by Spellman et al. (1998) are compared with the discrimination results established from functional nonparametric supervised classification (see Section 3.1), in terms of different distances and kernels as given in Ferraty and Vieu (2006).
A simulation study is also developed, providing an illustration of different model scenarios where the proposed LWV functional classification can or cannot outperform the nonparametric functional discrimination methodology (see Section 3.2).
1.1 Approach
As commented above, the aim of this paper is to provide a functional statistical classification procedure for discrimination between genes related to the G1 phase and genes associated with other regulation phases (non-G1 genes). From now on, genes related to the G1 phase will be referred to as genes in the G1 group, and genes associated with the remaining regulation phases as genes in the G0 group. The main steps involved in the proposed LWV-based functional logistic classification methodology are the following:
Step 1. Estimate the mean and covariance functions from the observed gene expression profiles.
Step 2. Compute the empirical LWV decomposition of gene expression curves.
Step 3. Estimate, by iterated weighted least squares, the LWV coefficients of the parameter function associated with the infinite-dimensional generalized linear model considered.
Step 4. Compute the parametric estimate of the conditional probability of belonging to G1 phase for each gene expression curve in the sample. Its classification is then established, according to prior probabilities given for each group.
2 Methods
Gene expression profiles are interpreted as independent realizations of a mean-square integrable Gaussian process X(t) on [0, S]. A functional sample of M gene expression curves is considered. Specifically, the observation of the i-th sample function at time t_h is denoted by X_i(t_h), for h = 1,…,n, and i = 1,…,M. The mean function μ_X(t) is approximated by a local polynomial kernel estimator
, computed from the longitudinal data (see Appendix, and Müller, 1987; Wu and Zhang, 2006). The Epanechnikov kernel is chosen for the estimation of the mean function, and the spherical Epanechnikov kernel for the estimation of the covariance function. In particular, a multivariate locally weighted least squares kernel estimator
of the covariance function CX(s, t), based on spherical Epanechnikov kernel (see Fan and Gijbels, 1996, and Appendix), is computed from the empirical covariance function
given by
for h≠m, h, m = 1,…,n. Such an estimator,
, is evaluated on a grid with N = 2^p,
, equally spaced points in [0, S]. The diagonal elements σ^2(t) = CX(t, t), t ∈ [0, S], are approximated by interpolation in terms of
, t ∈ [0, S].
Remark 2.1 Parameter p is selected according to the local regularity properties (respectively, local singularity features) of the gene expression profiles. Thus, large values of p are needed when function X displays high local singularity, i.e. high local variability that must be collected at the microscale level. The regularity properties of the wavelet functions, as well as the number of their vanishing moments also affect the selection of p. Without prior information on the local variability features of X, p can be chosen by cross-validation. On the other hand, in the kernel-based nonparametric estimation of the covariance function of the functional explanatory variables (gene expression profiles), the bandwidth parameter must also reflect the local regularity (respectively, local singularity features) of the data. That is, small values of the bandwidth parameter are selected when high local variability is displayed by the gene expression curves. Note that, in the Gaussian case considered, there exists a direct connection between local regularity properties of the sample curves and local regularity of the covariance function of the explanatory variables (see Adler, 1981). Parameter p and the bandwidth are then closely related, since when p increases the bandwidth must decrease. In the particular case where the mother wavelet displays similar regularity properties, and vanishing moment conditions to the Epanechnikov kernel, an explicit inverse relationship can be obtained between p and the bandwidth parameter. When prior information on the local regularity of the values of the functional explanatory variables is not available, cross-validation methodology is applied for testing the bandwidth parameter fitting.
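The local polynomial estimation of the mean function described above can be illustrated with a minimal sketch. It assumes a local linear fit (degree one), a common set of observation times for all curves, and a fixed bandwidth supplied by the user; the function names, the synthetic data and the bandwidth value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear_mean(t_obs, X, t_grid, bandwidth):
    """Local linear estimate of the mean function from pooled observations.

    t_obs  : (n,) observation times shared by all curves (assumption)
    X      : (M, n) array, row i holds curve i observed at t_obs
    t_grid : (N,) points where the mean is evaluated
    """
    M, _ = X.shape
    t_all = np.tile(t_obs, M)   # pooled times, same order as X.ravel()
    y_all = X.ravel()           # pooled observations
    mu_hat = np.empty(len(t_grid))
    for g, t0 in enumerate(t_grid):
        d = t_all - t0
        w = epanechnikov(d / bandwidth)
        design = np.column_stack([np.ones_like(d), d])
        # Weighted least squares: minimise sum_i w_i (y_i - a - b * d_i)^2
        WX = design * w[:, None]
        coef = np.linalg.lstsq(design.T @ WX, WX.T @ y_all, rcond=None)[0]
        mu_hat[g] = coef[0]     # the local intercept estimates the mean at t0
    return mu_hat

# Usage on synthetic curves whose common mean is the line 2 + 3t.
t_obs = np.linspace(0.0, 1.0, 18)
t_grid = np.linspace(0.0, 1.0, 64)
X_demo = np.tile(2.0 + 3.0 * t_obs, (5, 1))
mu_hat = local_linear_mean(t_obs, X_demo, t_grid, bandwidth=0.2)
```

A smaller bandwidth follows local variation more closely at the cost of higher variance, which is the trade-off discussed in Remark 2.1.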
2.1 Multiresolution-like analysis
The LWV decomposition is based on the factorization of the covariance function. The empirical eigenvalues
, and the corresponding empirical eigenvectors
, of the estimated covariance matrix
, allow us to construct the empirical kernel
factorizing the covariance function
as follows (see Appendix where the description of the theoretical approach is given):
for h, m = 1,…,N. The kernel lX defining the inverse
of operator
, with kernel tX, can then be formally approximated as
for h,m = 1,…,N. The empirical LWV functions are then computed in terms of kernels
,
, and a given orthonormal wavelet basis. We have chosen the Haar system, with father wavelet ϕ(x) = I[0,1)(x) and mother wavelet ψ(x) = I[0,1/2)(x)−I[1/2,1)(x) (see Vidakovic, 2006). In the following, for certain S>0, and for each x ∈ [0,S], denote, for
, by
the translated Haar father wavelet ϕ at resolution level j0, and, for k = 1,…,N(j), j≥j0, denote by
the translated mother wavelet ψ at resolution level j (see Appendix A.2).
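As an illustration of the Haar system just introduced, the following sketch evaluates the father and mother wavelets and assembles a discrete orthonormal Haar basis on a dyadic grid. It assumes the unit interval instead of [0, S], the standard normalization 2^(j/2) ψ(2^j x − k) with k = 0,…,2^j − 1, and a 1/√N scaling so that the sampled functions are orthonormal as vectors; the conventions of Appendix A.2 may differ.

```python
import numpy as np

def haar_father(x):
    """Haar father wavelet: indicator of [0, 1)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0.0) & (x < 1.0), 1.0, 0.0)

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), zero elsewhere."""
    x = np.asarray(x, dtype=float)
    return (np.where((x >= 0.0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x < 1.0), 1.0, 0.0))

def haar_basis_on_grid(p, j0=0):
    """Discrete Haar basis on N = 2**p equally spaced points of [0, 1).

    Rows: the 2**j0 translated father wavelets at level j0, followed by the
    translated mother wavelets at levels j = j0, ..., p - 1 (N rows in total).
    """
    N = 2**p
    x = np.arange(N) / N
    rows = [2**(j0 / 2) * haar_father(2**j0 * x - k) for k in range(2**j0)]
    for j in range(j0, p):
        rows += [2**(j / 2) * haar_mother(2**j * x - k) for k in range(2**j)]
    return np.array(rows) / np.sqrt(N)

# Usage: an 8 x 8 orthonormal matrix for p = 3.
B = haar_basis_on_grid(3)
```

Because the rows are orthonormal, the coefficients of a sampled curve x are B @ x and the curve is recovered as B.T @ (B @ x).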
Remark 2.2. The wavelet basis must be selected according to the local regularity and temporal correlation properties characterizing gene expression profiles. We consider the Daubechies wavelet family because of its desirable properties in relation to our approach. Specifically, these wavelets are orthogonal and compactly supported. The first feature ensures that the LWV coefficients are uncorrelated (see Appendix), providing a non-redundant description of gene expression curves. Moreover, the compact support of these wavelet functions avoids border effects in our analysis. Finally, their local regularity properties and moment conditions (support length) are easily identified in terms of a single parameter, which is a crucial feature in our approach for the appropriate selection of a wavelet basis. Specifically, we consider the empirical Hölder spectra of gene expression profiles to identify the suitable order V of vanishing moments characterizing the basis to be selected from the Daubechies wavelet family. Throughout Section 3, in the statistical analysis of the data set obtained by Spellman et al. (1998) and in the simulation study developed in Section 3.2, the Haar wavelet basis is considered. Therefore, we will refer to this basis in the subsequent development.
For each t ∈ [0, S], and h = 1,…,N, denoting, as before, by j0, the resolution level selected as coarsest scale, we have
In matrix form, for each t ∈ [0, S], we denote, for
, by
the vector with entries 
, given by the product of the matrix
, with
, for h,m = 1,…,N, and the vector
, with
, for m = 1,…,N. Similarly, for j = j0,…,p−1, denote by
the matrix with entries
, for h = 1,…,N, k = 0,…,2^j − 1, where
,
, and where, for each
,
is the vector with entries l_{m,k} = ψ_{j,k}(t_m), for m = 1,…,N. Additionally, for each
, the vector
has entries
,
, and, for each j = j0,…,p−1, the matrix
has entries
, for m = 1,…,N, and k = 0,…,2^j − 1, with
. The following local empirical coefficients are computed at time t ∈ [0, S], for each sample curve Xi,
(1)
where
denotes the locally re-scaled empirical dual Riesz basis of
(see Appendix). The M sample curves can then be approximated in terms of the following empirical LWV decomposition: For i = 1,…,M, and for each t ∈ [0, S]
(2)
This decomposition will be considered in the implementation of the functional logistic regression in the following section.
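The covariance factorization at the start of this subsection can be sketched numerically. The example below assumes that the factorizing kernel is the symmetric square root built from the empirical eigenvalues and eigenvectors, and that the inverse kernel uses the reciprocal square roots; this is one natural choice made here for illustration only, and the Appendix gives the construction actually used. The test covariance matrix and the names `factorize_covariance`, `T_hat` and `L_hat` are hypothetical.

```python
import numpy as np

def factorize_covariance(C, tol=1e-10):
    """Return (T, L) with T @ T.T ~ C and L the (pseudo-)inverse of T.

    Assumption: symmetric square-root factorization from the eigenpairs of C.
    Eigenvalues below tol * (largest eigenvalue) are discarded.
    """
    C = 0.5 * (C + C.T)                  # enforce symmetry numerically
    lam, V = np.linalg.eigh(C)           # empirical eigenvalues and eigenvectors
    keep = lam > tol * lam.max()
    lam, V = lam[keep], V[:, keep]
    T = (V * np.sqrt(lam)) @ V.T         # V diag(sqrt(lam)) V^T
    L = (V / np.sqrt(lam)) @ V.T         # V diag(1 / sqrt(lam)) V^T
    return T, L

# Usage with a hypothetical covariance matrix on a 16-point grid.
t = np.linspace(0.0, 1.0, 16)
C_demo = np.exp(-3.0 * np.abs(t[:, None] - t[None, :]))
T_hat, L_hat = factorize_covariance(C_demo)
```

Applying `L_hat` to centered curve values yields coordinates with (approximately) identity covariance, which is the property that makes the projected coefficients uncorrelated.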
2.2 Functional logistic regression
Generalized Linear Models (GLM) (see, for example, McCullagh and Nelder, 1989; Dobson and Barnett, 2008) constitute a flexible extension of classical linear models where the response variables, Y1,…,YM, are independent and identically distributed (i.i.d.), with probability distribution in the exponential family. The logistic regression model is defined in terms of a set of parameters β1, …,βH, the explanatory variables
, the logit link function g(x) = log(x/(1−x)), such that E(Yi∣xi1,…,xiH) = μiY = g^{-1}(ηi), where ηi = α + ∑ xijβj, i = 1,…,M, with α being a constant. The response variable Yi is then conditionally distributed as a Bernoulli with mean μiY and variance μiY(1−μiY), for i = 1,…,M.
In the functional formulation of the above-described logistic regression model (see, for example, Müller and Stadtmüller, 2005), the parameter β is a square integrable function, and the explanatory variables Xi, i = 1,…,M, are functional variables with values in a separable Hilbert space H, e.g. H = L2([0, S]), with [0, S], as before, being a real interval, and
(3)
where α is a constant. The linear predictors ηi, i = 1,…,M, define the conditional mean and variance, E(Yi∣Xi(t)) = μiY = g^{-1}(ηi) and Var(Yi∣Xi(t)) = σ^2(μiY) = μiY(1−μiY), for each binary response variable Yi, i = 1, …,M, through the logit function g. Thus,
where the errors ei, i = 1,…,M, are considered to be independent random variables with zero-mean and finite variance.
Due to the square integrability of β, this parameter function admits the local decomposition:
(4)
in terms of the locally scaled dual Riesz bases
and
. In the development below, the local Fourier coefficients of parameter function β, with respect to the empirical locally scaled basis
, will be denoted as
, for
, and
, for
,
, and for each t ∈ [0, S]. The previous derived approximations of
from (2), and of β(t) from (4), in terms of the empirical LWV dual Riesz bases, lead to the following local approximation
of ηi, for i = 1,…,M,
The functional model is then locally reduced to a generalized linear model (see McCullagh and Nelder, 1989; Dobson and Barnett, 2008), for each t ∈ [0, S], where the parameters
are estimated by solving the score equation:
(5)
where σ^2(μiY) = μiY(1−μiY),
denotes the transpose of the vector of Fourier coefficients of Zi(t) with respect to the empirical LWV basis
. Define now the vector
(6)
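For a fixed t, solving the score equation of the reduced generalized linear model by iterated weighted least squares can be sketched as follows. The example treats a generic matrix of projection coefficients `Z` (one row per curve) as the design, uses the canonical logit link so that the score reduces to D^T (y − μ) = 0, and runs on synthetic data; the function name, the stopping rule and the data are assumptions made for illustration.

```python
import numpy as np

def logistic_irls(Z, y, max_iter=100, tol=1e-8):
    """Logistic regression with intercept fitted by iterated weighted least squares.

    Z : (M, H) matrix of explanatory coefficients, y : (M,) binary responses.
    Returns theta = (alpha, beta_1, ..., beta_H).
    """
    M = Z.shape[0]
    D = np.column_stack([np.ones(M), Z])           # design with intercept column
    theta = np.zeros(D.shape[1])
    for _ in range(max_iter):
        eta = D @ theta                            # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))            # inverse logit link
        w = np.clip(mu * (1.0 - mu), 1e-10, None)  # variance function mu(1 - mu)
        z = eta + (y - mu) / w                     # working response
        theta_new = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * z))
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    return theta

# Usage on synthetic, non-separable data.
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(200, 2))
eta_true = 0.5 + Z_demo @ np.array([1.0, -1.0])
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true)))
theta_hat = logistic_irls(Z_demo, y_demo)
```

In the local setting described above, such a fit would be repeated at each grid point t, which is why convergence is monitored separately for every point on the grid.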
A prior probability p0 is considered for G0 memberships, and similarly, a prior probability p1 is considered for G1 memberships. Thus, if for the i-th observation Xi(t),
, the sample curve Xi(t) is a member of G1. Otherwise, it belongs to G0. Here,
where
(7)
Remark 2.3 The approximation of p1 and p0 can be achieved from the respective mean proportions of sample genes clustered into each group, G1 and G0, after applying supervised or unsupervised clustering over a set of learning samples. The main drawback in the application of supervised clustering is that two target or reference gene expression groups must be previously established. The gene target subset selection problem has been studied from different perspectives. For example, Van Der Laan and Bryan (2001) propose a deterministic rule applied to the parameters of the gene expression distribution, previously estimated by parametric bootstrap, under the Gaussian assumption. More sophisticated methods related to sample-to-group regulation probability estimation can also be considered (see Wang and Huang, 2005, who apply a maximum likelihood criterion for estimation) in the supervised organization of gene expression profiles into two groups.
When no prior information is available on the number of members of each group, G1 and G0, unsupervised clustering can be applied (see Kaufman and Rousseeuw, 1990; Eisen et al., 1998, among others). In Section 3.2, k-means clustering, with k = 2, based on the standard correlation coefficient, will be implemented to organize the genes with similar expression patterns into the G1 and G0 groups. Specifically, denoting by N0 the number of genes clustered into the G0 group, and by N1 the number of genes clustered into the G1 group, the respective probabilities p0 and p1 of belonging to each group are approximated as follows:
(8)
Note that unsupervised clustering, as an exploratory data analysis tool, provides a preliminary organization of the data for subsequent statistical analysis (or discrimination). It is worth noting that any unsupervised clustering based on shape similarity can be applied for the approximation of the group probabilities, p0 and p1, from Eq. (8) in Section 3.2 (see, for example, Hestilow and Huang, 2009).
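A minimal sketch of the clustering step follows. It assumes that clustering "based on the standard correlation coefficient" can be carried out by centering each curve and scaling it to unit norm, since the squared Euclidean distance between such curves equals 2(1 − correlation); plain Lloyd iterations with k = 2 are then applied. The function name, the random initialization and the synthetic curves are illustrative assumptions.

```python
import numpy as np

def correlation_kmeans(X, k=2, n_iter=50, seed=0):
    """k-means on row-standardised curves (correlation-based dissimilarity)."""
    rng = np.random.default_rng(seed)
    Z = X - X.mean(axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Usage: 6 curves shaped like sin and 4 shaped like -sin, plus small noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 18, endpoint=False)
base = np.sin(2.0 * np.pi * t)
X_demo = np.vstack([base + 0.05 * rng.normal(size=(6, 18)),
                    -base + 0.05 * rng.normal(size=(4, 18))])
labels = correlation_kmeans(X_demo)
counts = np.bincount(labels, minlength=2)
proportions = counts / counts.sum()   # cluster proportions, usable as group probabilities
```

Which cluster is labeled G1 and which G0 is not determined by the algorithm and has to be decided from the expression patterns of the clusters.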
3 Analysis of yeast cell-cycle gene expression profiles
We consider the data set obtained by Spellman et al. (1998) consisting of M = 90 gene expression profiles (α factor synchronized) involved in the yeast cell-cycle regulation. The gene expression is measured every 7 min between 0 and S = 119 min (both time instants included). Thus, n = 18 observations are available for each gene. Since it is known that 44 of these genes are related to G1 phase regulation, and 46 to the S, S/G2, G2/M and M/G1 phases, Eq. (8) is applied with N0 = 46, and N1 = 44.
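The figures quoted in this paragraph can be checked with a few lines. The sketch assumes that Eq. (8) takes the natural form of relative group frequencies, p0 = N0/(N0 + N1) and p1 = N1/(N0 + N1); that form is an assumption made here, and the variable names are illustrative.

```python
import numpy as np

# Sampling design: one measurement every 7 min from 0 to 119 min inclusive.
t_obs = np.arange(0, 120, 7)
n_obs = t_obs.size            # number of observations per gene

# Group sizes stated in the text.
N0, N1 = 46, 44               # non-G1 genes and G1 genes
M_genes = N0 + N1

# Assumed form of Eq. (8): relative group frequencies.
p0 = N0 / M_genes
p1 = N1 / M_genes
```

Under this assumption the prior probabilities are approximately 0.511 for G0 and 0.489 for G1.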
Figure 1 displays the original data with their approximation in terms of the LWV decomposition, considering a grid with N = 64 = 2^6 equally spaced time points. Convergence of the iterated weighted least squares algorithm is achieved for every point t on the grid, after 100 iterations controlled by the deviance. Estimate
for N=64 equally spaced time points, and the mean vector over time,
, are displayed in Fig. 1.

Figure 1. Left panel: Temporal gene expression profiles of yeast cell cycle. Right panel: Reconstruction of the temporal gene expression profiles in left panel using an orthogonal expansion from wavelet-vaguelette transform. Dashed lines: Genes peaking in G1 phase; Gray solid lines: Genes peaking in non-G1 phase; Black solid line: Estimated mean curve.
In order to measure the accuracy of the model, the Leave-One-Out Cross-Validation (LOOCV) error rate is obtained (see Picard and Cook, 1984). When the i-th gene is left out, the estimates of the mean function, the covariance function and the parameter vector β are computed from a functional sample of 89 gene expression curves. These estimates are tested considering the approximation η−i of η, based on the training sample formed by the 89 gene expression curves that remain after removing the i-th gene. For each t ∈ [0, S], the LWV coefficients of the i-th gene (the validation data) are computed as follows:
where
,
and
,
,
, are constructed from the estimated mean function
, and the empirical covariance function computed from the training set constituted by 89 genes (see Eq. (1)). This procedure is repeated with every gene, i.e. for
if
, the i-th gene is member of G1, otherwise it is considered as a member of G0. Here,
is obtained from the training sample,
is defined as in (6), considering the score equation (5) based on the training set, and
is given as in Eq. (7) from the i-th gene, for i = 1,…,M. The LOOCV error rate is computed as the ratio of the total number of genes misclassified under LOOCV to the total number of genes. The accuracy of the model is reflected in an LOOCV error rate CVE = 0.1, corresponding to three misclassified G1 genes plus six misclassified G0 genes (9 of 90).
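The leave-one-out procedure described above can be written generically. In the sketch below, `fit` and `predict` are placeholders for any classifier; a nearest-centroid rule is used purely as a stand-in so that the example runs, and it is not the LWV-based classifier. All names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def loocv_error_rate(X, y, fit, predict):
    """Fraction of observations misclassified when each one is held out in turn."""
    M = len(y)
    errors = 0
    for i in range(M):
        train = np.arange(M) != i          # all curves except the i-th
        model = fit(X[train], y[train])    # re-estimate everything on the training set
        errors += int(predict(model, X[i]) != y[i])
    return errors / M

# Stand-in classifier (assumption): assign to the nearest class mean.
def fit_centroids(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroid(model, x):
    return min(model, key=lambda label: np.linalg.norm(x - model[label]))

# Usage on two well-separated synthetic groups.
X_demo = (np.vstack([np.zeros((5, 3)), np.full((5, 3), 10.0)])
          + 0.01 * np.arange(30).reshape(10, 3))
y_demo = np.repeat([0, 1], 5)
cve_demo = loocv_error_rate(X_demo, y_demo, fit_centroids, predict_centroid)

# Error rate reported in the text: 3 + 6 misclassified genes out of 90.
cve_reported = (3 + 6) / 90
```

The key point is that every estimate (mean, covariance, regression parameters) is recomputed inside the loop, so the held-out curve never influences the model that classifies it.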
The three G1 genes misclassified by our approach are YCL055W, YDL227C and YJL092W, whereas the approach presented in Leng and Müller (2006a) misclassifies five: YCL055W, YDR113C, YDR356W, YJL092W and YDL055C. Regarding the genes misclassified by both approaches, YCL055W (KAR4) and YJL092W (SRS2), Spellman et al. (1998) comment that the first one should be “induced by α factor and so it is very strongly expressed at the beginning of the α factor experiment”. Leng and Müller (2006a) suggest that SRS2 is close to the trajectories of S genes, and KAR4 is close to M/G1 phase regulation, as can be seen in Fig. 3. On the other hand, the gene profile misclassified only by our approach, YDL227C, is upregulated in late G1 phase. Its association with a phase transition of the cell cycle thus makes its discrimination more difficult, since its expression pattern is similar to that of S genes. Moreover, a missing expression value of this profile at time t = 7 min increases the difficulty of the discrimination procedure. This affects our classification results more strongly, since our (time-dependent) local fitting of the wavelet-vaguelette sample projections of gene profiles is less robust against missing data. (Note that in Leng and Müller (2006a) gene sample projections are independent of time.) This is the price we have to pay for improving the approximation of the local variation of gene trajectories, in order to obtain a better discrimination based on expression patterns.
3.1 Comparison with a functional nonparametric supervised classification
In this section, the nonparametric functional statistical classification methodology introduced in Ferraty and Vieu (2006) is applied to the data set obtained by Spellman et al. (1998). This classification procedure is based on the Bayes rule. Given a functional object
in E, a Euclidean space, the purpose is to estimate the posterior probabilities of belonging to each group Gj, for j = 1,…,g, i.e. to estimate the posterior probabilities of belonging to each element of a given set of groups
Such probabilities are defined by
, for j = 1,…,g. Once these posterior probabilities are estimated
, the classification rule consists of assigning an incoming functional observation
to the class with highest estimated posterior probability:
In Ferraty and Vieu (2006), the following nonparametric kernel estimators pGjLCV(·),
of the probabilities of belonging to each group are introduced:
where, for i = 1,…,M, x_i is the gene expression time course of the i-th gene, consisting of 18 observations, as pointed out before. Here, K is an asymmetrical kernel, d is a semi-metric, i0 = arg min_{i = 1,…,M} d(x, x_i), and h_LCV(x_{i0}) is the bandwidth corresponding to the optimal number of neighbors at x_{i0}, obtained by the cross-validation procedure described in Ferraty and Vieu (2006). For j = 1,…,g, the estimator of
is a functional version of a local polynomial kernel smoother considering a zero-degree polynomial, i.e. the Nadaraya–Watson estimator (see Appendix in relation to the mean and covariance estimation from local polynomial modeling). Since classification is performed with the Bayes rule, considering
we have that
>
means that the m-th observation comes from G0, otherwise it comes from G1. Table 1 shows the LOOCV error rate using quadratic, triangle and box asymmetrical kernels. Two empirical versions of the semi-metrics, given by the classical L2-metric and the principal component analysis (PCA) based semi-metric, dqPCA, with
are considered. Here, [vk]j is the j-th component of the k-th eigenvector of the covariance matrix
, corresponding to the k-th eigenvalue λk, with
. The results displayed in Table 1 show that the best performance corresponds to the semi-metric defined from PCA considering four eigenvectors, denoted as PCA4. Note that the results obtained from the application of LWV-based functional logistic classification lead to an LOOCV error rate, CVE = 0.1, which outperforms the nonparametric functional classification in all the cases studied (Fig. 2).
Table 1. LOOCV error rates (misclassified genes out of 90) obtained with each kernel and semi-metric.
| Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|
| Quadratic | 26/90 | 18/90 | 20/90 | 28/90 | 23/90 | 26/90 |
| Triangle | 27/90 | 18/90 | 20/90 | 27/90 | 24/90 | 26/90 |
| Box | 24/90 | 18/90 | 20/90 | 28/90 | 25/90 | 28/90 |
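The kernel estimator of the posterior probabilities used in this comparison can be sketched as follows. This is a simplified illustration, not the procedure of Ferraty and Vieu (2006): the number of neighbors k is fixed by the user instead of being chosen by local cross-validation, the bandwidth is taken as the midpoint between the k-th and (k+1)-th smallest distances, kernel normalizing constants are dropped because they cancel in the ratio, and the PCA semi-metric omits quadrature weights. All names and data are hypothetical.

```python
import numpy as np

# Asymmetrical kernels supported on [0, 1] (constants omitted).
KERNELS = {
    "quadratic": lambda u: np.where((u >= 0) & (u <= 1), 1.0 - u**2, 0.0),
    "triangle": lambda u: np.where((u >= 0) & (u <= 1), 1.0 - u, 0.0),
    "box": lambda u: np.where((u >= 0) & (u <= 1), 1.0, 0.0),
}

def l2_semimetric(x, X):
    """Euclidean distance between curve x and every row of X."""
    return np.sqrt(np.sum((X - x) ** 2, axis=1))

def pca_semimetric(x, X, q):
    """Distance between projections on the first q principal directions of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:q].T                              # leading eigenvectors of the covariance
    return np.sqrt(np.sum(((X - x) @ V) ** 2, axis=1))

def kernel_posteriors(x, X, y, k, kernel="quadratic", dist=l2_semimetric):
    """Nadaraya-Watson type estimates of P(group | x) with a k-neighbor bandwidth."""
    d = dist(x, X)
    d_sorted = np.sort(d)
    h = 0.5 * (d_sorted[k - 1] + d_sorted[k])  # exactly k curves fall inside the window
    w = KERNELS[kernel](d / h)
    labels = np.unique(y)
    post = np.array([w[y == g].sum() for g in labels]) / w.sum()
    return labels, post

# Usage on two synthetic groups of 4-point "curves".
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.1, (5, 4)), rng.normal(5.0, 0.1, (5, 4))])
y_demo = np.array([0] * 5 + [1] * 5)
x_new = np.full(4, 5.0)
labels, post = kernel_posteriors(x_new, X_demo, y_demo, k=3)
predicted = labels[post.argmax()]             # Bayes rule: highest posterior
labels_pca, post_pca = kernel_posteriors(
    x_new, X_demo, y_demo, k=3, kernel="box",
    dist=lambda x, X: pca_semimetric(x, X, q=2))
```

Changing `kernel` and `dist` corresponds to moving across the rows and columns of Table 1.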

Figure 2. Components of
for 64 equally spaced points t ∈ [0, 119], the coarsest line is the mean vector
. Note that the first coefficients produce the biggest influence in the response variable.

Figure 3. Profiles of misclassified G1 genes. Left panel: Gray solid lines: G1 genes; dashed lines: S genes; black solid line: misclassified gene SRS2 (YJL092W). Right panel: Gray solid lines: G1 genes; dashed lines: M/G1 genes; black solid line: misclassified gene KAR4 (YCL055W).
3.2 Simulation study
To investigate the general performance of the proposed methodology, we have carried out a simulation study in terms of some well-known families of stochastic processes. The proposed LWV-based functional logistic discrimination is applied. The results obtained are compared with those derived from the application of nonparametric supervised classification.
First, let us consider fractional Brownian motion as the basic model for generating gene expression profiles. Specifically, M gene expression profiles of length N = 2^6, from the sample paths of fractional Brownian motion with Hurst parameter 0.3, are generated. The Haar wavelet transform is then applied. The selected coarsest scale j0 (i.e. lowest resolution level) is j0 = 6. Thus, 2^6 translations of the Haar father wavelet ϕ are considered for generating the draft of each gene expression profile in the space V0 (see, for instance, James et al., 2009). The parameter function is given by:
Then, η sample values are generated from the inner product of the simulated gene expression profiles and the selected β function, for a fixed value of parameter α in (3). The probabilities
are computed. The response vector components Yi, i = 1,…,M, are obtained from Bernoulli distributions with respective probabilities πi, i = 1,…,M.
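The data-generating mechanism of this simulation can be sketched as follows. Fractional Brownian motion paths are drawn through a Cholesky factor of the covariance 0.5(s^{2H} + t^{2H} − |s − t|^{2H}), the linear predictor is approximated by a Riemann sum, and the probabilities are assumed to follow the inverse logit link of Section 2.2. The parameter function `beta_demo` and the value of α are hypothetical stand-ins, since the function used in the study is not reproduced here.

```python
import numpy as np

def fbm_paths(M, N, hurst, rng):
    """M sample paths of fractional Brownian motion on the grid t = 1/N, ..., 1."""
    t = np.arange(1, N + 1) / N
    s, u = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * (s**(2 * hurst) + u**(2 * hurst) - np.abs(s - u)**(2 * hurst))
    chol = np.linalg.cholesky(cov)
    return t, rng.standard_normal((M, N)) @ chol.T

def simulate_responses(X, t, beta, alpha, rng):
    """Binary responses from eta_i = alpha + integral of X_i(t) * beta(t) dt."""
    dt = t[1] - t[0]
    eta = alpha + (X * beta(t)).sum(axis=1) * dt   # Riemann-sum approximation
    pi = 1.0 / (1.0 + np.exp(-eta))                # assumed inverse logit link
    y = rng.binomial(1, pi)                        # Bernoulli draws
    return eta, pi, y

# Usage with the sizes quoted in the text (M = 100 curves, N = 64 points, H = 0.3).
rng = np.random.default_rng(1)
t_sim, X_sim = fbm_paths(100, 64, 0.3, rng)
beta_demo = lambda t: np.sin(2.0 * np.pi * t)      # hypothetical parameter function
eta_sim, pi_sim, y_sim = simulate_responses(X_sim, t_sim, beta_demo, 0.0, rng)
```

The grid starts at 1/N instead of 0 because the process is zero at the origin, which would make the covariance matrix singular.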
The LWV functional logistic regression and the functional nonparametric supervised classification are applied, considering functional samples of size M = 100, 500, 1000 gene expression profiles, which are generated at n = 64 time points. Specifically, the mean LOOCV error rates obtained after applying LWV functional logistic regression and nonparametric supervised classification are displayed in Table 2 for functional samples of size M = 100, and in Table 3 for functional samples of sizes M = 500 and M = 1000. In the application of the proposed LWV methodology, the group probabilities are previously estimated as the mean proportion of G1 and G0 memberships over ten generated learning samples of size 1000, after applying k-means clustering, with k = 2. The corresponding estimated values obtained are
and
.
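Table 2. Mean LOOCV error rates for the fractional Brownian motion model, functional sample size M = 100. The columns from Kernel onward refer to the nonparametric approach (Q: quadratic, T: triangle, B: box kernel).
| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 100 | 0.28 | Q | 0.39 | 0.48 | 0.46 | 0.47 | 0.47 | 0.48 |
| | | T | 0.42 | 0.47 | 0.45 | 0.47 | 0.46 | 0.48 |
| | | B | 0.4 | 0.47 | 0.46 | 0.48 | 0.47 | 0.48 |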
Table 3. Mean LOOCV error rates for the fractional Brownian motion model, functional sample sizes M = 500 and M = 1000. The columns from Kernel onward refer to the nonparametric approach (Q: quadratic, T: triangle, B: box kernel).
| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 500 | 0.13 | Q | 0.38 | 0.46 | 0.47 | 0.47 | 0.45 | 0.46 |
| | | T | 0.4 | 0.45 | 0.45 | 0.46 | 0.44 | 0.45 |
| | | B | 0.38 | 0.46 | 0.46 | 0.47 | 0.45 | 0.46 |
| M = 1000 | 0.1 | Q | 0.37 | 0.47 | 0.47 | 0.47 | 0.47 | 0.47 |
| | | T | 0.4 | 0.45 | 0.45 | 0.45 | 0.45 | 0.46 |
| | | B | 0.39 | 0.47 | 0.47 | 0.46 | 0.46 | 0.47 |
It can be observed that LWV functional logistic regression provides better results than nonparametric supervised classification, in the case where fractal expression patterns, displaying high local variability, are studied. Note that fractional Brownian motion constitutes an example of Gaussian process displaying fractal features and long-range dependence due to its self-similar behavior. The better performance of the functional LWV classification procedure against functional nonparametric supervised discrimination is now tested for fractal and long-range dependence Gaussian models, in the non-self-similar case, where micro-scale and macro-scale properties do not coincide. Since, as commented, LWV provides a local fitting in terms of wavelet-like functions suitably reproducing the local self-similar behavior of fractal processes, LWV functional logistic regression also outperforms nonparametric supervised classification in the fractal and long-range dependence non-self-similar case. This fact is illustrated with the following example, given in terms of the centered Gaussian Linnik process with fractal covariance function:
We have considered the parameter values α = 1/3 and σ = 25. The estimated probabilities of the G1 and G0 groups, after applying k-means clustering with k = 2 over ten generated learning samples of size 1000, are
and
.
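The group-probability step, k-means clustering with k = 2 averaged over several learning samples, can be sketched as follows. The deterministic initialization, the rule used to match clusters across samples (the cluster with the higher average level is called G1), the toy data and the function names are all assumptions made for illustration; they are not taken from the authors' code.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two discretized curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(curves, max_iter=100):
    """Lloyd's algorithm with k = 2; returns the labels and the two centers."""
    # Deterministic start (an assumption of this sketch): the first curve
    # and the curve farthest from it.
    first = curves[0]
    far = max(curves, key=lambda c: squared_distance(c, first))
    centers = [list(first), list(far)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if squared_distance(c, centers[0]) <= squared_distance(c, centers[1]) else 1
            for c in curves
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for g in (0, 1):
            members = [c for c, lab in zip(curves, labels) if lab == g]
            if members:
                centers[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def group_probabilities(learning_samples):
    """Average, over learning samples, the share of curves in each cluster."""
    shares = []
    for curves in learning_samples:
        labels, centers = two_means(curves)
        # Cluster numbering is arbitrary, so call "G1" the cluster whose
        # center has the larger average level (hypothetical matching rule).
        g1 = 0 if sum(centers[0]) > sum(centers[1]) else 1
        shares.append(sum(lab == g1 for lab in labels) / len(labels))
    p1 = sum(shares) / len(shares)
    return 1.0 - p1, p1


# Ten toy learning samples of 100 curves each, 30 of them at the higher level.
rng = random.Random(1)
learning_samples = [
    [[level + rng.gauss(0.0, 0.1) for _ in range(8)]
     for level in [0.0] * 70 + [5.0] * 30]
    for _ in range(10)
]
p0, p1 = group_probabilities(learning_samples)
print(round(p0, 3), round(p1, 3))
```

With real curves the matching rule matters: without it, the label "G1" could refer to different clusters in different learning samples and the averaged proportions would be meaningless.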
The results obtained with the LWV and functional nonparametric classification methodologies are displayed in Tables 4 and 5. LWV outperforms functional nonparametric supervised classification in all cases, for the functional sample sizes M = 100, 500 and 1000 considered.
Table 4. Error rates for the Gaussian Linnik model, functional samples of size M = 100. The columns from Kernel to PCA8 refer to the nonparametric approach.

| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 100 | 0.42 | Q | 0.43 | 0.45 | 0.46 | 0.46 | 0.46 | 0.45 |
| | | T | 0.43 | 0.45 | 0.45 | 0.46 | 0.46 | 0.45 |
| | | B | 0.43 | 0.45 | 0.46 | 0.47 | 0.46 | 0.45 |
Table 5. Error rates for the Gaussian Linnik model, functional samples of sizes M = 500 and M = 1000. The columns from Kernel to PCA8 refer to the nonparametric approach.

| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 500 | 0.35 | Q | 0.43 | 0.41 | 0.42 | 0.41 | 0.41 | 0.41 |
| | | T | 0.43 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
| | | B | 0.42 | 0.41 | 0.42 | 0.41 | 0.41 | 0.41 |
| M = 1000 | 0.32 | Q | 0.42 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
| | | T | 0.42 | 0.41 | 0.41 | 0.41 | 0.41 | 0.40 |
| | | B | 0.42 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
Alternatively, to illustrate the most favorable case for the nonparametric methodology, we consider the centered Gaussian exponential model with covariance function:
The parameter values considered are σ² = 1 and a = 3. The estimated probabilities of the G1 and G0 groups, after applying k-means clustering with k = 2 over ten generated learning samples of size 1000, are
and
. The results obtained with this model are displayed in Tables 6 and 7, which show a better performance of functional nonparametric supervised classification than of LWV functional discrimination. Note that, as the functional sample size increases, the mean misclassification error rate stabilizes earlier with the nonparametric supervised classification methodology.
Table 6. Error rates for the Gaussian exponential model, functional samples of size M = 100. The columns from Kernel to PCA8 refer to the nonparametric approach.

| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 100 | 0.44 | Q | 0.35 | 0.38 | 0.40 | 0.41 | 0.38 | 0.39 |
| | | T | 0.40 | 0.37 | 0.39 | 0.40 | 0.36 | 0.37 |
| | | B | 0.38 | 0.39 | 0.39 | 0.40 | 0.37 | 0.38 |
Table 7. Error rates for the Gaussian exponential model, functional samples of sizes M = 500 and M = 1000. The columns from Kernel to PCA8 refer to the nonparametric approach.

| Sample size | LWV approach | Kernel | L2 metric | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
| M = 500 | 0.46 | Q | 0.42 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 |
| | | T | 0.42 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 |
| | | B | 0.40 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 |
| M = 1000 | 0.45 | Q | 0.40 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
| | | T | 0.40 | 0.42 | 0.41 | 0.41 | 0.41 | 0.41 |
| | | B | 0.41 | 0.41 | 0.41 | 0.41 | 0.40 | 0.40 |
4 Discussion and concluding remarks
Gene expression levels play a crucial role in the regulation of biological processes such as the cell cycle, and in the control of disease. In particular, the stochastic variation of gene expression levels across cells of the same population under identical growth conditions must be studied carefully, since noise variability can be amplified over time as a consequence of decoherence in global gene expression profiles (see, for example, Briones and Bosco, 2009). That is, the auto-correlation function of yeast genome-wide expression profiles decays slowly, owing to the stochastic component of transcription, which leads to fluctuations that tend to be amplified as time progresses. This feature calls for models more singular than the classical differentiable models associated with smoothing gene expression data. In this paper, a local wavelet-vaguelette-based approach is used to represent gene expression profiles, since it holds for a larger class of stochastic processes, not necessarily stationary, that satisfy weaker regularity and covariance moment conditions. This class includes processes with slowly decaying autocorrelation functions and high local singularity. In particular, generalized random fields are also included (see, for instance, Kelbert et al., 2005).
The main improvements achieved with our approach relate to the optimal processing of gene expression profiles in the nonstationary case, through a dimension reduction technique based on uncorrelated random Fourier coefficients. Compared with alternative functional classification procedures, such as functional nonparametric supervised classification, our methodology performs best when fractal patterns are present in the gene expression profiles. We highlight the suitability of our approach when discrimination of gene phase regulation is based on differences in the non-stationary local variability of the gene expression curves.
Finally, the MATLAB code developed to implement the LWV functional discrimination procedure and the functional nonparametric supervised classification techniques, as applied in the real data example and the simulation study, is provided as Supporting Information.
Acknowledgements
This work has been partially supported by projects MTM2009-13393 of the DGI, MEC, and P09-FQM-5052 of the Andalusian CICE, Spain, and by the Fundación para el futuro de Colombia, Colfuturo, and Fundación Carolina.
Conflict of interest
The authors have declared no conflict of interest.
Appendix
A.1 Local polynomial modelling to estimate the mean function and the covariance surface
We consider $M$ sample curves as independent realizations of a mean-square integrable stochastic process $X(t)$ on $[0, S]$. Let $y_{ij}$ be the observation of the $i$-th sample function at time $t_j$. For $i = 1,\ldots,M$ and $j = 1,\ldots,n$, we define the vectors with entries $x_{ij} = [1, t_j - t]^T$. Also, for $i = 1,\ldots,M$, we consider $X_i = [x_{i1},\ldots,x_{in}]^T$ and
where
is the Epanechnikov kernel, and h is the bandwidth chosen by cross-validation. The estimator
of the mean function μX, based on the M sample curves (Müller, 1987; Wu and Zhang, 2006), is the first component of the vector
that minimizes
for
, Kh the diagonal block matrix
and
, with
. Thus,
Similarly, let $x_{j,k} = [1, t_j - s, t_k - t]^T$ for $j,k = 1,\ldots,n$, $j \neq k$, and $K_B(x) = K(B^{-1}x)/\det(B)$, with $K$ being the spherical Epanechnikov kernel, and $B$ being a diagonal bandwidth matrix chosen properly by cross-validation. The estimated covariance surface
(Fan and Gijbels, 1996) is given by the first component of the vector
, specifically
where
with, as before,
denoting the empirical covariance function not considering the diagonal.
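The mean-function step can be illustrated with a pooled local linear fit. The sketch assumes the usual weighted least-squares form, in which the estimate at time t is the intercept of the line fitted to all observations with Epanechnikov weights centered at t. The fixed bandwidth `h` (chosen by cross-validation in the text), the toy sine-shaped mean and the function names are assumptions of this sketch.

```python
import math
import random


def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_mean(t, times, curves, h):
    """Pooled local linear estimate of the mean function at time t.

    Minimizes sum_i sum_j K_h(t_j - t) (y_ij - b0 - b1 (t_j - t))^2 over
    (b0, b1) and returns b0, the first component of the minimizer.
    """
    s0 = s1 = s2 = r0 = r1 = 0.0
    for y in curves:
        for tj, yij in zip(times, y):
            d = tj - t
            w = epanechnikov(d / h) / h
            s0 += w
            s1 += w * d
            s2 += w * d * d
            r0 += w * yij
            r1 += w * d * yij
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("bandwidth too small: need two distinct times in the window")
    # Solve the 2 x 2 normal equations for the intercept.
    return (s2 * r0 - s1 * r1) / det


# Toy sample: 20 noisy curves around a sine-shaped mean, n = 64 time points.
rng = random.Random(0)
times = [j / 63 for j in range(64)]
curves = [[math.sin(2 * math.pi * t) + rng.gauss(0.0, 0.3) for t in times]
          for _ in range(20)]
mu_hat = [local_linear_mean(t, times, curves, h=0.15) for t in times]
```

A local linear fit reproduces straight lines exactly, which gives a simple sanity check for the implementation.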
A.2 Discrete wavelet transform
A multiresolution analysis of
is defined as an increasing sequence
of closed linear subspaces of
with the following properties:
- (i)
,
,
- (ii) $f(2^j x) \in V_j$ iff $f(x) \in V_0$, for all
and
;
In particular, $V_0 \subset V_1$, so ϕ must satisfy the following equation, known as the scaling equation:
The mother wavelet ψ also satisfies the equation
where N is an arbitrary odd integer.
From now on, we consider the wavelet decomposition given in terms of a coarsest scale space $V_0$ generated by translations of the father wavelet ϕ, i.e. $j_0 = 0$. The discrete wavelet transform of
leads to a vector of wavelet coefficients $[c_k, d_{j,k}]$ such that
where
with, as before, ϕ being the so-called father wavelet, whose dilations and translations generate the bases of the closed subspaces
,
, involved in the multiresolution analysis of
, and
are the translations at resolution level $j$ of the mother wavelet ψ, generating a basis of the closed subspace $W_j$, for each
. Note that, for each
,
in a multiresolution analysis of
.
It is also convenient to point out that a Riesz basis
in
has a unique dual Riesz basis
such that
, and any function
can be represented as
The set
is called a biorthogonal sequence.
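As a concrete illustration of the coefficient vector $[c_k, d_{j,k}]$, the following sketch computes an orthonormal Haar transform of a curve sampled at a dyadic number of points and then inverts it. The choice of the Haar filter, the reduction to a single coarsest scaling coefficient, the toy signal and the function names are assumptions of this sketch.

```python
import math
import random


def haar_dwt(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns the coarsest scaling coefficient and the detail coefficients,
    ordered from the coarsest to the finest resolution level.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    root2 = math.sqrt(2.0)
    approx = list(signal)
    details = []
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        details.insert(0, [(a - b) / root2 for a, b in zip(evens, odds)])
        approx = [(a + b) / root2 for a, b in zip(evens, odds)]
    return approx[0], details


def haar_idwt(scaling, details):
    """Inverse of haar_dwt: rebuild the signal level by level."""
    root2 = math.sqrt(2.0)
    approx = [scaling]
    for level in details:
        finer = []
        for a, d in zip(approx, level):
            finer.extend([(a + d) / root2, (a - d) / root2])
        approx = finer
    return approx


# Toy curve observed at n = 64 points.
rng = random.Random(0)
profile = [rng.gauss(0.0, 1.0) for _ in range(64)]
c0, d = haar_dwt(profile)
rebuilt = haar_idwt(c0, d)
print(max(abs(a - b) for a, b in zip(profile, rebuilt)))
```

Because the transform is orthonormal, the sum of squares of the coefficients equals that of the signal, so the coefficients of a random curve inherit its covariance structure and are in general correlated.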
A.3 Multiresolution-like-analysis
Let $X(t)$, $t \in [0, S]$, be a real zero-mean second-order stochastic process. Under weak regularity conditions, the covariance function $C_X(s, t) = E[X(t)X(s)] - E[X(t)]E[X(s)]$ can be factorized as follows (see Ruiz-Medina et al., 2003)
where $t_X$ denotes the kernel of the integral operator
such that the associated covariance operator $R_X$ satisfies
The wavelet transform of X(t) leads to a sequence of correlated random wavelet coefficients. To avoid redundancy in such coefficients, the random wavelet-vaguelette decomposition of a random signal is considered (see Angulo and Ruiz-Medina, 1999). The transformed scaling and wavelet bases are constructed as
where
is an orthogonal wavelet basis of $L^2([0,S])$. Here, $\Gamma_0$ and
,
, denote the sets of translations needed for covering the interval [0, S] at each resolution level. The dual, biorthogonal, basis of (A1) is then defined as
For an orthogonal wavelet basis, such as the Haar basis, the following biorthogonality condition for the associated wavelet-vaguelette functions then holds:
The projection of X on the above biorthogonal bases leads to the wavelet-vaguelette (i.e. wavelet-like) decomposition
where the random projections
, and
are uncorrelated and, thus, independent in the Gaussian case. Note that the above wavelet-vaguelette decomposition holds, in particular, for fractal processes with high structural local variation, since they are continuous but not differentiable (see Ruiz-Medina et al., 2003).
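A finite-dimensional analogue may help fix ideas. In the sketch below the covariance factorization is replaced by a Cholesky factor $L$ of the covariance matrix on a grid, the orthonormal basis is the discrete Haar basis with vectors $\psi_k$, and the dual vectors are taken as $L^{-T}\psi_k$. Projections on the plain Haar vectors are correlated, while projections on the dual vectors have identity covariance. The exponential covariance and its parameterization, the grid, and all function names are assumptions chosen for illustration; this is not the construction used by the authors.

```python
import math


def exponential_cov(s, t, sigma2=1.0, scale=1.0):
    """Exponential covariance sigma2 * exp(-|s - t| / scale) (assumed form)."""
    return sigma2 * math.exp(-abs(s - t) / scale)


def cholesky(c):
    """Lower-triangular L with L L^T = c, for a positive definite matrix c."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(c[i][i] - s)
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low


def solve_transposed(low, b):
    """Solve L^T y = b by back substitution, with L lower triangular."""
    n = len(low)
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * y[k] for k in range(i + 1, n))
        y[i] = (b[i] - s) / low[i][i]
    return y


def haar_basis(n):
    """Orthonormal Haar vectors of length n = 2**J, returned as rows."""
    rows = [[1.0 / math.sqrt(n)] * n]
    width = n
    while width > 1:
        half = width // 2
        for start in range(0, n, width):
            v = [0.0] * n
            for i in range(start, start + half):
                v[i] = 1.0 / math.sqrt(width)
            for i in range(start + half, start + width):
                v[i] = -1.0 / math.sqrt(width)
            rows.append(v)
        width = half
    return rows


def coefficient_cov(vectors, c):
    """Covariance matrix of the projections <X, v> when Cov(X) = c."""
    cv = [[sum(c[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
          for v in vectors]
    return [[sum(u[i] * w[i] for i in range(len(u))) for w in cv] for u in vectors]


def max_off_diagonal(m):
    """Largest absolute off-diagonal entry of a square matrix."""
    return max(abs(m[i][j]) for i in range(len(m)) for j in range(len(m)) if i != j)


n = 8
grid = [i / n for i in range(n)]
cov = [[exponential_cov(s, t) for t in grid] for s in grid]
psi = haar_basis(n)
low = cholesky(cov)
duals = [solve_transposed(low, v) for v in psi]  # L^{-T} psi_k

wavelet_cov = coefficient_cov(psi, cov)       # plain Haar coefficients
vaguelette_cov = coefficient_cov(duals, cov)  # coefficients on the dual vectors
print(max_off_diagonal(wavelet_cov), max_off_diagonal(vaguelette_cov))
```

The second printed value is zero up to rounding because $\psi_k^T L^{-1} (L L^T) L^{-T} \psi_m = \psi_k^T \psi_m$, which is the discrete counterpart of the decorrelation property stated above.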
References
- and (2001). Alignment gene expression time series with time warping algorithms. Bioinformatics 17, 495–508.
- (1981). The Geometry of Random Fields. Wiley, Chichester.
- Angulo and Ruiz-Medina (1999). Multiresolution approximation to the stochastic inverse problem. Advances in Applied Probability 31, 1039–1057.
- , and (2008). Functional supervised classification with wavelets. Annales de l'ISUP 52, 61–80.
- and (2001). Genome-wide expression analysis of plant cell cycle modulated genes. Current Opinion in Plant Biology 4, 136–142.
- Briones and Bosco (2009). Decoherence in yeast cell populations and its implications for genome-wide expression noise. Genetics and Molecular Research 8, 47–51.
- and (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis 92, 24–41.
- , , , , , , , , , and (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2, 65–73.
- , , , , , , , , and (2001). Transcriptional regulation and function during the human cell cycle. Nature Genetics 27, 48–54.
- , and (1992). Biorthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics 45, 485–560.
- and (2008). An Introduction to Generalized Linear Models. Chapman & Hall, New York.
- (2003). Data Analysis Tools for DNA Microarrays. Chapman & Hall, New York.
- , , and (1998). Cluster analysis and display genomewide expression patterns. Proceedings of the National Academy of Sciences USA 95, 14863–14868.
- and (2005). Statistical Methods in Bioinformatics: An Introduction (Statistics for Biology and Health). Springer, New York.
- Fan and Gijbels (1996). Local Polynomial Modelling and its Applications. Chapman & Hall, London.
- and (2006). Nonparametric Functional Data Analysis. Springer, New York.
- , and (2001). A functional data-analytic approach to signal discrimination. Technometrics 43, 1–9.
- and (2009). Clustering of gene expression data based on shape similarity. Journal on Bioinformatics and Systems Biology. DOI: 10.1155/2009/195712.
- , , , and (2001). Dynamic modeling of gene expression data. Proceedings of the National Academy of Sciences 98, 16–93.
- and (2001). Functional linear discriminant analysis for irregular sampled curves. Journal of the Royal Statistical Society: Series B 63, 533–550.
- , and (2009). Functional linear regression, that's interpretable. Annals of Statistics 37, 2083–2108.
- and (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
- Kelbert et al. (2005). Fractional random fields associated with stochastic fractional heat equations. Advances in Applied Probability 37, 108–133.
- , and (2008). Independent arrays or independent time courses for gene expression time series data analysis. Neurocomputing 71, 23–77.
- (2000). Dynamic architecture of the yeast cell cycle uncovered by wavelet decomposition of expression microarray data. Functional and Integrative Genomics 1, 186–192.
- , , , and (2000). Global analysis of the genetic network controlling a bacterial cell cycle. Science 290, 2144–2148.
- and (2006a). Classification using functional data analysis for temporal gene expression data. Bioinformatics 22, 68–76.
- and (2006b). Time ordering of gene co-expression. Biostatistics 7, 569–584.
- and (2003). Modes and clustering for time-warped gene expression profile data. Bioinformatics 19, 1937–1944.
- and (1989). Generalized Linear Models. Chapman & Hall, London.
- Müller (1987). Weighted local regression and kernel methods for nonparametric curve fitting. Journal of the American Statistical Association 82, 231–238.
- (2005). Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics 32, 223–240.
- , and (2008). Inferring gene expression dynamics via functional regression analysis. BMC Bioinformatics, 60.
- and (2005). Generalized functional linear models. Annals of Statistics 33, 774–805.
- , , , , , , , , , and (2005). Identification of cell cycle-regulated genes in fission yeast. Molecular Biology of the Cell 16, 1026–1042.
- and (1984). Cross-validation of regression models. Journal of the American Statistical Association 79, 575–583.
- and (2000). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57, 253–259.
- Ruiz-Medina et al. (2003). Fractional generalized random fields on bounded domains. Stochastic Analysis and Applications 21, 465–492.
- , , , , , , , , and (2004). Identification of cell cycle-regulated genes in fission yeast. Nature Genetics 36, 809–817.
- , , and (2000). Normalization strategies for cDNA microarrays. Nucleic Acids Research 28, E47.
- (2003). Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall, New York.
- , , , , , , , and (1998). Comprehensive identification of cell-cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297.
- and (2001). Gene expression analysis with parametric bootstrap. Biostatistics 2, 445–461.
- (2006). Statistical Modelling by Wavelets. Wiley, New York.
- and (2005). A gene selection algorithm based on the gene regulation probability using maximal likelihood estimation. Biotechnology Letters 27, 597–603.
- and (2010). Functional embedding for the classification of gene expression profiles. Bioinformatics 26, 509–517.
- Wu and Zhang (2006). Nonparametric Regression Methods for Longitudinal Data Analysis. Wiley, New Jersey.
- , , , , , , and (2003). Shrinkage estimation for functional principal component scores, with application to the population kinetics of plasma folate. Biometrics 59, 676–685.
- , and (2004). The functional data analysis view of longitudinal data. Statistica Sinica 14, 789–808.
Supporting Information
Detailed facts of importance to specialist readers are published as "Supporting Information". Such documents are peer-reviewed, but not copy-edited or typeset. They are made available as submitted by the authors.
| Filename | Format | Size | Description |
|---|---|---|---|
| bimj_201000135_sm_SupplInfo.zip | | 678K | SupplInfo |
Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.
