Fast covariance estimation for multivariate sparse functional data

Covariance estimation is essential yet underdeveloped for analysing multivariate functional data. We propose a fast covariance estimation method for multivariate sparse functional data using bivariate penalized splines. The tensor‐product B‐spline formulation of the proposed method enables a simple spectral decomposition of the associated covariance operator and explicit expressions of the resulting eigenfunctions as linear combinations of B‐spline bases, thereby dramatically facilitating subsequent principal component analysis. We derive a fast algorithm for selecting the smoothing parameters in covariance smoothing using leave‐one‐subject‐out cross‐validation. The method is evaluated with extensive numerical studies and applied to an Alzheimer's disease study with multiple longitudinal outcomes.

The interest of this paper is FPCA for multivariate sparse functional data, where multiple responses are observed at time points that vary from subject to subject and may even vary between responses within subjects. Far fewer works handle such data. The approach in Zhou et al. (2008) focuses on bivariate functional data and can be extended to more than two responses, although model selection (e.g., selection of smoothing parameters) can be computationally difficult, and convergence of the expectation-maximization estimation algorithm could also be an issue. The local polynomial method in Yao et al. (2005) can be applied to multivariate sparse functional data, although a major drawback is the selection of multiple bandwidths. Moreover, because the local polynomial method is a local approach, there is no guarantee that the resulting estimates of covariance functions will lead to a properly defined covariance operator. The approach in Happ and Greven (2018) (denoted by mFPCA hereafter) estimates cross-covariances via scores from univariate FPCA and hence can be applied to multivariate sparse functional data. Although mFPCA is theoretically sound for dense functional data, it may not capture cross-correlations between functions because scores from univariate FPCA for sparse functional data are shrunk towards zero.
We propose a novel and fast covariance-based FPCA method for multivariate sparse functional data. Note that multiple auto-covariance functions for within-function correlations and cross-covariance functions for between-function correlations have to be estimated. Tensor-product B-splines are employed to approximate the covariance functions, and a smoothness penalty as in bivariate penalized splines (Eilers & Marx, 2003) is adopted to avoid overfitting. The individual estimates of the covariance functions are then pooled and refined. The advantages of the new method are manifold. First, the tensor-product B-spline formulation is computationally efficient for multivariate sparse functional data. Second, a fast fitting algorithm for selecting the smoothing parameters is derived, which alleviates the computational burden of leave-one-subject-out cross-validation. Third, the tensor-product B-spline representation of the covariance functions enables a straightforward spectral decomposition of the covariance operator for the multivariate functional data; see Proposition 1. In particular, the eigenfunctions associated with the covariance operator are explicit functions of the B-spline bases. Last but not least, via a simple truncation step, the refined estimates of the covariance functions lead to a properly defined covariance operator.
Compared with mFPCA, the proposed method does not rely on scores from univariate FPCA, which can be severely shrunk for sparse functional data, and hence it better captures the correlations between functions. Improved correlation estimation in turn leads to improved subsequent FPCA and curve prediction. The proposed method also compares favourably with the local polynomial method in Yao et al. (2005) because of the computationally efficient tensor-product spline formulation of the covariance functions and the derived fast algorithm for selecting the smoothing parameters. Moreover, as mentioned above, there exists an explicit and easy-to-calculate relationship between the tensor-product spline representation of the covariance functions and the associated eigenfunctions/eigenvalues, which greatly facilitates subsequent FPCA.
In addition to FPCA, there is also an abundant literature on models for multivariate functional data, with most works focusing on dense functional data.
The remainder of the paper proceeds as follows. In Section 2, we present our proposed method. We conduct extensive simulation studies in Section 3 and apply the proposed method to an Alzheimer's disease (AD) study in Section 4. A discussion is given in Section 5. All technical details are enclosed in the Appendix.

| Framework of multivariate FPCA

Let $x(t) = \{x^{(1)}(t), \ldots, x^{(p)}(t)\}^\top$, $t \in \mathcal{T}$, be a vector of $p$ random functions in $\mathcal{H} = \{L^2(\mathcal{T})\}^p$ with mean $\mu(t)$ and covariance functions $C_{kk'}(s,t) = \mathrm{Cov}\{x^{(k)}(s), x^{(k')}(t)\}$. The associated covariance operator $\Gamma : \mathcal{H} \to \mathcal{H}$ is given by $(\Gamma f)(s) = \int_{\mathcal{T}} C(s,t) f(t)\,dt$, where $C(s,t) = \{C_1(s,t), \ldots, C_p(s,t)\}^\top$ and $C_k(s,t) = \{C_{k1}(s,t), \ldots, C_{kp}(s,t)\}^\top$. Note that $\Gamma$ is a linear, self-adjoint, compact, and non-negative integral operator. By the Hilbert-Schmidt theorem, there exists a set of orthonormal bases $\{\Psi_\ell\}_{\ell \ge 1} \subset \mathcal{H}$, $\Psi_\ell = \{\Psi_\ell^{(1)}, \ldots, \Psi_\ell^{(p)}\}^\top$, such that $\Gamma \Psi_\ell = d_\ell \Psi_\ell$, where $d_\ell$ is the $\ell$th largest eigenvalue corresponding to $\Psi_\ell$. Then, the multivariate Mercer's theorem gives
$$C_{kk'}(s,t) = \sum_{\ell \ge 1} d_\ell\, \Psi_\ell^{(k)}(s)\, \Psi_\ell^{(k')}(t). \tag{1}$$
As shown in Saporta (1981), $x(t)$ has the multivariate Karhunen-Loève representation
$$x(t) = \mu(t) + \sum_{\ell \ge 1} \xi_\ell \Psi_\ell(t), \qquad \xi_\ell = \int_{\mathcal{T}} \Psi_\ell(t)^\top \{x(t) - \mu(t)\}\,dt, \tag{2}$$
where the scores $\xi_\ell$ are uncorrelated with mean zero and variance $d_\ell$. The covariance operator $\Gamma$ has the positive semidefiniteness property; that is, for any $a = (a_1, \ldots, a_p)^\top \in \mathbb{R}^p$, the covariance function of $a^\top x$, denoted by $C_a(s,t)$, satisfies that for any set of time points $(t_1, \ldots, t_q) \subset \mathcal{T}$ with an arbitrary positive integer $q$, the square matrix $\{C_a(t_i, t_j)\}_{1 \le i,j \le q} \in \mathbb{R}^{q \times q}$ is positive semidefinite.

| Covariance estimation by bivariate penalized splines
Suppose that the observed data take the form $\{(y_{ij}^{(k)}, t_{ij}^{(k)}): i = 1, \ldots, n;\ j = 1, \ldots, m_{ik};\ k = 1, \ldots, p\}$, where $y_{ij}^{(k)}$ is the observed $k$th response for subject $i$ at time $t_{ij}^{(k)}$, $n$ is the number of subjects, and $m_{ik}$ is the number of observations for subject $i$'s $k$th response. The model is
$$y_{ij}^{(k)} = x_i^{(k)}(t_{ij}^{(k)}) + \epsilon_{ij}^{(k)}, \tag{3}$$
where the $x_i(t) = \{x_i^{(1)}(t), \ldots, x_i^{(p)}(t)\}^\top$ are independent copies of $x(t)$ and the $\epsilon_{ij}^{(k)}$ are random noises with zero means and variances $\sigma_k^2$ and are independent across $i$, $j$, and $k$. The goal is to estimate the covariance functions $C_{kk'}$. We adopt a three-step procedure. In the first step, empirical estimates of the covariance functions are constructed. Let $r_{ij}^{(k)} = y_{ij}^{(k)} - \mu^{(k)}(t_{ij}^{(k)})$ be the residuals and $\hat{C}_{ij_1j_2}^{(kk')} = r_{ij_1}^{(k)} r_{ij_2}^{(k')}$ be the auxiliary variables. Note that $\hat{C}_{ij_1j_2}^{(kk')}$ is an unbiased estimate of $C_{kk'}(t_{ij_1}^{(k)}, t_{ij_2}^{(k')})$ whenever $k \neq k'$ or $j_1 \neq j_2$. In the second step, the noisy auxiliary variables are smoothed to obtain smooth estimates of the covariance functions. For smoothing, we use bivariate P-splines (Eilers & Marx, 2003) because they form an automatic smoother and are computationally simple. In the final step, we pool all estimates of the individual covariance functions and use an extra step of eigendecomposition to obtain refined estimates of the covariance functions. The refined estimates lead to a covariance operator that is properly defined, that is, positive semidefinite. In practice, the mean functions $\mu^{(k)}$ are unknown, and we estimate them using P-splines (Eilers & Marx, 1996) with the smoothing parameters selected by leave-one-subject-out cross-validation; see the Supporting Information for details. Denote the estimates by $\hat{\mu}^{(k)}$; the actual auxiliary variables are then $\hat{C}_{ij_1j_2}^{(kk')} = \{y_{ij_1}^{(k)} - \hat{\mu}^{(k)}(t_{ij_1}^{(k)})\}\{y_{ij_2}^{(k')} - \hat{\mu}^{(k')}(t_{ij_2}^{(k')})\}$.
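To make the first step concrete, a minimal R sketch of the auxiliary-variable construction follows; the list-based data layout (times `t[[i]][[k]]` and residuals `r[[i]][[k]]`) is our own illustrative convention, not anything prescribed by the paper.

```r
# Sketch: auxiliary variables for one pair of responses (k, kp), assuming
# residuals r[[i]][[k]] and times t[[i]][[k]] are stored per subject/response.
make_aux <- function(t, r, k, kp) {
  out <- lapply(seq_along(r), function(i) {
    rk <- r[[i]][[k]]; rkp <- r[[i]][[kp]]
    tk <- t[[i]][[k]]; tkp <- t[[i]][[kp]]
    cp <- outer(rk, rkp)  # all products r_{ij1}^{(k)} r_{ij2}^{(k')}
    data.frame(id = i,
               s  = rep(tk, times = length(tkp)),
               t  = rep(tkp, each = length(tk)),
               C  = as.vector(cp))
  })
  do.call(rbind, out)
}
# For k = kp, the diagonal pairs (j1 = j2) estimate C_kk(t, t) + sigma_k^2
# and are handled separately in the auto-covariance smoothing step.
```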
The bivariate P-splines model $C_{kk'}(s,t)$ using tensor-product splines $G_{kk'}(s,t)$ for $1 \le k, k' \le p$. Specifically,
$$G_{kk'}(s,t) = \sum_{1 \le l_1, l_2 \le c} \theta_{kk', l_1 l_2}\, B_{l_1}(s) B_{l_2}(t) = b(s)^\top \Theta_{kk'} b(t),$$
where $\Theta_{kk'} = (\theta_{kk', l_1 l_2}) \in \mathbb{R}^{c \times c}$ is a coefficient matrix, $\{B_1(\cdot), \ldots, B_c(\cdot)\}$ is the collection of B-spline basis functions in $\mathcal{T}$ with $b(t) = \{B_1(t), \ldots, B_c(t)\}^\top$, and $c$ is the number of equally spaced interior knots plus the order (degree plus 1) of the B-splines. Because $C_{kk'}(s,t) = C_{k'k}(t,s) = \mathrm{Cov}\{x^{(k)}(s), x^{(k')}(t)\}$, it is reasonable to impose the assumption that $\Theta_{kk'} = \Theta_{k'k}^\top$ so that $G_{kk'}(s,t) = G_{k'k}(t,s)$. Therefore, in the rest of the section, we consider only $k \le k'$.
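As an illustration of this representation, the sketch below evaluates $G_{kk'}(s,t) = b(s)^\top \Theta_{kk'} b(t)$ with a B-spline basis built by `splines::splineDesign`; the equally spaced knot construction is one common convention, and the function names are ours.

```r
library(splines)

# B-spline basis b(t) on [0, 1]; c = number of interior knots + order
bbasis <- function(t, n_knots = 7, order = 4) {
  knots <- c(rep(0, order - 1), seq(0, 1, length.out = n_knots + 2),
             rep(1, order - 1))
  splineDesign(knots, t, ord = order)
}

# Evaluate G(s, t) = b(s)' Theta b(t) on grids of s and t values
eval_cov <- function(s, t, Theta) {
  bbasis(s) %*% Theta %*% t(bbasis(t))
}
```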
Let $D \in \mathbb{R}^{(c-2) \times c}$ denote a second-order differencing matrix such that for a vector $a = (a_1, \ldots, a_c)^\top \in \mathbb{R}^c$, $Da = (a_3 - 2a_2 + a_1,\ a_4 - 2a_3 + a_2,\ \ldots,\ a_c - 2a_{c-1} + a_{c-2})^\top \in \mathbb{R}^{c-2}$. Also let $\|\cdot\|_F$ be the Frobenius norm. For the cross-covariance function $C_{kk'}(s,t)$ with $k < k'$, the bivariate P-splines estimate the coefficient matrix $\Theta_{kk'}$ by $\hat{\Theta}_{kk'}$, which minimizes the penalized least squares
$$\sum_{i=1}^{n} \sum_{j_1=1}^{m_{ik}} \sum_{j_2=1}^{m_{ik'}} \left\{ \hat{C}_{ij_1j_2}^{(kk')} - G_{kk'}(t_{ij_1}^{(k)}, t_{ij_2}^{(k')}) \right\}^2 + \lambda_{kk'1} \|D \Theta_{kk'}\|_F^2 + \lambda_{kk'2} \|D \Theta_{kk'}^\top\|_F^2, \tag{4}$$
where $\lambda_{kk'1}$ and $\lambda_{kk'2}$ are two non-negative smoothing parameters that balance the model fit and the smoothness of the estimate and will be determined later. Indeed, the column penalty $\|D\Theta_{kk'}\|_F^2$ penalizes the second-order consecutive differences of the columns of $\Theta_{kk'}$ and, similarly, the row penalty $\|D\Theta_{kk'}^\top\|_F^2$ penalizes the second-order consecutive differences of the rows of $\Theta_{kk'}$. The two penalty terms essentially penalize the second-order partial derivatives of $G_{kk'}(s,t)$ along the $s$ and $t$ directions. The two smoothing parameters are allowed to differ to accommodate different levels of smoothing along the two directions.
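The differencing matrix and the two Frobenius-norm penalties are simple to form in R; a minimal sketch under the definitions above:

```r
# Second-order differencing matrix D in R^{(c-2) x c};
# applying diff() twice to the identity reproduces Da as defined above
diff_mat <- function(nb) diff(diag(nb), differences = 2)

# Penalized least-squares criterion (4) for a trial coefficient matrix,
# given the residual sum of squares from the spline fit
pls_crit <- function(rss, Theta, lambda1, lambda2, D) {
  rss +
    lambda1 * sum((D %*% Theta)^2) +     # column penalty ||D Theta||_F^2
    lambda2 * sum((D %*% t(Theta))^2)    # row penalty ||D Theta'||_F^2
}
```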
For the auto-covariance functions $C_{kk}(s,t)$ with $k = 1, \ldots, p$, we conduct bivariate covariance smoothing by enforcing the following constraint on the coefficient matrix $\Theta_{kk}$:
$$\Theta_{kk} = \Theta_{kk}^\top. \tag{5}$$
It follows that $G_{kk}(s,t)$ is a symmetric function. Then, the coefficient matrix $\Theta_{kk}$ and the error variance $\sigma_k^2$ are jointly estimated by $\hat{\Theta}_{kk}$ and $\hat{\sigma}_k^2$, which minimize the penalized least squares
$$\sum_{i=1}^{n} \sum_{1 \le j_1, j_2 \le m_{ik}} \left[ \hat{C}_{ij_1j_2}^{(kk)} - G_{kk}(t_{ij_1}^{(k)}, t_{ij_2}^{(k)}) - \sigma_k^2\, 1\{j_1 = j_2\} \right]^2 + \lambda_k \left( \|D\Theta_{kk}\|_F^2 + \|D\Theta_{kk}^\top\|_F^2 \right) \tag{6}$$
over all symmetric $\Theta_{kk}$, where $\lambda_k$ is a smoothing parameter. Note that the two penalty terms in Equation (4) become the same when $\Theta_{kk}$ is symmetric, and thus only one smoothing parameter is needed for auto-covariance estimation.

| Estimation
We first introduce the notation. Let $\mathrm{vec}(\cdot)$ be an operator that stacks the columns of a matrix into a column vector, and denote by $\otimes$ the Kronecker product. Fix $k$ and $k'$ with $k \le k'$. Let $\theta_{kk'} = \mathrm{vec}(\Theta_{kk'}) \in \mathbb{R}^{c^2}$ be the vector of coefficients, and recall the basis vector $b(t) = \{B_1(t), \ldots, B_c(t)\}^\top \in \mathbb{R}^c$. Then, $G_{kk'}(s,t) = b(s)^\top \Theta_{kk'} b(t) = \{b(t) \otimes b(s)\}^\top \theta_{kk'}$. We now organize the auxiliary responses $\hat{C}_{ij_1j_2}^{(kk')}$ for each pair of $k$ and $k'$. Let $\hat{C}_{kk'} \in \mathbb{R}^{N_{kk'}}$ stack the auxiliary responses, where $N_{kk'} = \sum_{i=1}^n m_{ik} m_{ik'}$ is the total number of auxiliary responses for the pair of $k$ and $k'$. As for the B-splines, let $B_{kk'} \in \mathbb{R}^{N_{kk'} \times c^2}$ be the corresponding design matrix with rows $\{b(t_{ij_2}^{(k')}) \otimes b(t_{ij_1}^{(k)})\}^\top$. For estimation of the cross-covariance functions $C_{kk'}$ with $k < k'$, the penalized least squares in Equation (4) can be rewritten as
$$\| \hat{C}_{kk'} - B_{kk'} \theta_{kk'} \|^2 + \theta_{kk'}^\top (\lambda_{kk'1} P_1 + \lambda_{kk'2} P_2)\, \theta_{kk'}, \tag{7}$$
where $P_1 = I_c \otimes D^\top D$ and $P_2 = D^\top D \otimes I_c$. The expression in Equation (7) is a quadratic function of the coefficient vector $\theta_{kk'}$. Therefore, we derive
$$\hat{\theta}_{kk'} = (B_{kk'}^\top B_{kk'} + \lambda_{kk'1} P_1 + \lambda_{kk'2} P_2)^{-1} B_{kk'}^\top \hat{C}_{kk'}.$$
For estimation of the auto-covariance functions, because of the constraint on the coefficient matrix in Equation (5), let $\eta_k \in \mathbb{R}^{c(c+1)/2}$ be a vector obtained by stacking the columns of the lower triangle of $\Theta_{kk}$, and let $G_c \in \mathbb{R}^{c^2 \times c(c+1)/2}$ be a duplication matrix such that $\theta_{kk} = G_c \eta_k$ (Seber, 2008). It follows that the penalized least squares in Equation (6) can be rewritten as a quadratic function of $\eta_k$ and $\sigma_k^2$ jointly and minimized in the same fashion. The above estimates of the covariance functions may not lead to a positive semidefinite covariance operator and thus have to be refined. We pool all estimates together, and we shall use the following proposition.
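Because Equation (7) is quadratic in $\theta_{kk'}$, the estimate is a ridge-type linear solve. A self-contained sketch (function and variable names are ours):

```r
# Closed-form penalized least squares for the cross-covariance case:
#   theta_hat = (B'B + lambda1 P1 + lambda2 P2)^{-1} B' C_hat
fit_cross_cov <- function(B, C_hat, lambda1, lambda2, D) {
  nb  <- round(sqrt(ncol(B)))          # B has c^2 columns
  DtD <- crossprod(D)                  # D'D
  P1  <- diag(nb) %x% DtD              # I_c  (x) D'D
  P2  <- DtD %x% diag(nb)              # D'D (x) I_c
  A   <- crossprod(B) + lambda1 * P1 + lambda2 * P2
  theta <- solve(A, crossprod(B, C_hat))
  matrix(theta, nrow = nb)             # reshape back to Theta (c x c)
}
```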
Proposition 1. Suppose that the covariance functions are given by $C_{kk'}(s,t) = b(s)^\top \Theta_{kk'} b(t)$ for $1 \le k, k' \le p$. Let $\Theta = (\Theta_{kk'})_{1 \le k, k' \le p} \in \mathbb{R}^{pc \times pc}$, let $G = \int_{\mathcal{T}} b(t) b(t)^\top dt \in \mathbb{R}^{c \times c}$, and let $\sum_\ell d_\ell u_\ell u_\ell^\top$ be the spectral decomposition of $\bar{\Theta} = (I_p \otimes G^{1/2})\, \Theta\, (I_p \otimes G^{1/2})$ with $u_\ell = \{u_\ell^{(1)\top}, \ldots, u_\ell^{(p)\top}\}^\top$. Then, the eigenvalues of the associated covariance operator $\Gamma$ are the $d_\ell$'s, and the corresponding eigenfunctions are $\Psi_\ell = \{\Psi_\ell^{(1)}, \ldots, \Psi_\ell^{(p)}\}^\top$ with $\Psi_\ell^{(k)}(t) = b(t)^\top G^{-1/2} u_\ell^{(k)}$.

The proof is provided in Appendix A. Proposition 1 implies that, with the tensor-product B-spline representation of the covariance functions, one spectral decomposition gives us the eigenvalues and eigenfunctions. In particular, the eigenfunctions $\Psi_\ell^{(k)}(t)$ are linear combinations of the B-spline basis functions, which means that they can be straightforwardly evaluated, an advantage of spline-based methods compared with other smoothing methods, for which eigenfunctions are approximated by spectral decompositions of the covariance functions evaluated at a grid of time points.
We discard the negative $\hat{d}_\ell$ to ensure that the multivariate covariance operator is positive semidefinite, and this leads to a refined estimate of the coefficient matrix
$$\tilde{\Theta} = (I_p \otimes G^{-1/2}) \Big( \sum_{\ell:\, \hat{d}_\ell > 0} \hat{d}_\ell\, \hat{u}_\ell \hat{u}_\ell^\top \Big) (I_p \otimes G^{-1/2}),$$
with blocks $\tilde{\Theta}_{kk'}$. Then, the refined estimate of the covariance functions is $\tilde{C}_{kk'}(s,t) = b(s)^\top \tilde{\Theta}_{kk'} b(t)$. Proposition 1 also suggests that the eigenfunctions can be estimated by $\tilde{\Psi}_\ell^{(k)}(t) = b(t)^\top G^{-1/2} \hat{u}_\ell^{(k)}$. For principal component analysis or curve prediction in practice, one may further select the number of principal components by either the proportion of variance explained (PVE) (Greven, Crainiceanu, Caffo, & Reich, 2010) or an Akaike information criterion-type criterion (Li, Wang, & Carroll, 2013). Here, we follow Greven et al. (2010) in using PVE with a value of 0.99.
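The following sketch carries out the decomposition of Proposition 1 and the truncation step; for simplicity, $G$ is approximated numerically on a fine grid, although a closed form is available for B-splines. The function names are ours.

```r
# One spectral decomposition gives eigenvalues/eigenfunctions (Proposition 1);
# negative eigenvalues are discarded to obtain the refined estimate.
prop1_decomp <- function(Theta, p, bbasis, ngrid = 1001) {
  tg  <- seq(0, 1, length.out = ngrid)
  Bg  <- bbasis(tg)
  G   <- crossprod(Bg) / ngrid                # G = int b(t) b(t)' dt (approx.)
  eG  <- eigen(G, symmetric = TRUE)
  Gh  <- eG$vectors %*% (sqrt(eG$values) * t(eG$vectors))      # G^{1/2}
  Ghi <- eG$vectors %*% ((1 / sqrt(eG$values)) * t(eG$vectors))# G^{-1/2}
  M   <- diag(p) %x% Gh
  eg  <- eigen(M %*% Theta %*% M, symmetric = TRUE)
  keep <- eg$values > 0                        # truncation step
  d <- eg$values[keep]; U <- eg$vectors[, keep, drop = FALSE]
  Mi <- diag(p) %x% Ghi
  Theta_tilde <- Mi %*% (U %*% (d * t(U))) %*% Mi  # refined coefficient matrix
  # eigenfunctions: Psi_l^{(k)}(t) = b(t)' G^{-1/2} u_l^{(k)}, evaluable anywhere
  list(values = d, U = U, Theta_tilde = Theta_tilde, Ghi = Ghi)
}
```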

| Selection of smoothing parameters
We select the smoothing parameters in each auto-covariance/cross-covariance estimation using leave-one-subject-out cross-validation; see, for example, Yao et al. (2005). A fast approximate algorithm has been derived previously for auto-covariance smoothing of univariate sparse functional data, so we focus here on the cross-covariance and use the notation in Equation (7). Note that there are two smoothing parameters for each cross-covariance estimation.
For simplicity, we suppress the superscript and subscript $kk'$ in Equation (7) for both $\hat{C}$ and $B$, and we write $\hat{C}_i$ for the subvector of auxiliary responses from subject $i$. Let $\tilde{C}_i^{[i]}$ be the prediction of $\hat{C}_i$ from the estimate using the data without the $i$th subject. Let $\|\cdot\|$ be the Euclidean norm; the cross-validation error is
$$\mathrm{iCV} = \sum_{i=1}^{n} \big\| \hat{C}_i - \tilde{C}_i^{[i]} \big\|^2. \tag{8}$$
We shall also now suppress the subscript $k$ from $m_{ik}$ and $kk'$ from $N_{kk'}$.
Let $\tilde{C} = S\hat{C}$ denote the fitted auxiliary responses, where $S = B(B^\top B + \lambda_1 P_1 + \lambda_2 P_2)^{-1} B^\top$ is the smoother matrix, and let $S_{ii}$ be the diagonal block of $S$ corresponding to subject $i$. Then, a shortcut formula for Equation (8) is
$$\mathrm{iCV} = \sum_{i=1}^{n} \big\| (I_{m_i^2} - S_{ii})^{-1} (\hat{C}_i - \tilde{C}_i) \big\|^2.$$
Similar to Xu and Huang (2012), the iCV can be further simplified by adopting the approximation $(I_{m_i^2} - S_{ii})^{-1} \approx I_{m_i^2} + S_{ii}$, which results in the generalized cross-validation criterion, denoted by iGCV,
$$\mathrm{iGCV} = \sum_{i=1}^{n} \big\| (I_{m_i^2} + S_{ii})(\hat{C}_i - \tilde{C}_i) \big\|^2. \tag{9}$$
Although iGCV is much easier to compute than iCV, the formula in Equation (9) is still computationally expensive. Indeed, the smoother matrix $S$ is of dimension $2{,}500 \times 2{,}500$ if $n = 100$ and $m_i = m = 5$ for all $i$. Thus, we need to simplify the formula further.
Let $\tilde{P}_1 = (B^\top B)^{-1/2} P_1 (B^\top B)^{-1/2} \in \mathbb{R}^{c^2 \times c^2}$, $\tilde{P}_2 = (B^\top B)^{-1/2} P_2 (B^\top B)^{-1/2} \in \mathbb{R}^{c^2 \times c^2}$, and $\Sigma = I_{c^2} + \lambda_1 \tilde{P}_1 + \lambda_2 \tilde{P}_2$. Then, with $\tilde{B}_i = B_i (B^\top B)^{-1/2}$, where $B_i$ is the block of rows of $B$ for subject $i$, the blocks in Equation (9) simplify to
$$S_{ii} = \tilde{B}_i\, \Sigma^{-1} \tilde{B}_i^\top. \tag{10}$$
Note that $\Sigma$ has two smoothing parameters. Following Wood (2000), we use an equivalent parameterization $\Sigma = I + \rho\{w \tilde{P}_1 + (1-w) \tilde{P}_2\}$, where $\rho = \lambda_1 + \lambda_2$ represents the overall smoothing level and $w = \lambda_1 \rho^{-1} \in [0,1]$ is the relative weight of $\lambda_1$. We conduct a two-dimensional grid search of $(\rho, w)$, as follows. For a given $w$, let $U \mathrm{diag}(s) U^\top$ be the eigendecomposition of $w\tilde{P}_1 + (1-w)\tilde{P}_2$, where $U \in \mathbb{R}^{c^2 \times c^2}$ is an orthonormal matrix and $s \in \mathbb{R}^{c^2}$ is the vector of eigenvalues, so that $\Sigma^{-1} = U \mathrm{diag}(\tilde{d}) U^\top$ with $\tilde{d}_l = (1 + \rho s_l)^{-1}$.

Proposition 2. Let $\odot$ stand for the point-wise multiplication. Then, iGCV in Equation (9) admits a closed-form expression in terms of $\tilde{d}$, the $\tilde{B}_i U$'s, and the $\hat{C}_i$'s, in which, for fixed $w$, only $\tilde{d}$ depends on $\rho$.

The proof is provided in Appendix A. For each $w$, note that only $\tilde{d}$ depends on $\rho$ and needs to be calculated repeatedly; all other terms need to be calculated only once. The entire algorithm is presented in Algorithm 1.
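A sketch of the resulting grid search follows; `igcv_fun` stands in for the closed-form iGCV evaluation of Proposition 2, which we do not reproduce here, and the grids are illustrative defaults.

```r
# Two-dimensional grid search over (rho, w), parameterizing
# Sigma = I + rho * {w * P1t + (1 - w) * P2t} as in Wood (2000).
# For each w, a single eigendecomposition makes every rho cheap to try.
search_smoothing <- function(P1t, P2t, igcv_fun,
                             w_grid   = seq(0, 1, by = 0.1),
                             rho_grid = 10^seq(-3, 5, by = 0.5)) {
  best <- list(score = Inf)
  for (w in w_grid) {
    es <- eigen(w * P1t + (1 - w) * P2t, symmetric = TRUE)
    for (rho in rho_grid) {
      d_tilde <- 1 / (1 + rho * es$values)  # only piece that changes with rho
      score <- igcv_fun(es$vectors, d_tilde)
      if (score < best$score) best <- list(score = score, w = w, rho = rho)
    }
  }
  best
}
```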
We now evaluate the complexity of the proposed algorithm. Assume that $m_i = m$ for all $i$. The first initialization (Step 1) requires $O(nm^2c^2 + nc^4 + c^6)$ computations. For each $w$, the second initialization (Step 2) requires $O\{nc^4 \min(m^2, c^2) + c^6\}$ computations. For each $\rho$, Steps 3-8 require $O(nc^4)$ computations. Therefore, the formula in Proposition 2 is most efficient for sparse data with small numbers of observations per subject, that is, when the $m_i$'s are small.

| Prediction
For prediction, assume that the smooth curve $x_i(t)$ is generated from a multivariate Gaussian process. Suppose that we want to predict the $i$th multivariate curve $x_i(t)$ at any time $t \in \mathcal{T}$. Let $y_i$ denote the stacked vector of all observations from subject $i$, with mean $\mu_i$ obtained by evaluating the mean functions at the corresponding observation times. Then $\{x_i(t)^\top, y_i^\top\}^\top$ is jointly Gaussian, and it follows that
$$E\{x_i(t) \mid y_i\} = \mu(t) + \mathrm{Cov}\{x_i(t), y_i\}\, \mathrm{Var}(y_i)^{-1} (y_i - \mu_i),$$
where $\mathrm{Cov}\{x_i(t), y_i\}$ is assembled from the covariance functions $C_{kk'}$ and $\mathrm{Var}(y_i)$ additionally has the noise variances $\sigma_k^2$ on its diagonal. Thus, plugging in the estimates, we predict $x_i$ by
$$\hat{x}_i(t) = \hat{\mu}(t) + \widehat{\mathrm{Cov}}\{x_i(t), y_i\}\, \widehat{\mathrm{Var}}(y_i)^{-1} (y_i - \hat{\mu}_i).$$
The conditional variance $\mathrm{Var}\{x_i(t) \mid y_i\} = C(t,t) - \mathrm{Cov}\{x_i(t), y_i\}\, \mathrm{Var}(y_i)^{-1} \mathrm{Cov}\{x_i(t), y_i\}^\top$ is estimated in the same way. Therefore, a 95% point-wise confidence interval for the $k$th response is given by $\hat{x}_i^{(k)}(t) \pm 1.96\, [\widehat{\mathrm{Var}}\{x_i^{(k)}(t) \mid y_i\}]^{1/2}$. Finally, we predict the first $L \ge 1$ scores $\xi_i = (\xi_{i1}, \ldots, \xi_{iL})^\top$ for the $i$th subject. Note that $\xi_{i\ell} = \int \Psi_\ell(t)^\top \{x_i(t) - \mu(t)\}\,dt$. With a similar derivation as above, $x_i(t) - \mu(t)$ can be predicted by its conditional expectation given $y_i$, and plugging this prediction into the integral yields the predicted scores $\hat{\xi}_{i\ell}$.
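A compact sketch of the plug-in conditional-expectation prediction for one subject; the inputs (mean vectors and covariance blocks assembled from the fitted $\tilde{C}_{kk'}$ and $\hat{\sigma}_k^2$) are assumed precomputed, and the function name is ours.

```r
# BLUP-type prediction of x_i on a grid, with point-wise 95% intervals.
#   mu_grid: mean at grid points;  mu_obs: mean at observation times
#   C_grid:  Cov{x_i(grid), x_i(grid)};  Cov_xy: Cov{x_i(grid), y_i}
#   Var_y:   Cov(y_i) with estimated noise variances added to its diagonal
predict_subject <- function(mu_grid, mu_obs, C_grid, Cov_xy, Var_y, y_i) {
  w    <- solve(Var_y, y_i - mu_obs)
  xhat <- as.vector(mu_grid + Cov_xy %*% w)
  v    <- pmax(diag(C_grid - Cov_xy %*% solve(Var_y, t(Cov_xy))), 0)
  list(pred  = xhat,
       lower = xhat - 1.96 * sqrt(v),
       upper = xhat + 1.96 * sqrt(v))
}
```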

| SIMULATIONS
We evaluate the finite sample performance of the proposed method (denoted by mFACEs) against mFPCA via a synthetic simulation study and a simulation study mimicking the Alzheimer's Disease Neuroimaging Initiative (ADNI) data in the real data example. Here, we report the details and results of the former; the conclusions remain the same for the latter, whose details are provided in the Supporting Information.

| Simulation settings and evaluation criteria
We generate data from model (3) with $p = 3$ responses. The auto-covariance functions are $C_{kk}(s,t) = \Phi_k(s)^\top \Lambda_{kk} \Phi_k(t)$, $k = 1, 2, 3$, where $\Phi_k(t)$ is a vector of orthonormal eigenfunctions and $\Lambda_{kk}$ is a diagonal matrix of eigenvalues. For the cross-covariance functions, we use an analogous construction scaled by $\rho \in [0,1]$, a parameter to be specified. The induced covariance operator from the above specifications is proper; see Lemma 1 in Appendix B. It is easy to derive that the absolute value of the cross-correlation is bounded by $\rho$. Hence, $\rho$ controls the overall level of correlation between responses: if $\rho = 0$, then the responses are uncorrelated with each other.
The eigendecomposition of the multivariate covariance function gives nine nonzero eigenvalues with associated multivariate eigenfunctions; hence, for $\ell = 1, \ldots, 9$, we simulate the scores $\xi_{i\ell}$ from $N(0, d_\ell)$, where the $d_\ell$ are the induced eigenvalues. Next, we simulate the white noises $\epsilon_{ij}^{(k)}$ from $N(0, \sigma_\epsilon^2)$, where $\sigma_\epsilon^2$ is determined according to the signal-to-noise ratio $\mathrm{SNR} = \sum_\ell d_\ell / (p \sigma_\epsilon^2)$. Here, we let $\mathrm{SNR} = 2$. For each response, the sampling time points are drawn from a uniform distribution on the unit interval, and the number of observations for each subject, $m_{ik}$, is generated from a uniform discrete distribution on $\{3, 4, 5, 6, 7\}$. Thus, the sampling points vary not only from subject to subject but also across responses within each subject.
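For reference, a sketch of the sparse sampling design just described; `mu` and `Psi` are placeholders for the mean and eigenfunction specification above, not fixed definitions from the paper.

```r
# Simulate one subject's sparse multivariate observations: scores
# xi_l ~ N(0, d_l), m_ik uniform on {3,...,7}, times uniform on [0, 1].
simulate_subject <- function(p, d, Psi, mu, sigma_eps) {
  xi <- rnorm(length(d), sd = sqrt(d))
  lapply(seq_len(p), function(k) {
    m_k <- sample(3:7, 1)
    t_k <- sort(runif(m_k))
    x   <- mu(t_k, k) + drop(Psi(t_k, k) %*% xi)  # Karhunen-Loeve form
    list(t = t_k, y = x + rnorm(m_k, sd = sigma_eps))
  })
}
```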
We use a factorial design with two factors: the number of subjects $n$ and the correlation parameter $\rho$. We let $n = 100$, $200$, or $400$. We consider $\rho = 0.5$, which corresponds to a weak correlation between responses, as the average absolute correlation between responses is only 0.36, and $\rho = 0.9$, which corresponds to a moderate correlation between responses, as the average absolute correlation is about 0.50.
In total, we have six model conditions, and for each model condition, we generate 200 datasets. To evaluate the prediction accuracy of the various methods, we draw 200 additional subjects as testing data. The true correlation functions and a sample of the simulated data are shown in the Supporting Information.
We compare mFACEs and mFPCA in terms of the estimation accuracy of the covariance functions, the eigenfunctions and eigenvalues, and the prediction of new subjects. For covariance function estimation, we use the relative integrated squared errors (RISE). Let $\hat{C}_{kk'}(s,t)$ be an estimate of $C_{kk'}(s,t)$; then the RISE are given by
$$\mathrm{RISE}(\hat{C}_{kk'}) = \frac{\int\!\!\int \{\hat{C}_{kk'}(s,t) - C_{kk'}(s,t)\}^2\, ds\, dt}{\int\!\!\int C_{kk'}(s,t)^2\, ds\, dt}.$$
For predicting new curves, we use the mean integrated squared errors (MISE), which are given by
$$\mathrm{MISE}(\hat{x}^{(k)}) = \frac{1}{n} \sum_{i=1}^{n} \int \big\{ \hat{x}_i^{(k)}(t) - x_i^{(k)}(t) \big\}^2\, dt.$$
For the curve prediction using mFPCA, we truncate the number of principal components using a PVE of 0.99. It is worth noting that if no truncation is adopted, then the curve prediction using mFPCA reduces to curve prediction using univariate FPCA. We shall also consider the conditional expectation method based on the estimates of the covariance functions from mFPCA. This method is denoted by mFPCA(CE), and its difference from mFACEs is that different estimates of the covariance functions are used.
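Approximated on an equally spaced grid, RISE is a one-liner in R (the grid spacing cancels in the ratio):

```r
# RISE for one covariance surface, with C_hat and C_true evaluated on the
# same equally spaced grid of (s, t) values
rise <- function(C_hat, C_true) sum((C_hat - C_true)^2) / sum(C_true^2)
```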

| Simulation results

Figure 1 gives boxplots of the RISEs of mFACEs and mFPCA for estimating the covariance functions. Under all model conditions, mFACEs outperforms mFPCA, and the improvement in RISEs as the sample size increases is much more pronounced for mFACEs. Under the model conditions with moderate correlations (ρ = 0.9), the advantage of mFACEs is substantial even for the small sample size n = 100. We next examine the estimation of the top two eigenfunctions and eigenvalues, which account for about 60% of the total variation in the functional data for ρ = 0.5 and about 80% for ρ = 0.9. Figure 2 shows that, although the two methods are overall comparable for estimating the first eigenfunction, mFACEs has much better accuracy for estimating the second eigenfunction than mFPCA. The violin plots in Figure 3 show that mFACEs outperforms mFPCA substantially for estimating both eigenvalues under all model conditions. mFPCA always underestimates the eigenvalues because the variation of the scores from univariate FPCA is smaller than the true variation.
Finally, we consider the prediction of new subjects by mFACEs, mFPCA, and mFPCA(CE). We define the relative efficiency of a method as the ratio of its MISEs to those of univariate FPCA; see Figure 4. Univariate FPCA is implemented in the R package face. We have the following findings. Under all model conditions, mFACEs has the smallest MISE, mFPCA(CE) has the second-best performance, and mFPCA is close to univariate FPCA. Thus, on average, mFACEs provides the most accurate curve prediction.
These results indicate that (a) mFACEs has better covariance estimation than mFPCA(CE), and so is the prediction based on it, and (b) compared with mFPCA/univariate FPCA, mFPCA(CE) exploits the correlation information and hence results in better predictions.
In summary, mFACEs shows competitive performance against the alternative methods.

| APPLICATION TO ALZHEIMER'S DISEASE STUDY
The ADNI is a two-stage longitudinal observational study launched in 2003 with the primary goal of investigating whether serial neuroimaging, biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of AD (Weiner et al., 2017). The ADNI-1 data from the first stage contain 379 patients with amnestic mild cognitive impairment (MCI, a risk state for AD) at baseline who had at least one follow-up visit. Participants were assessed at baseline and at 6, 12, 18, 24, and 36 months, with additional annual follow-ups included in the second stage of the study. At each visit, various neuropsychological assessments, clinical measures, and brain images were collected. The ADNI-2 data include 424 additional patients suffering from MCI and significant memory concern, with at least one follow-up visit and longitudinal data collected over 4 years. Thus, for the combined data, the total number of subjects is 803, and the average number of visits is 4.72. The data are publicly available at http://ida.loni.ucla.edu/.
We consider five longitudinal markers commonly measured in studies of AD with strong comparative predictive value, including the 13-item Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog 13), the Functional Activities Questionnaire (FAQ), the Rey Auditory Verbal Learning Test (RAVLT), and the Mini-Mental State Examination (MMSE).

| Multivariate FPCA via mFACEs
We analyse the five longitudinal biomarkers using mFACEs. For better visualization, we plot in Figure 5 the estimated correlation functions of the five biomarkers. The estimated eigenfunctions characterize different modes of disease progression. Specifically, subjects with a positive score for the second eigenfunction will have higher ADAS-Cog 13/FAQ and lower RAVLT and MMSE over the months, suggestive of AD progression. Finally, we illustrate in Figure 7 the predicted curves along with the associated 95% point-wise confidence bands for three subjects. We focus on predicting the trajectories over the first 4 years, for which there are more observations. We can see that the confidence bands become wider at the later time points because fewer observations are available there.

| Comparison of prediction performance of different methods
We compare the proposed mFACEs with mFPCA and mFPCA(CE) for predicting the five longitudinal biomarkers. The prediction performance is evaluated by the average squared prediction error (APE),
$$\mathrm{APE}_k = \frac{\sum_{i} \sum_{j} \big\{ y_{ij}^{(k)} - \hat{y}_{ij}^{(k)} \big\}^2}{\sum_i m_{ik}},$$
where $\hat{y}_{ij}^{(k)}$ is the predicted value of the $k$th biomarker for the $i$th subject at time $t_{ij}^{(k)}$ and the sums run over the held-out observations. We conduct two types of validation: an internal validation and an external validation. For the internal validation, we perform a 10-fold cross-validation on the combined data of ADNI-1 and ADNI-2. For the external validation, we fit the model using only the ADNI-1 data and then predict the ADNI-2 data. Figure 8 summarizes the results. For simplicity, we present the relative efficiency of APE, which is the ratio of the APE of each method against that of mFPCA. In both cases, mFACEs achieves better prediction accuracy than the competing methods. Note that mFPCA(CE) outperforms mFPCA for predicting almost all biomarkers. The results suggest that (a) mFACEs is better than the competing methods for analysing the longitudinal biomarkers and (b) exploiting the correlations between the biomarkers improves prediction.
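A sketch of the subject-level fold assignment used for the internal validation (our own helper, not from any package):

```r
# Assign the n subjects to K folds at random; prediction errors are then
# averaged over held-out visits to form the APE for each biomarker
cv_folds <- function(n, K = 10, seed = 1) {
  set.seed(seed)
  split(sample(seq_len(n)), rep_len(seq_len(K), n))
}
```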

| DISCUSSION
The prevalence of multivariate functional data has sparked much research interest in recent years. However, covariance estimation for multivariate sparse functional data remains underdeveloped. We proposed a new method, mFACEs, whose features include the following: (a) a covariance smoothing framework is proposed to tackle multivariate sparse functional data; (b) an automatic and fast fitting algorithm is adopted to ensure the scalability of the method; (c) eigenfunctions and eigenvalues can be obtained through a one-time spectral decomposition, and the eigenfunctions can be easily evaluated at any sampling points; and (d) a multivariate extension of the conditional expectation approach (Yao et al., 2005) is derived to exploit correlations between outcomes. The simulation study and the data example showed that mFACEs better captures between-function correlations and thus gives improved principal component analysis and curve prediction.
When the magnitudes of the functional data are quite different, one may first normalize the functional data. One method of normalization is to rescale the functional data using the estimated variance function, that is, by $\hat{C}_{kk}(t,t)^{-1/2}$, as in Jacques and Preda (2014). An alternative method is to use a global rescaling factor such as $\{\int \hat{C}_{kk}(t,t)\, dt\}^{-1/2}$, as in Happ and Greven (2018). Both methods can be easily incorporated into our proposed method. In our data analysis, we find that the results with normalization are very close to those without normalization; thus, we present the results without normalization.
Because multivariate FPCA is more complex than univariate FPCA, weak correlations between the functions and small sample sizes may offset the benefit of conducting multivariate FPCA. Thus, it is of future interest to develop practical tests to determine whether the correlations between multivariate functional data are different from 0.
The mFACEs method has been implemented in the R package mfaces, which will be submitted to CRAN for public access.

DATA AVAILABILITY STATEMENT
Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this paper. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

APPENDIX A

Proof of Proposition 1. For $\Psi_\ell^{(k)}(t) = b(t)^\top G^{-1/2} u_\ell^{(k)}$, the eigen-equation $\Gamma \Psi_\ell = d_\ell \Psi_\ell$ reads, component-wise, $\sum_{k'} G^{1/2} \Theta_{kk'} G^{1/2} u_\ell^{(k')} = d_\ell u_\ell^{(k)}$. As $1 \le k, k' \le p$, the above is equivalent to $\bar{\Theta} u_\ell = d_\ell u_\ell$. Because of Equation (B1), the $u_\ell$'s are orthonormal eigenvectors of $\bar{\Theta}$ with the $d_\ell$'s the corresponding eigenvalues. The proof is now complete.
Proof of Proposition 2. By Equation (10), $S_{ii} = \tilde{B}_i U \mathrm{diag}(\tilde{d})\, U^\top \tilde{B}_i^\top$, and the closed-form expression for iGCV follows by expanding the squared norms in Equation (9); the matrix $I_{m_i^2} + S_{ii}$ appearing there is always positive semidefinite, and the proof is complete.