Trend Filtering for Functional Data

Despite increasing accessibility to function data, effective methods for flexibly estimating underlying functional trend are still scarce. We thereby develop functional version of trend filtering for estimating trend of functional data indexed by time or on general graph by extending the conventional trend filtering, a powerful nonparametric trend estimation technique, for scalar data. We formulate the new trend filtering by introducing penalty terms based on $L_2$-norm of the differences of adjacent trend functions. We develop an efficient iteration algorithm for optimizing the objective function obtained by orthonormal basis expansion. Furthermore, we introduce additional penalty terms to eliminate redundant basis functions, which leads to automatic adaptation of the number of basis functions. The tuning parameter in the proposed method is selected via cross validation. We demonstrate the proposed method through simulation studies and applications to real world datasets.


Introduction
Due to advances in measurement devices and data storage resources, it is nowadays possible to observe functions as realizations of random experiments and thus functional data analysis (FDA) has expanded rapidly in recent decades.Functional versions for many branches of statistics have been provided, for example, in Ramsay (2004), Kokoszka and Reimherr (2017) and Horváth and Kokoszka (2012).
The conventional techniques of FDA for independent functional data have been recently extended to dependent situations (both time series and spatial cases).In fact, for functional time series data, standard stationary models for multivariate data have been extended (e.g.Besse et al., 2000;Klepsch and Klüppelberg, 2017;Klepsch et al., 2017;Hörmann et al., 2013;Gao et al., 2019;Hörmann et al., 2015) and theoretical properties have also been widely investigated (e.g.Bosq, 2000;Aue and Klepsch, 2017;Spangenberg, 2013;Aue et al., 2017;Kühnert, 2020;Cerovecki et al., 2019).On the other hand, effective estimations of functional trend under non-stationary situations are not well developed despite their importance in real applications.van Delft et al. (2018) addressed a framework for locally stationary functional times series, but its flexibility for trend estimation is still limited.Regarding spatial functional data, while spatial interpolation methods under spatial stationary have been developed (e.g.Giraldo et al., 2011;Nerini et al., 2010), there are some attempts to estimate non-stationary spatial trend determined by some external covariates (e.g.Caballero et al., 2013;Menafoglio et al., 2013Menafoglio et al., , 2016)).However, many flexible estimation methods for spatially varying trend are not considered.
Although many useful tools are available in FDA, the flexibility of existing methods may be limited; that is, handling abrupt changes in a trend is challenging.Hence the need for locally adaptive smoothing methods arises.For univariate time series, trend filtering (Kim et al., 2009;Tibshirani et al., 2014) is recognized as a powerful tool for locally adaptive trend estimation.Additionally, Wang et al. (2016) extended trend filtering to spatial data, which enables us to estimate spatial trend with abrupt changes revealed.
In this work, we provide an effective local smoothing method for functional time series data by extending trend filtering for scalar data.Combining L 2 -loss and L 2 -norm penalty terms for differences of adjacent functions, we successfully define the objective function for functional trend filtering.To solve the optimization problem, we expand the functional data via orthonormal basis functions and transform the objective function.We find that this transformed objective function is a mixture of the fused lasso (Tibshirani et al., 2005) and the grouped lasso (Yuan and Lin, 2006;Lounici et al., 2011;Tibshirani, 1996).This is rather different from the case of scalar, where only fused lasso-like penalties are considered.
We then develop an iterative algorithm based on the idea of ADMM (Boyd et al., 2011;Ramdas and Tibshirani, 2016), in which each updating procedure can be easily carried out.Furthermore, to satisfy a demand for selecting the optimal number of basis functions, we additionally construct an trend estimator.This also contributes to the denoising of the observed functional data.We also extend functional trend filtering from time series data to data on a graph and analyze spatial functional data.As for the tuning parameter selection, we simply suggest using cross validation, which is fairly feasible owing to the efficient optimization algorithm.
The remainder of the paper is organized as follows.Section 2 offers a brief review of trend filtering and its periphery, which is deeply related to our work.In Section 3, we present the methods, functional trend filtering, for both functional time series and spatial data, and describe the algorithm to carry out the proposed method.Also we discuss the selection of the number of basis functions.In Section 4, we compare the proposed method with some existing approaches through simulation studies.In Section 5, we apply the proposed method to functional time series (fertility rates as a function of age in each year) and functional spatial data (the number of confirmed COVID-19 cases as a function of day in Japanese prefectures).Finally we conclude with a discussion in Section 6 2 Review of trend filtering for scalar data Before describing the proposed methods for functional data, we briefly review trend filtering known as a powerful tool for locally adaptive smoothing for scalar time series data.Let y 1 , . . ., y T be a sequence of observations, and we are interested in denoising the observations to estimate the underlying trend denoted by β = (β 1 , ..., β T ) .The kth order trend filtering (Kim et al., 2009;Tibshirani et al., 2014) is defined as the minimizer of the following objective function: where λ ≥ 0 is a tuning parameter which controls the trade-off between the fit to the observed data and smoothing the trend estimation.Here ∆ (k) t is the tth row vector of the kth order discrete difference operator matrix ∆ (k) defined as For example, (1) with k = 0 is the same form of fused lasso (Tibshirani et al., 2005) and the penalty makes many differences to zero exactly and leave others nonzero values, leading to piece-wise constant estimation of β.In general, sparsity of β under kth order discrete difference operator matrix suggests that the estimated components have a specific kth order piece-wise polynomial structure (Tibshirani et al., 2014).While trend filtering is locally adaptive estimator defined by a regularization problem with nonsmooth penalty, it is still computationally efficient owing to its convexity.
Trend filtering is also applicable to spatial data (Wang et al., 2016).Let V = {1, . . ., n} be a set of sample index, which can be regard as vertex of graph G = (V, E).Here E = {e 1 , . . ., e m } is a set of undirected edges according to the spatial adjacent structure, where e h ∈ V ×V for h = 1, . . ., m and m is the total number of adjacency relationships.For example, e h = (i, j) means that ith and jth locations are adjacent.Let ∆ (0) ∈ {1, 0, −1} m×n be the oriented incidence matrix of the graph G, that is, ∆ hj = −1 and the other elements in the hth row vector of ∆ (0) is 0 if e h = (i, j).We then define k) for even k, for odd k.
This ∆ (k) is hereinafter referred to as the kth order graph difference operator matrix.
The kth order spatial trend filtering estimate is obtained as the minimizer of the following function: where λ is a tuning parameter.We remark it is also a form of fused lasso and accordingly it can be solved by basic convex optimization algorithms.Wang et al. (2016) discusses the computational aspect in detail.Notably, the penalty term encourages sparsity in graph differences in trend, which yields a piece-wise polynomial nature of the estimator as the original trend filtering (1).

Functional trend filtering
We will develop the method discussed above into functional data, that is, find a trend among functions.

Settings and objective function
Let (Ω, A, P ) be an arbitrary probability space.The space L 2 (X ) is defined as the set of all real valued square integrable functions on a compact set X ⊂ R. It is a Hilbert space with norm f L 2 = ( X f 2 (x)dx) 1/2 , which is induced by the inner product f, g = X f (x)g(x)dx for f, g ∈ L 2 (X ).We consider a serially indexed collection {Y t (•) : t = 1, ..., T } of random functions defined on the same probability space: Y t : Ω → L 2 (X ) is a measurable map.{y t (•) : t = 1, ..., T } denote a set of observed functions.The trend of propose the kth order f unctional trend f iltering defined as where λ > 0 is a tuning parameter and ∆ (k) t is the tth row vector of the kth order difference operator matrix ∆ (k) .For instance, if k = 0, the penalty is T −1 t=1 β t − β t+1 L 2 , which can be regarded as the functional version of the group fused lasso (Alaíz et al., 2013).
Henceforth, we have to solve the functional version of the group fused lasso with general order k, desiring that some elements of {∆ Since the observations and the true functions are infinite dimension and difficult to handle, we first prepare L orthonormal basis functions {φ : = 1, . . ., L} on L 2 (X ), which satisfy Then, using the approximate expansions y t (x) ≈ L =1 z t φ (x) and β t (x) ≈ L =1 b t φ (x) with z t = y t , φ and b t = β t , φ for all t and , we reduce problem (2) to minimization of the objective function The above objective function can be further rewritten as where • 2 denotes the 2 norm.Focusing on the latter half, we notice it is a mixture of kth order fused lasso type penalty and grouped lasso type penalty.In the case of scalar (Kim et al., 2009;Tibshirani et al., 2014), it is enough to consider fused lasso-like penalty, but in the case of functional data, mixture of group lasso and fused lasso-like penalty is necessary.
is differentiable with respect to b , but ∆ (k) t B 2 is not separable, namely, it cannot be represented as sum of the univariate convex function of each b t .Accordingly, solving this problem is not straightforward extension of scalar version and requires ingenuity.

Optimization
For notational simplicity, we use ∆ instead of ∆ (k) in what follows.To optimize (3), we first introduce two unit vectors, e a t ∈ R T −k−1 and e b t ∈ R L , whose only tth and th elements are 1, respectively, and the other elements are 0.
Since e b ∆ t B = e a t ∆b , we rewrite the optimization of (3) with respect to {b } as the following constraint optimization problem: subject to e b a t = e a t ∆b ( = 1, ..., L and t = 1, ..., T − k − 1).
Note that the objective function is similar to one for alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011;Ramdas and Tibshirani, 2016), which breaks the problem into smaller pieces that are easier to deal with.We then define an augmented Lagrangian function where {u t } t, is Lagrange multipliers and ρ > 0 controls the influence of the violation of equality constraint.Since there is no apparent closed form solution for b minimizing the objective function L ρ , we develop an iterative algorithm outlined in Algorithm 1, where the derivation is deferred to Appendix.The convergence of Algorithm 1 is empirically confirmed.From the existing theory of ADMMirically confirmed.algorithm, the choice of ρ is related only to the speed of convergence of the algorithm without affecting the final estimates (e.g.Boyd et al., 2011;Fukushima, 1992;He et al., 2000).In our implementation, we simply set ρ = 0.1.
Using the coefficients b TF computed by the procedure, we obtain the function for t = 1, ..., T , which is the estimator of the trend.
As an alternative smoother, we construct a simplified version of the trend estimation The difference between ( 5) and ( 2) is the penalty.Specifically, the penalty in ( 5) is the squared value of the L 2 -norm used in (2).Since the estimator defined in (5) can be regarded as an extension of Hodrick-Prescott (HP) filter (Hodrick and Prescott, 1997), we refer to this method as functional HP filter.Although the use of squared norm penalty does not produce sparsity in the differences, the method is easy to implement.In fact, by expanding (5) via orthonormal functions, we have the following approximation of the objective function: which yields the closed form solution given by b HP = I T + 2 In some situations, it works better than functional trend filtering.(Details are given in Section 4.) Finally, we discuss the choice of tuning parameter λ.In practice, the value of λ used for filtering is determined by K-fold cross validation, where we divide the dataset into K subsets by extracting every Kth function.As the estimate of b t in a validation dataset, we take the midpoint of b t−1 and b t+1 after smoothing.
Suppose that E[Y i (x)] = β i (x) for i = 1, • • • , n and we are interested in the estimation of β i (x).Let ∆ (k) be kth order graph difference operator matrix defined in Section 2. We propose the kth order f unctional trend f iltering on graph to estimate β = (β 1 , . . ., β n ) by where q = n if k is odd, q = m otherwise.The penalty quantifies how much β vary locally in the sense of kth order graph differences.We prepare L orthonormal basis functions φ for all i and .This is an extension of (Wang et al., 2016) to functional data, but the optimization is far more complicated, as we showed in time series.
In the following discussion in this section, we write ∆ for ∆ (k) p .Define two standard unit vectors e a p ∈ R q and e b ∈ R L , whose only tth and th elements are 1, respectively, and the other elements are 0, and b = (b 1 , . . ., b n ) ∈ R n .Following the same logic as the previous section, we regard the problem to find (7) as a problem to get {b } minimizing an augmented Lagrangian, for a parameter ρ > 0, To solve this problem, we can again utilize Algorithm 1.Using the acquired coefficients b TF computed by the procedure, we obtain the function for i = 1, ..., n.This is the estimator of the proposed method.
As an alternative smoothing method, we also propose an estimator: It corresponds to (5) in time series setting, or, Laplacian regularization (Smola and Kondor, 2003) for univariate data.By treatment with the same approximation as the former section, we convert the problem into the optimization problem with objective function: For = 1, ..., L, the closed form solution is given by Consequently, we obtain the estimator

Selection of the number of basis via additional regularization
The methods introduced so far are established by orthonormal basis expansion.What we need to be careful about is the necessity of choosing the number of basis functions in practice since the optimal number of basis functions depends on the complexity of the function.Here, we solve this challenge by constructing the following estimator: where λ and ψ are tuning parameters, and ω is a fixed weight for the th coefficients.When we use the principal components as basis functions, we set ω as the inverse values of the proportion of variance.The crux of this method is the last L terms.These are in the form of a group lasso, where the coefficients of each basis function are one group.This makes all the coefficients of the unnecessary basis zero and only the necessary part remains, thus allowing the selection of the number of basis.In practice, we select many basis functions beforehand and regard survivors as the essential basis functions.Also reducing unnecessary Algorithm 2 (Sparse functional trend filtering) t , u < update a t > Same as Algorithm1.
< update u t > Same as Algorithm1.
end while principal components promotes smoothing with respect to x-direction of the function while the middle terms of (9) make estimator smooth with respect to t-direction.In what follows, this method is referred as sparse f unctional trend f iltering.
For the algorithm of this method, it is sufficient to modify the way {b } is updated in Algorithm 1. See Appendix 2 for details.We also select the additional tuning parameter ψ by K-fold cross validation.

Simulation Studies
In the previous chapters, we develop the two methods.An overview of the simulation is presented in Section 4.1.In Section 4.2, to give a fair comparison of functional trend filtering and other methods, we fix the same number of basis functions for all methods.
In Section 4.3, we implement sparse functional trend filtering and another method and compare their performance.

Procedure
We investigated the performance of the proposed methods together with existing ones through simulation studies.For t = 1, ..., T (= 50) and the domain X = [1, 120], we adopted the four scenarios of the true trend function: (1) Constant: (2) Smooth: (3) Piecewise constant: (4) Varying smoothness: where f 1 , ..., f 5 are sample paths of the Gaussian process associated to RBF kernel k(x 1 , x 2 ) = In scenario (1), we examine the abilities of the methods to find the horizontal line in the presence of noise.In scenario (2), we investigate whether the adaptive methods extract the continuous curve from the noisy data.Scenario (3) unearths the capability of the methods to spot the sharp changes, the points of discontinuities, between intermittent straight horizontal lines.In scenario (4), we test the abilities to catch the trend when the smoothness of the process varies significantly due to a sharp peak in the middle as a function of t.

Functional trend filtering
For the simulated data, we apply the following three methods: -FTF: Functional trend filtering with k ∈ {0, 1, 2}.
-FPC: The standard functional principle component method using R package "fda.usc".
Note that we used the estimated principle component functions by FPC as orthonormal functions for FTF and FHP with L = 5 (the number of principle functions) to allow comparison independent of basis functions.By 10-fold cross-validation, we select the tuning parameter λ from the space [10 −3 , 10 3 ] by checking 60 points equally spaced on a logarithmic scale in all scenarios.The estimated trend functions at x = 40 are presented in Figure 2. Based on 150 times repeated simulation, we also report the mean squared error (MSE): in Table 1, where β t (x) is the estimated function.Overall, the proposed FTF tended to perform better than the other methods.Further, we can see from Figure 2 that FPC provided under-smoothed trend estimate compared with FTF and FHP, which is related to the overall performance in terms of MSE reported in Table 1.
Interestingly, the performance of FTF and FHP were quite different, although the only methodological difference is whether L 2 -norm or squared L 2 -norm is adopted in the penalty.
For example, in scenario 3, the performance of FHP was almost the same as that of FPC while FTF provided better results.This is attributed to the fact that FHP does not produce sparsity.Regarding the performance of FTF depending on k, it is observed that FTF with k = 0 provided the most accurate results in scenario 3 since the true trend admits a piecewise constant structure that FTF with k = 0 is considered to work well.In the other scenarios, however, the piecewise constant structure seems rather limited, and the performance of FTF with k = 1, 2 is more appealing.It is worth noting that FTF with k = 1 performed outstandingly well in scenario 4. Around the peak of scenario 4, the smoothness of the trend changes abruptly.The change in smoothness is nearly equal to the change in the amount of difference.Hence, the sharp peak of scenario 4 is the point where the supremacy of FTF exists.By contrast, in Scenario 1 and Scenario 2, FTF is slightly inferior to FHP.One possible reason is that FTF is a numerical solution obtained by iterative approximation while FHP is an analytical solution.

Sparse functional trend filtering
The above simulation showed that FTF accurately estimated the trend even with sudden changes.However, we need to choose the appropriate number of basis functions.Hence, we applied SFTF, introduced in Section 3.4, and investigated whether the number of basis functions could be selected.Specifically, we set L = 10 first, and then applied SFTF and cut off unnecessary basis functions.We implemented 150 simulations and calculated MSE of SFTF and mean of the number of basis functions.We searched for the optimal values of (λ, ψ) from the space [10 −3 , 10 3 ] × [10 −1 , 10 1 ] by checking 60 × 20 points equally spaced on a logarithmic scale in all scenarios.As a competitor, we apply an advanced method, dynamic functional principal component analysis (DFPC), which incorporate serial dependence.The R package "freqdom.fda"does not mention anything about parameter selection.Then, we set the parameters to minimize the MSE.Namely, we compare SFTF to this DFPC with oracle parameters.
Table 2 presents the MSE of DFPC and that of SFTF.We chose parameters that favored DFPC, but SFTF dominated it and hence the superiority of SFTF is solidified.Moreover, the number of basis functions whose coefficients were not set to zero by the SFTF was fairly smaller than 10.In particular, in scenario 2 and 4, the number of selected components was far lower than 5, indicating that many unnecessary components were used for simple FTF.
This implies that the accuracy of SFTF was substantially improved by that amount.Hence choosing the number of basis functions by excluding redundant ones plays a critical role in increasing the accuracy.

Australian fertility rates
Fertility rates in Australia have been declining seriously as in other developed countries.
We examined the data "Australiasmoothfertility", which is available from R package "rain- Here we applied SFTF to the dataset and set L = 10 first.We selected tuning parame-  the structure changes at 24th point (i.e.around 1945), implying that the trend of the fertility rate changed after the end of World War II.Owing to the sparsity in differences in trend, we easily detect those underlying events.In addition, since the detected points from the plots tend to be overlapped between k = 1 and 2 in Figure 5, trend filtering would be able to stably extract the turning points regardless of the order k.By contrast, we hardly find the structural properties of data from the original scores of the principal component.
Therefore, the result proves the ability of trend filtering to catch sharp changes.].We also applied FTF with k = 1, but the result is almost the same, thereby we do not display it here.
For a qualitative visual analysis, although FTF was smooth trend better than FPC, whose result was still jagged, as we have expected, trend filtering was more effective in that it spotted an outstanding (dark colored) prefecture, Tokyo.Evidently, FTF is able to localize its estimates around strong inhomogeneous spikes, which implies that it is able to detect the event or spot of interest.

Discussion
In this paper, we proposed a functional version of the locally adaptive smoothing technique known as trend filtering for smoothing functional time series and spatial data.The need to consider group lasso + fused lasso like penalty allows for a trivial extension of the scalar version, but we developed an efficient optimization algorithm to obtain trend estimation and discussed the choice of tuning parameter.Through simulation and empirical studies, we demonstrated the superiority of the proposed method to existing methods.
Moreover, in time series data, we can select the number of basis functions by adding a penalty.The reduction of unnecessary basis functions denoises the functions themselves, whereas trend filtering is smoother with respect to time direction.As a result, the performance of the simulation is also improved, showing that choosing the number of basis functions is better than just taking more basis functions.On the whole, penalty is the key to the methods we developed.
The optimization problem for computing the proposed method can be regarded as generalization and combination of grouped and fused lasso estimation, thereby it would be interesting to apply the proposed optimization techniques to other statistical problems, for example, regression analysis with complicated sparsity-inducing penalty functions.

Figure 1 :
Figure 1: Each surface represents a three-dimensional plot of the true trend.

Figure 2 :
Figure 2: Plots show data points and the fitted results of the three methods, FPC, FTF and FHP at x = 40 under four scenarios with noise level σ = 5.The order k of the methods is 0, 2, 0 and 1 for each scenario.
bow".The original data, obtained from the Australian Bureau of Statistics, describes the age-specific number of live births per 1000 females of ages 15, 16, ..., 49 from 1921 to 2015.The data is functional data and each function represents the age-specific number between 15 and 49 in a year.Fig 3 shows the curves with rainbow colors.The colors indicate that the oldest curve is red, the newest curve is purple and the others are colored in the same order as a rainbow.

Figure 3 :
Figure 3: Age-specific Australian fertility rates curves for ages 15 to 49 observed from 1921 to 2015 (in the same order as the color in a rainbow).

Figure 4 Figure 4 :
Figure4shows the number of births per 1000 females of ages 20 and 30 in all years and curves fitted by FPC and SFTF.Figure5displays absolute values of 1st order differences and 2nd order differences in scores of the first principal component (PC1) between the years and their trend filtered versions.First, compared with FPC, the ability of SFTF to serve as a smoother is confirmed from Figure4.It eliminates small noises, but retains the significant change points.Next, in common between age 20 and 30 in Figure4, we find abrupt changes in 1961 and 1972.After World War II, the fertility rate had increased until 1961, although the first oral contraceptive pill was released in Australia in 1961.Furthermore, in 1972, the prime minister of Australia at that time abolished the 27.5 percent luxury tax on all contraceptives(McLennan, 1998).It increased the use of the pills especially among young people and the on 395th day are shown in the upper left panel in Figure 6.We plot the data fitted by FPC in the upper right panel and FTF with k = 2 in the lower left panel, where the value of λ was selected as the argument of the minimum MSE from [10 −3 , 10 3

Figure 6 :
Figure 6: Top left: The observed number of infected people by prefecture on day 395.Top right: The number of infected people by prefecture on day 395, smoothed by FPC.Bottom left: The number of infected people by prefecture on day 395, smoothed by 2nd order FTF.