A generalized Fellner-Schall method for smoothing parameter estimation with application to Tweedie location, scale and shape models

We consider the estimation of smoothing parameters and variance components in models with a regular log likelihood subject to quadratic penalization of the model coefficients, via a generalization of the method of Fellner (1986) and Schall (1991). In particular: (i) we generalize the original method to the case of penalties that are linear in several smoothing parameters, thereby covering the important cases of tensor product and adaptive smoothers; (ii) we show that the method's steps increase the restricted marginal likelihood of the model, that it tends to converge faster than the EM algorithm or obvious accelerations of it, and we investigate its relation to Newton optimization; (iii) we generalize the method to any Fisher regular likelihood. The method represents a considerable simplification over existing methods of estimating smoothing parameters in the context of regular likelihoods, without sacrificing generality: for example, it is only necessary to compute with the same first and second derivatives of the log-likelihood required for coefficient estimation, and not with the third or fourth order derivatives required by alternative approaches. Examples are provided which would have been impossible or impractical with pre-existing Fellner-Schall methods, along with an example of a Tweedie location, scale and shape model which would be a challenge for alternative methods.


Introduction
This paper is about a very simple method for estimating the smoothing parameters and certain other variance parameters of models with a regular log likelihood, subject to quadratic penalization. The method generalizes the method of Fellner (1986) and Schall (1991), by extending the range of smooth model terms with which it can deal, and generalizing beyond the GLM setting to models with any Fisher regular likelihood. The advantage of the Fellner-Schall method is that it offers a simple explicit formula by which smoothing and variance parameters can be iteratively updated using essentially the same quantities anyway required in order to estimate the model coefficients. This has led to its use with smooth additive models, by Rigby and Stasinopoulos (2013) amongst others. However the original method has some disadvantages. Firstly, it lacks generality, applying only to smooth terms each having a single smoothing parameter, so that tensor product smooth interactions and adaptive smoothers cannot be employed. Rodríguez-Álvarez et al. (2015) partially remove this restriction for some tensor product smooths, but what we propose here is both simpler and more general. Secondly, the original method only applies to GLM type likelihoods, with application beyond that setting relying on treating linearized approximations as Gaussian. Again what we propose is simpler and more general. Thirdly, the original method's derivations, while plausible, do not prove that the method increases the model restricted likelihood at each step, nor offer any insight into convergence rates. We address these issues. In short, it was possible to object that Fellner-Schall methods for updating smoothing parameters were somewhat ad hoc and insufficiently general. This paper largely removes these objections. In part, we were motivated to undertake this work by problems in fisheries stock assessment. For example, Figure 1a shows data from a 2010 survey for mackerel eggs off the coast of western Europe.
Such surveys are undertaken in order to help estimate the mass of spawning adults that must be present, and generalized additive models provide suitable spatial models for the mean egg density. As with most fisheries data, the egg counts tend to be highly over-dispersed relative to a Poisson distribution, and a Tweedie distribution (Tweedie, 1984) based model typically offers a much better fit: the variance of a Tweedie random variable y_i, with mean µ_i, is given by var(y_i) = φµ_i^p, where φ and p (here 1 < p < 2) are parameters. An important biological feature is that mackerel are known to favour spawning grounds close to the continental shelf edge, for which the 200m depth contour offers a reasonable proxy. However if mackerel are responding to sea depth, there is no good reason to suppose that this response leads only to a change in the mean density of eggs in the water column: other aspects of the distribution shape are also likely to be affected, and a reasonable model would allow the parameters p and φ to vary smoothly as sea depth varies.
In principle such a model would lie in the GAMLSS class of Rigby and Stasinopoulos (2005) and smoothing parameters could be estimated by the method of Wood et al. (2016). However, as yet there is no publicly available software for estimating a Tweedie location scale and shape model. The problem is that the Tweedie density does not have an explicit form. Rather, it involves a normalizing constant which is a function of p and µ and is computable by summing an infinite series 'from the middle'. Dunn and Smyth (2005) provide the details, while Wood et al. (2016) show how to obtain first and second derivatives of the log density with respect to p and µ: considerable care has to be taken to ensure that the computations maintain numerical stability. The smoothing parameter estimation methods of Wood et al. (2016) would require third and fourth derivatives of the log Tweedie density, and as yet there are no published methods for stable evaluation of these. Hence it would be useful to have a smoothing parameter estimation method that is general enough to encompass a Tweedie location scale and shape model, while avoiding the need for higher derivatives of the log density.
To introduce the smoothing parameter estimation problem in more detail, first consider the simple case of a Gaussian additive model for a univariate response variable,

y_i = A_i θ + Σ_j g_j(x_ji) + ε_i,    (1)

where A_i is the i-th row of a parametric model matrix, θ is a vector of unknown coefficients, g_j is a smooth function of (possibly multivariate) covariate x_j, and the ε_i are independent N(0, σ²) random deviates. The g_j can be represented using reduced rank spline bases, with associated quadratic penalties penalizing departure from smoothness during fitting. For example g_j(x) = Σ_k b_k(x)γ_k, where the b_k are spline basis functions and the γ_k are coefficients: the associated smoothing penalty is then λ_j γ^T S̄_j γ, where S̄_j is a fixed matrix, and is usually rank deficient because some functions are treated as 'completely smooth'. λ_j is a smoothing parameter controlling the strength of penalization during fitting. In general each g_j may have several penalties. It is now well established (e.g. Kimeldorf and Wahba, 1970; Silverman, 1985; Ruppert et al., 2003) that the smoothing penalties can be viewed as being induced by improper Gaussian prior distributions on the spline coefficients, in which case (1) can be re-written as the linear mixed effects model

y = Xβ + ε,   β ∼ N(0, S_λ^- σ²),   ε ∼ N(0, Iσ²),    (2)

where σ² and λ are parameters, β is a coefficient vector containing θ and the coefficients for each smooth term, and X is an n × p model matrix, containing A and the evaluated basis functions of the smooth terms. S_λ is a positive semi-definite precision matrix, with Moore-Penrose pseudoinverse S_λ^-. Let S_j be S̄_j padded out with zeroes, so that β^T S_j β = γ^T S̄_j γ, where γ is the coefficient vector for g_j. Then S_λ = Σ_j λ_j S_j (some g_j may each be penalized by several terms in this summation). The null space of S_λ is interpretable as the space of model fixed effects, whereas the range space is the space of random effects.
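The zero-padding construction above can be made concrete with a small sketch (our own illustration, with arbitrary dimensions): a difference penalty S̄_j for a single P-spline style smooth is embedded as a padded S_j acting on the full coefficient vector, and the identity β^T S_j β = γ^T S̄_j γ is checked numerically.

```python
import numpy as np

# Illustrative sketch (not the paper's code): build a second-difference
# penalty matrix S̄_j for one smooth term, embed it as a zero-padded S_j in
# the full coefficient vector, and combine penalties into S_lambda.

q = 8                                  # basis dimension of the smooth g_j (arbitrary)
D = np.diff(np.eye(q), n=2, axis=0)    # second differences of the coefficients
S_bar = D.T @ D                        # S̄_j: rank q-2; null space = linear functions

p_total = 3 + q                        # 3 parametric coefficients + smooth coefficients
S_j = np.zeros((p_total, p_total))
S_j[3:, 3:] = S_bar                    # S_j is S̄_j padded out with zeroes

rng = np.random.default_rng(0)
beta = rng.standard_normal(p_total)    # full coefficient vector
gamma = beta[3:]                       # coefficients of g_j only

# The padded penalty only touches gamma:
assert np.isclose(beta @ S_j @ beta, gamma @ S_bar @ gamma)

# With several penalties, S_lambda = sum_j lambda_j S_j:
lam = [0.5, 2.0]
S_list = [S_j, np.eye(p_total)]        # second (artificial) penalty, for illustration
S_lam = sum(l * S for l, S in zip(lam, S_list))
print(np.linalg.matrix_rank(S_bar))    # q - 2: two 'completely smooth' null directions
```

The rank deficiency of S̄_j is exactly the 'completely smooth' (here linear) part of the term, which ends up in the fixed-effect null space of S_λ.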
Obviously other simple Gaussian random effect terms can be included in the model in addition to smooth functions. Fellner (1986) developed a simple iteration for updating λ in order to maximize the restricted marginal likelihood of (2), for the special case in which S_λ = Σ_j λ_j I_j, the I_j being identity matrices with most of their diagonal entries zeroed, and no non-zero entries in common between different I_j. Schall (1991) extended this to generalized linear mixed models. Here we first give a simple generalization of the Fellner-Schall method that applies to any model with the structure (2), including smooth additive models in which the smooth terms each have multiple smoothing parameters. We also show why the method improves the restricted marginal likelihood at each step, which is something not revealed by the conventional derivations of the original method. In the additive Gaussian setting our main result is the update formula

λ*_j = σ² [tr(S_λ^- S_j) − tr{(X^T X + S_λ)^{-1} S_j}] / (β̂_λ^T S_j β̂_λ) × λ_j.    (3)

We also consider updates in the case of any model giving rise to a regular likelihood, but with the previously described prior distribution structure on β, resulting in the general update (5) in section 3: generalized linear mixed models are a special case. In practice the update formula is iteratively alternated with evaluation of β̂_λ, given the current λ estimates. The rest of the paper is structured as follows. We first consider the case of Gaussian additive models, deriving a Fellner-Schall type update that can deal with terms with multiple smoothing parameters, using a derivation that shows, by construction, that the update must increase the model's restricted marginal likelihood. We then study the method in the context of updating one smoothing parameter from a model with several smoothing parameters, showing that it takes longer steps than the EM algorithm, or the most obvious acceleration of the EM algorithm, while not overshooting the maximum of the restricted marginal likelihood, at least in the large sample limit.
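The alternation of the update formula with evaluation of β̂_λ can be sketched numerically. The following is our own toy illustration (random design, a single difference penalty, and the scale estimator from the Gaussian derivation), not the authors' software:

```python
import numpy as np

# Minimal sketch of alternating the Fellner-Schall style update for lambda
# with estimation of beta-hat, for the Gaussian model y = X beta + e with
# quadratic penalty beta^T S_lambda beta. Toy data; one penalty for clarity.

rng = np.random.default_rng(1)
n, p = 200, 12
X = rng.standard_normal((n, p))
D = np.diff(np.eye(p), n=2, axis=0)
S = [D.T @ D]                               # list of S_j; several work the same way
M_null = p - np.linalg.matrix_rank(S[0])    # null space dimension of S_lambda
y = X @ np.sin(np.linspace(0, 3, p)) + rng.standard_normal(n)

lam = np.ones(len(S))
XtX, Xty = X.T @ X, X.T @ y
for _ in range(100):
    S_lam = sum(l * Sj for l, Sj in zip(lam, S))
    A = XtX + S_lam
    beta_hat = np.linalg.solve(A, Xty)      # beta-hat given current lambda
    # REML-type scale estimate:
    sigma2 = (np.sum((y - X @ beta_hat) ** 2)
              + beta_hat @ S_lam @ beta_hat) / (n - M_null)
    S_pinv = np.linalg.pinv(S_lam)
    A_inv = np.linalg.inv(A)
    new = np.array([sigma2 * (np.trace(S_pinv @ Sj) - np.trace(A_inv @ Sj))
                    / (beta_hat @ Sj @ beta_hat) * l
                    for l, Sj in zip(lam, S)])
    if np.allclose(new, lam, rtol=1e-8):    # stop once the update has converged
        lam = new
        break
    lam = new
print(lam, sigma2)
```

With a single smoothing parameter tr(S_λ^- S_j) reduces to rank(S_j)/λ_j, recovering the original Fellner-Schall structure; the code uses the general pseudoinverse form throughout.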
The update is then generalized to the case of any Fisher-regular likelihood, at the cost of a large sample approximation borrowed from the PQL method. Finally, we present two simple examples which were not possible with previous Fellner-Schall methods, before returning to the Tweedie location scale and shape model for the Mackerel data.

Why the modified update works
For model (2), the improper log joint density of the data, y, and coefficients, β, can be written as

log f_λ(y, β) = −{‖y − Xβ‖² + β^T S_λ β}/(2σ²) + log|S_λ/σ²|_+ /2 − n log(2πσ²)/2 + c,

where |S_λ|_+ denotes the product of the non-zero eigenvalues of S_λ and we use c to denote a parameter independent constant, which may vary from expression to expression. Following Wood (2011) the log restricted marginal likelihood can conveniently be written as

l_r(λ) = log f_λ(y, β̂_λ) − log|(X^T X + S_λ)/σ²|/2 + c,

where β̂_λ maximizes f_λ(y, β) with respect to β, for a given λ. Expressing the joint density and l_r in this way is the key to straightforwardly obtaining a general update formula. Given that ∂(‖y − Xβ‖² + β^T S_λ β)/∂β|_{β̂_λ} = 0, by definition of β̂_λ, we have

∂l_r/∂λ_j = −β̂_λ^T S_j β̂_λ/(2σ²) + tr(S_λ^- S_j)/2 − tr{(X^T X + S_λ)^{-1} S_j}/2.

If we were to follow the conventional derivation of the Fellner-Schall method, we would now multiply all terms in ∂l_r/∂λ_j by λ_j, and then decide to treat two of these λ_j as fixed at their previous estimate, while one is to be updated. Equating ∂l_r/∂λ_j to zero and re-arranging then gives the update equation. Such an approach does not reveal why the update increases l_r, so we instead give an alternative derivation.
If

tr(S_λ^- S_j) − tr{(X^T X + S_λ)^{-1} S_j} < β̂_λ^T S_j β̂_λ/σ²

then ∂l_r/∂λ_j is negative, indicating that λ_j should be decreased. If the inequality is reversed then ∂l_r/∂λ_j is positive, indicating that λ_j should be increased. If the inequality becomes an equality then ∂l_r/∂λ_j = 0 and λ_j should not be changed. A final requirement of any update is that λ_j should remain positive. A simple update that clearly meets all four requirements is

λ*_j = σ² [tr(S_λ^- S_j) − tr{(X^T X + S_λ)^{-1} S_j}] / (β̂_λ^T S_j β̂_λ) × λ_j,    (3)

with λ*_j set to some pre-defined upper limit if β̂_λ^T S_j β̂_λ is so close to zero that the limit would otherwise be exceeded. Formally ∆ = λ* − λ is an ascent direction for l_r, by Taylor's theorem and the fact that ∆^T ∂l_r/∂λ > 0, unless λ is already a turning point of l_r. To formally guarantee that the update increases l_r requires step length control: for example we use the update δ = ∆/2^k, where k is the smallest integer ≥ 0 such that l_r(λ + δ) > l_r(λ).
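The step halving can be sketched as follows (our own toy illustration: a single penalty, σ² held fixed at 1, and l_r written up to λ-independent constants):

```python
import numpy as np

# Sketch of the step length control described above: the proposed step
# Delta = lambda* - lambda is halved, delta = Delta/2^k, until the restricted
# likelihood l_r increases. l_r below is the Gaussian-case expression up to
# lambda-independent constants, with sigma^2 fixed at 1 for simplicity.

rng = np.random.default_rng(2)
n, p = 150, 10
X = rng.standard_normal((n, p))
D = np.diff(np.eye(p), n=1, axis=0)
S = D.T @ D                               # single penalty, null space dimension 1
y = X @ (0.3 * rng.standard_normal(p)) + rng.standard_normal(n)
XtX, Xty, sigma2 = X.T @ X, X.T @ y, 1.0

def l_r(lam):
    A = XtX + lam * S
    beta = np.linalg.solve(A, Xty)
    quad = np.sum((y - X @ beta) ** 2) + lam * beta @ S @ beta
    logdet_plus = np.sum(np.log(np.linalg.eigvalsh(lam * S)[1:]))  # skip null eigenvalue
    return -quad / (2 * sigma2) + 0.5 * logdet_plus - 0.5 * np.linalg.slogdet(A)[1]

lam = 1.0
A = XtX + lam * S
beta = np.linalg.solve(A, Xty)
lam_star = (sigma2 * (np.trace(np.linalg.pinv(lam * S) @ S)
                      - np.trace(np.linalg.inv(A) @ S))
            / (beta @ S @ beta) * lam)    # update (3): an ascent direction

delta, k = lam_star - lam, 0
while l_r(lam + delta) <= l_r(lam) and k < 30:
    delta, k = delta / 2, k + 1           # delta = Delta / 2^k
print(k, lam + delta)
```

Because the proposed step comes from update (3), it is an ascent direction, so the halving loop must terminate with an increase in l_r.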
Two terms in the update have the potential to be of O(p³) floating point cost, but tr{(X^T X + S_λ)^{-1} S_j} can re-use the Cholesky factor of X^T X + S_λ, which is anyway required to estimate β̂_λ, while the block diagonal nature of S_λ means that in reality tr(S_λ^- S_j) has O(q_j³) computational cost, where q_j is the number of coefficients affected by S_j and is typically far fewer than p. Under the conditions of the original Fellner-Schall proposal, tr(S_λ^- S_j) = rank(S_j)/λ_j, and we recover exactly the Fellner-Schall update, albeit with a slightly more computationally tractable expression. The update relies on the following result, which is the key to the generalization beyond singly penalized smooth terms.
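The Cholesky re-use point can be illustrated with a small numerical sketch (ours; the sizes and stand-in penalty are arbitrary):

```python
import numpy as np

# Sketch of the computational point above: tr{(X^T X + S_lambda)^{-1} S_j}
# can re-use a Cholesky factor of X^T X + S_lambda that is computed anyway
# when estimating beta-hat, avoiding a fresh matrix inverse.

rng = np.random.default_rng(3)
n, p = 100, 30
X = rng.standard_normal((n, p))
S_lam = np.diag(rng.uniform(0.1, 1.0, p))          # stand-in penalty matrix
A = X.T @ X + S_lam
L = np.linalg.cholesky(A)                          # factor reused from the beta-hat solve

# S_j affecting only a block of q_j = 10 coefficients, written as C C^T:
C = np.zeros((p, 10))
C[10:20] = np.eye(10)
S_j = C @ C.T

# tr(A^{-1} S_j) = tr(C^T A^{-1} C) = ||L^{-1} C||_F^2
tmp = np.linalg.solve(L, C)                        # triangular solve per column
tr_chol = np.sum(tmp ** 2)

tr_direct = np.trace(np.linalg.inv(A) @ S_j)       # naive version, for comparison
assert np.isclose(tr_chol, tr_direct)
print(tr_chol)
```

Only q_j triangular solves against the existing factor are needed, which is where the saving over forming (X^T X + S_λ)^{-1} comes from.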
Theorem 1. Let B be a positive definite matrix and S_λ a positive semi-definite matrix parameterized by λ, with a null space that is independent of the value of λ. Let the positive semi-definite matrix S_j denote the derivative of S_λ with respect to λ_j. Then

tr(S_λ^- S_j) > tr{(B + S_λ)^{-1} S_j}.

By the conditions of the theorem the null space of S_λ is independent of λ. Since all the D_ii in the summations are positive, by the positive semi-definiteness of S_λ and the definition of M, the terms in the second summation are each smaller than the corresponding terms in the first, and the result is proved.
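A quick numerical spot check of the trace inequality (our own, with an arbitrary positive definite B and a single difference penalty) is easy to run:

```python
import numpy as np

# Numerical spot check of the inequality in Theorem 1:
# tr(S_lam^- S_j) > tr{(B + S_lam)^{-1} S_j} for positive definite B and
# positive semi-definite S_lam, S_j.

rng = np.random.default_rng(6)
p = 15
Z = rng.standard_normal((p, p))
B = Z.T @ Z + 0.1 * np.eye(p)            # positive definite (e.g. X^T X)
D = np.diff(np.eye(p), n=2, axis=0)
S_j = D.T @ D                            # positive semi-definite, rank p - 2
lam_j = 0.7
S_lam = lam_j * S_j                      # single-penalty case for simplicity

lhs = np.trace(np.linalg.pinv(S_lam) @ S_j)    # = rank(S_j)/lam_j here
rhs = np.trace(np.linalg.solve(B + S_lam, S_j))
assert lhs > rhs > 0
print(lhs, rhs)
```

In the update, B plays the role of X^T X, so the theorem is exactly what guarantees that the numerator of (3), and hence λ*_j, stays positive.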
The variance parameter σ² also has to be estimated, but by setting the derivative of l_r with respect to σ² to zero and solving we obtain

σ̂² = (‖y − Xβ̂_λ‖² + β̂_λ^T S_λ β̂_λ)/(n − M_p),

where M_p is the dimension of the null space of S_λ.
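That σ̂² zeroes ∂l_r/∂σ² can be verified numerically. The following sketch (ours, toy data) assembles the Gaussian l_r as a function of σ² and checks the derivative at σ̂² by central differencing:

```python
import numpy as np

# Numerical check that sigma2-hat = (||y - X beta||^2 + beta^T S_lam beta)/(n - M_p)
# zeroes the derivative of the Gaussian restricted likelihood l_r with respect
# to sigma^2, where M_p is the null space dimension of S_lambda.

rng = np.random.default_rng(4)
n, p = 120, 8
X = rng.standard_normal((n, p))
D = np.diff(np.eye(p), n=2, axis=0)
S_lam = 2.0 * (D.T @ D)                   # fixed smoothing parameter
M_p = p - np.linalg.matrix_rank(S_lam)
A = X.T @ X + S_lam
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
beta = np.linalg.solve(A, X.T @ y)
quad = np.sum((y - X @ beta) ** 2) + beta @ S_lam @ beta
rank = p - M_p

def l_r(s2):                              # l_r as a function of sigma^2 only
    return (-quad / (2 * s2) - 0.5 * n * np.log(s2)
            + 0.5 * (np.sum(np.log(np.linalg.eigvalsh(S_lam)[M_p:])) - rank * np.log(s2))
            - 0.5 * (np.linalg.slogdet(A)[1] - p * np.log(s2)))

s2_hat = quad / (n - M_p)
eps = 1e-6
grad = (l_r(s2_hat + eps) - l_r(s2_hat - eps)) / (2 * eps)
print(s2_hat, grad)                       # grad should be near zero
```

The n − M_p divisor arises because the log σ² terms from the likelihood, the prior normalization and the Laplace determinant contribute −n/2, −(p − M_p)/2 and +p/2 respectively.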

Comparison with the EM algorithm and Newton optimization
The update (3) can be viewed as a crude approximation to an EM update (Dempster et al., 1977). Specifically, the EM Q-function for model (2) is the expectation of the log joint density, E_{β|y,λ′} log f_λ(y, β), where λ′ denotes the smoothing parameters from the previous iteration, and its maximization with respect to λ gives the EM update. In fact update (3) systematically makes larger changes to λ than the EM update, as illustrated in Figure 2. For insight into why this happens consider updating a single λ_j relating to a block λ_j S_j of S_λ, so that tr(S_λ^- S_j) = k/λ_j, where k = rank(S_j). Then defining γ = tr{(X^T X + S_λ′)^{-1} S_j} and b = β̂_λ′^T S_j β̂_λ′/σ², (3) seeks λ_j to solve k/λ_j = b + γλ′_j/λ_j, whereas an EM step seeks λ_j to solve k/λ_j = b + γ. If k/λ′_j > b + γ then λ_j has to be increased from λ′_j under either update. It has to be increased by more under (3), because γλ′_j/λ_j decreases monotonically from γ as λ_j increases from λ′_j. A similar argument shows that if k/λ′_j < b + γ then the required reduction in λ_j is larger under (3) than under EM. Figure 3 shows the root finding problem corresponding to the EM update as a dashed curve, and that corresponding to update (3) as a solid curve, for the same problem illustrated in Figure 2. Figure 2 also illustrates the equivalent problem for the restricted marginal likelihood itself, which can be viewed as solving the same problem as the EM update, but with both b and γ being functions of λ: the dependence of b on λ is indirect via β̂_λ, but the dependence of γ is direct. This suggests using an accelerated EM update seeking to solve

k/λ_j = b + γ(λ_j), where γ(λ_j) = tr{(X^T X + S_λ)^{-1} S_j}.

This obviously makes longer steps than the original EM update, as is illustrated by the dashed curve in Figure 3. Update (3) also results in longer update steps than this accelerated EM step, as Figure 3 suggests and the following demonstrates.
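The two root finding problems have closed form solutions, so the comparison can be checked directly (our own illustration, with arbitrary values of k, b, γ and λ′):

```python
# Tiny numerical illustration of the root finding comparison above: an EM
# step solves k/lambda = b + gamma, while update (3) solves
# k/lambda = b + gamma * lambda'/lambda. All constants here are arbitrary.

k, b, gamma, lam_prev = 5.0, 1.0, 2.0, 1.0    # here k/lambda' = 5 > b + gamma = 3

lam_em = k / (b + gamma)                      # EM root
lam_fs = (k - gamma * lam_prev) / b           # root of k/lambda = b + gamma*lam_prev/lambda

assert lam_prev < lam_em < lam_fs             # both increase lambda; (3) steps further
print(lam_em, lam_fs)
```

With these values the EM step moves λ from 1 to 5/3, while update (3) moves it to 3, illustrating the systematically longer steps.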
Theorem 2. Consider updating a single λ_j corresponding to a diagonal block λ_j S_j of S_λ. Update (3) takes a longer step than the equivalent accelerated EM update.

Proof. Under the stated conditions tr(S_λ^- S_j) = k/λ_j, so to prove the result it suffices to prove that α′(λ_j) > α(λ_j) when λ_j > λ′_j and α′(λ_j) < α(λ_j) when λ_j < λ′_j. Now let S_{−j} = Σ_{i≠j} λ_i S_i, and let B be any matrix such that B^T B = S_{−j}. Consider the QR decomposition (X^T, B^T) = R^T Q^T and form the symmetric eigen-decomposition R^{−T} S_j R^{−1} = UΛU^T, where Λ is the diagonal matrix of eigenvalues Λ_i.

Taking longer steps than a plain or accelerated EM algorithm would be of limited utility if those steps overshot the maximum of the restricted likelihood and required repeated step length control, especially when close to the optimum. In practice such overshoot does not occur. The following theorem offers some insight into the reasons, albeit only asymptotically.
We assume infill asymptotics and require two technical assumptions. Assumption 1: if Q_1 denotes the first n rows of Q from theorem 2 and a = U^T Q_1^T y, then a_i² = O_p(n^{β_i}) for some β_i > 0. The assumption is less obscure than it at first appears. To see this, first consider the very mild assumption that the model is sufficiently reasonable that y^T µ̂_0 = O_p(n), where µ̂_0 = Xβ̂ when λ_j = 0. In fact µ̂_0 = Q_1 UU^T Q_1^T y, and so a^T a = y^T µ̂_0 and mean(a_i²) = O_p(n/p), where p is the fixed dimension of a. Now let C = Q_1 U, so that µ̂_0 = CC^T y = Σ_{i=1}^p C_{·i} C_{·i}^T y. Thus µ̂_0 can be decomposed into p components, µ̂_0 = Σ_i µ̂_i, where µ̂_i = C_{·i} C_{·i}^T y. The assumption that y^T µ̂_i = O_p(n) is essentially equivalent to assuming that no model component is orthogonal to E(y), but since a_i = C_{·i}^T y, this assumption is equivalent to a_i² = O_p(n). So Assumption 1 is reasonable, and in most cases we expect β_i = 1. Assumption 2: in the notation of theorem 2, λΛ_i = O_p(n^{α_i}), where α_i is an unknown real constant. This simply assumes that each λΛ_i has some polynomial dependence on n, but not that we know what it is.
Theorem 3. Let the setup be as in theorem 2, and let λ̂_j denote the maximizer of the restricted likelihood with respect to λ_j. Given assumptions 1 and 2, and for an initial λ_j sufficiently close to λ̂_j, the update, λ*_j, given by (3) is either between λ_j and λ̂_j, or tends to λ̂_j as n → ∞.
Proof. Dropping the subscript j, let ρ = log λ, and let λ denote the j-th smoothing parameter at the start of the updates. Consider again the root finding problems equivalent to the update (3) and to maximization of the restricted marginal likelihood. Applying Taylor's theorem to the components of these root finding problems, with the derivatives evaluated at the initial value, ρ, it follows that for λ sufficiently close to λ̂, if γ(λ) ≤ δ(λ) = −(dγ/dρ + db/dρ), then λ < λ* ≤ λ̂. Also, if λγ(λ) → 0 and λδ(λ) → 0 as n → ∞, then |ρ* − ρ̂| → 0. Now consider the actual behaviour of γ(λ) and δ(λ). Using the QR and eigen-decomposition steps of theorem 2, some routine manipulation shows that the i-th term of δ will be larger than the i-th term of γ if λΛ_i > (2a_i² − 1)^{-1}: if λΛ_i = O_p(n^{α_i}) this dominance occurs in the n → ∞ limit when α_i > −β_i. Furthermore if α_i < −β_i/2 then the i-th terms of γλ and δλ both tend to zero in the large sample limit. So in the large sample limit, sufficiently close to λ̂, there are only two non-exclusive possibilities: γ(λ) < δ(λ), so that λ* lies between λ and λ̂, and/or λδ, λγ → 0, so that |λ* − λ̂| → 0.
The solution of the linearised root finding problem corresponding to the restricted likelihood maximisation is the Newton method update. So a corollary of theorem 3 is that iteration of update (3) will generally converge more slowly than Newton's method, when close to the optimum, and certainly no faster.

Beyond the linear Gaussian case
Now consider replacing the Gaussian log likelihood with another log likelihood, l, meeting the Fisher regularity conditions, so that the improper log joint density becomes log f_λ(y, β) = l(β) − β^T S_λ β/2 + log|S_λ|_+ /2 + c, and in the large sample limit β|y ∼ N(β̂_λ, V_λ), where V_λ^{-1} = H_λ or E(H_λ) and H_λ = −∂²l/∂β∂β^T + S_λ. Newton's method can be used to find β̂_λ, with the usual modifications to guarantee convergence (e.g. Wood, 2015, §5.1.1). Following Wood et al. (2016) the log Laplace approximate marginal likelihood in this case is conveniently expressed as

l_r(λ) = log f_λ(y, β̂_λ) − log|H_λ|/2 + c.

Defining H = −∂²l/∂β∂β^T, we have

∂l_r/∂λ_j = −β̂_λ^T S_j β̂_λ/2 + tr(S_λ^- S_j)/2 − tr{H_λ^{-1}(S_j + ∂H/∂λ_j)}/2.

The direct dependence of ∂²l/∂β∂β^T on λ_j is inconvenient. However the PQL and performance oriented iteration methods for λ estimation of Breslow and Clayton (1993) and Gu (1992) both neglect the dependence of ∂²l/∂β∂β^T on λ, on the basis that it anyway tends to zero in the large sample limit. If we follow these precedents then the development follows the Gaussian case and the update is

λ*_j = [tr(S_λ^- S_j) − tr(V_λ S_j)] / (β̂_λ^T S_j β̂_λ) × λ_j.    (5)

If ∂²l/∂β∂β^T is independent of λ at finite sample size, as is the case for some distribution-link function combinations in a generalized linear model setting, then the update is guaranteed to increase l_r under step size control, but otherwise this is not the case, and in practice the λ estimate no longer exactly maximizes l_r. Theorem 1, required to guarantee that λ*_j > 0, will hold if V_λ is based on the expected Hessian of the negative log likelihood, but if it is based on the observed Hessian, then this must be positive definite for the theorem to hold. Hence, if the observed Hessian is not positive definite then the expected Hessian, or a suitable nearest positive definite matrix to the observed Hessian, should be substituted.
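A toy sketch of the general update (ours, not the paper's software) for a penalized Poisson log-link model, where the observed and expected Hessians coincide, so the Theorem 1 guarantee applies directly:

```python
import numpy as np

# Sketch of update (5) for a non-Gaussian Fisher-regular likelihood: a
# penalized Poisson log-link model. Newton steps for beta-hat alternate with
#   lambda_j* = lambda_j {tr(S_lam^- S_j) - tr(V_lam S_j)} / (beta^T S_j beta),
# where V_lam = (H + S_lam)^{-1} and H = X^T diag(mu) X for this family/link.

rng = np.random.default_rng(5)
n, p = 300, 10
X = rng.standard_normal((n, p)) * 0.3       # scaled to keep the linear predictor modest
D = np.diff(np.eye(p), n=2, axis=0)
S = [D.T @ D]                               # one penalty, for simplicity
y = rng.poisson(np.exp(X @ np.sin(np.linspace(0, 3, p))))

lam = np.ones(len(S))
beta = np.zeros(p)
for _ in range(30):                         # outer loop over lambda updates
    S_lam = sum(l * Sj for l, Sj in zip(lam, S))
    for _ in range(20):                     # penalized Newton iteration for beta-hat
        mu = np.exp(X @ beta)
        H = X.T @ (mu[:, None] * X)
        beta = beta + np.linalg.solve(H + S_lam, X.T @ (y - mu) - S_lam @ beta)
    mu = np.exp(X @ beta)
    V = np.linalg.inv(X.T @ (mu[:, None] * X) + S_lam)
    S_pinv = np.linalg.pinv(S_lam)
    lam = np.array([l * (np.trace(S_pinv @ Sj) - np.trace(V @ Sj))
                    / (beta @ Sj @ beta) for l, Sj in zip(lam, S)])
print(lam)
```

Only the first and second derivatives of the log likelihood appear, which is precisely the practical advantage claimed above over methods needing third and fourth derivatives.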
As in the Gaussian case a link to the EM update can again be established via an approximate Q function, obtained by taking a second order Taylor expansion of l around β̂_λ, and using the large sample distribution of β|y. The final term of the resulting expression is then neglected, again following the PQL type assumption.
In the case of a penalized generalized linear model, the general update (5) becomes

λ*_j = φ [tr(S_λ^- S_j) − tr{(X^T W X + S_λ)^{-1} S_j}] / (β̂_λ^T S_j β̂_λ) × λ_j,

where W is the diagonal matrix of weights at convergence of the usual penalized iteratively re-weighted least squares iteration used to find β̂_λ, and φ is the scale parameter, which can be estimated using the obvious equivalent of σ̂². Again, under the restrictions of the original Fellner-Schall method, this update corresponds to the Schall update for the generalized linear mixed model case.

Simple examples
This section presents two brief example applications of the generalized Fellner-Schall method developed here, which would be impossible or impractically slow with previously published versions of the method. The first example is a simple Gaussian adaptive smooth of the motorcycle data from Silverman (1985), available in the MASS package (Venables and Ripley, 2002) in R (R Core Team, 2014). The data are accelerations of the head of a crash test dummy against time. An adaptive smooth as described in Wood (2011) is appropriate for smoothing the acceleration data against time, with the degree of penalization of a P-spline (Eilers and Marx, 1996) itself allowed to vary smoothly with time. The resulting smoother has five smoothing parameters with the penalties acting on overlapping subsets of the 40 model coefficients, thereby violating the structural conditions on S_λ required by previously published Fellner-Schall iterations. The smoothing parameter optimization problem is relatively challenging, as the smoothing parameters are only weakly identified from this relatively small dataset.
The smooth was estimated using the method presented here and by the method of Wood (2011) using quasi-Newton optimization of the restricted marginal likelihood. In this way both methods have the same leading order computational cost per iteration, facilitating comparison. Full Newton optimization is more costly per iteration, but would require fewer iterations than quasi-Newton. Starting from all smoothing parameters set to 1, and without step length control, the new method converged in 39 steps, as against 32 for the quasi-Newton method. The fits are identical to graphical accuracy with equal effective degrees of freedom of 12.22. See Figure 4.
The second example is a Cox proportional hazards model for time to recurrence of colon cancer for n = 929 patients in a chemotherapy trial (Moertel et al., 1995) available in the survival package (Therneau, 2015) in R. In principle it is possible to use previously published Fellner-Schall methods for this example, by using a trick involving Poisson regression on artificially replicated data, but this entails an O(n) multiplication of the computational cost, which is impractically uncompetitive with existing methods. With the update (5) the cost is kept at the O(np 2 ) that is appropriate for Cox regressions.
The linear predictor for the Cox regression had parametric effects for whether the colon was perforated or not, obstructed or not and whether the tumour had adhered to neighbouring organs. In addition a 3 level factor indicated the control group, treatment with one drug of interest or treatment with a drug combination. Smooth effects of age were included separately for males and females along with a smooth effect for number of affected lymph nodes. For this example the new iteration, without step length control, converged in 15 steps, compared to 16 steps for direct quasi-Newton optimization using the methods of Wood et al. (2016). The parametric model coefficients differ only in the 4th significant digit, while differences in the estimated smooth effects are also small, as shown in Figure 5.

A Tweedie location, scale and shape model for Mackerel
We now return to the motivating example, from the introduction, of modelling mackerel (Scomber scombrus) egg densities from survey data collected off the west coast of Europe in 2010. The data consist of counts of eggs in samples taken from the water column at the sampling stations shown in Figure 1. Available covariates are temperature and salinity at 20 m depth, water volume sampled (an offset), spatial location as longitude and latitude (converted to km east and km north), the identity of the ship collecting the data, and the sea bed depth. The latter is important as mackerel prefer to spawn near the continental shelf edge, which occurs at a depth contour of about 200m.
A common theme with data of this type is that the counts are highly over-dispersed relative to a Poisson distribution, but with a mean-variance relationship that is less extreme than that suggested by a negative binomial distribution (see e.g. Wood, 2006, §5.4.1). A Tweedie (1984) distribution often offers a much better characterisation of the distribution, but it would often be useful to allow the shape and scale parameters of the Tweedie distribution to vary with covariates, rather than only allowing covariates for the mean. Specifically, the Tweedie distribution assumes that the variance of random variable y_i is related to its mean, µ_i, via var(y_i) = φ_i µ_i^{p_i}, where φ_i and p_i usually take one fixed value for all i. For the mackerel data it would be useful to allow p_i and φ_i to be smooth functions of covariates, particularly sea bed depth.
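The over-dispersion point can be illustrated numerically (our own sketch, with arbitrary parameter values) by comparing the three variance functions:

```python
import numpy as np

# Illustration of the mean-variance relationships discussed above: the
# Tweedie variance phi*mu^p with 1 < p < 2 is super-linear in mu, but grows
# more slowly for large mu than the negative binomial's quadratic variance.
# phi, p and theta below are arbitrary example values.

mu = np.array([1.0, 5.0, 20.0])
phi, p_tw, theta = 1.5, 1.6, 2.0

var_pois = mu                          # Poisson: var = mu
var_nb = mu + mu ** 2 / theta          # negative binomial: quadratic in mu
var_tw = phi * mu ** p_tw              # Tweedie, here p = 1.6

print(var_pois, var_tw, var_nb)        # at mu = 20: 20 < ~181 < 220
```

So for counts that are over-dispersed but less so than a quadratic mean-variance relationship implies, the Tweedie family with 1 < p < 2 sits naturally between the two.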
In particular we would like to estimate a model, (6), in which log µ_i, h(p_i) and log φ_i are each given by a sum of smooth functions g_k of covariates, where h is a known link function designed to keep 1 < p < 2, s(i) indicates which ship collected sample i and the b_{s(i)} are independent N(0, σ_b²) ship random effects. We represented the spatial effect using a rank 150 Duchon spline with first order derivative penalisation (see Duchon, 1977; Miller and Wood, 2014), and the other terms with rank 10 cubic penalised regression splines. The model can be estimated, given smoothing parameters, using the Newton iteration detailed in Wood et al. (2016) and available in R package mgcv. However the estimation of smoothing parameters using Wood et al. (2016) would require third and fourth derivatives of the Tweedie density, and these are not readily available, for the reasons given in the introduction. We therefore estimated the smoothing parameters using the iterative update (5).
Estimation converged in 13 iterations taking 17 seconds (single core of a mid range laptop computer). In comparison it took 11 seconds to fit the same model, but with fixed p and φ, using the method of Wood et al. (2016) in R package mgcv. The AIC for model (6) was 180 lower than for the fixed p and φ version, although residual plots (not shown) are reasonable for both models. The estimated spatial smoother is shown in Figure 1b, while the remaining effects are plotted in Figure 6. Notice how the smooth effects of sea depth all have a pronounced peak at around √200, corresponding to the edge of the continental shelf. Both egg density and its variability appear to peak near the shelf edge.

Figure 6: Estimated smooth effects for the Tweedie location scale and shape model of the mackerel egg survey data discussed in section 5. Panel c shows a QQ-plot for the predicted ship level random effects. Panels d, e and f are the smooth effects of sea depth for µ, p and φ respectively. Notice how they all have a peak close to √200, the depth representing the continental shelf edge. The shaded regions are approximate 95% confidence intervals.
Discussion

Our generalization beyond the Gaussian case rests on the same large sample argument as the PQL and performance oriented iteration methods, and neglects the dependence of the Hessian of the log likelihood on the smoothing parameters.
As we demonstrated in section 5, our generalized Fellner-Schall method can be applied to cases in which alternative estimation methods would be very difficult to implement, but it also offers advantages in settings which are in principle less numerically taxing. The method can be applied to non-standard smooth models provided that we can obtain the first and second derivatives of the log-likelihood, which are anyway required for Newton optimization of the model coefficients. This greatly simplifies the process of implementing non-standard models for particular applied problems, freeing the modeller from the more onerous aspects of implementation to concentrate on development of the model itself. To gain insight into the effort saved, the reader might care to compare the expressions for the fourth order and second order derivatives of the generalized extreme value distribution, for example.
Finally, an interesting question raised by the work here is whether it is possible to reduce the implementation cost even further by replacing the Hessian of the log-likelihood in the update by a quasi-Newton approximation, thereby allowing coefficients to be estimated by quasi-Newton methods, and only requiring first derivatives of the log-likelihood.