Identifiability constraints in generalized additive models

Identifiability constraints are necessary for parameter estimation when fitting models with nonlinear covariate associations. The choice of constraint affects the standard errors of the estimated curve. Centring constraints are often applied by default because they are thought to yield the lowest standard errors of any constraint, but this claim has not previously been investigated. We show that whether centring constraints are optimal depends on the response distribution and the parameterization, and that for natural exponential family responses under the canonical parameterization, centring constraints are optimal only for the Gaussian response.


INTRODUCTION
A generalized additive model (GAM) is a regression model suitable for making inferences about nonlinear covariate associations. Introduced by Hastie & Tibshirani (1986, 1990) and further developed by Wood (2003, 2004, 2011), Wood, Pya & Säfken (2016) and Wood (2017), GAMs have become a core component of modern regression modelling. Unlike linear models, GAMs are able to model associations between covariates and a response variable that take a wide range of shapes, and these shapes do not need to be specified in advance, leading to a flexible and useful class of regression models. See Figure 1 for an example from air pollution epidemiology, where the associations of mortality with time and air pollution exhibit nonlinear patterns with very different estimated shapes, and there is no plausible way to specify these shapes in advance. However, owing to their flexibility, GAMs in their basic form are over-parameterized, and the associations are not all identifiable without some sort of additional constraint. Despite the popularity of GAMs in modern literature and practice, the question of which constraints lead to identifiable models, and their effect on parameter estimation, specifically on standard errors, has not been addressed. In this article, we focus on identifiability and parameter constraints in GAMs.
In some cases, the name "GAM" may refer specifically to the case in which $y_i$ follows an exponential family distribution with natural parameter $\eta_i$ and mean $\mu_i = \mathrm{E}(y_i \mid x_i)$ such that $\eta_i = h(\mu_i)$. We consider this important scenario in Section 3. However, the following discussion of parameter identifiability is not unique to exponential families.
The imposition of a linear constraint changes the model being fit, and therefore affects subsequent inferences. All constraints lead to identical fitted values of $\mu_i$, and to point estimates of $f_j$ that differ only up to additive constants. However, different constraints lead to different standard errors of the estimated $f_j$. Wood, Scheipl & Faraway (2013, p. 349) suggest that centring constraints (Section 2.3) should produce the smallest standard errors, a claim that is repeated in the documentation for the popular mgcv software for fitting GAMs (Wood, 2015), where centring constraints are imposed by default.
In this article, we show that whether or not centring constraints yield optimal standard errors depends on the parameterization of the model (Theorem 1 in Section 3). Corollary 1 in Section 3 establishes the more practical conclusion that, in the typical case of a GAM with a natural exponential family response having mean $\mu_i$, the Gaussian distribution is the only distribution for which centring constraints are optimal. However, the simulation results that we report in Section 4 suggest that the nonoptimality of centring constraints in non-Gaussian natural exponential family GAMs is not a practical concern for binary and Poisson data, except in the small-sample binary case.
A constraint must be placed on each $f_j$ (Section 2.3) during estimation (Section 2.2). Any constraint that prevents the addition of arbitrary constants to $f_j$ results in an identifiable model and yields identical fitted values for $\mu_i$ (Wood, Scheipl & Faraway, 2013, Section 4). Prominent examples include "centring" or "sum-to-zero" constraints $\int f_j(x)\,\mathrm{d}x = 0$ and "point" constraints $f_j(0) = 0$. However, different constraints lead to different vertical scalings of $f_j$, and hence to different standard errors of the estimates.

Low-rank Penalized Regression
Inferences about the infinite-dimensional $f_j$ are facilitated by a basis function approximation $f_j(x) \approx b_{j1}(x)\beta_{j1} + \cdots + b_{jk_j}(x)\beta_{jk_j}$, where the $b_{jl}(\cdot)$ are known basis functions and $\beta = (\beta_j)_{j \in \{1,\ldots,J\}} \in \mathbb{R}^k$, $\beta_j = (\beta_{jl})_{l \in \{1,\ldots,k_j\}} \in \mathbb{R}^{k_j}$ are $k = k_1 + \cdots + k_J$ unknown coefficients. We require that constant functions are reproducible by the basis functions, so that for any $x$ in the interior of the joint support of $b_{j1}(\cdot), \ldots, b_{jk_j}(\cdot)$, assumed to be a nonempty interval in $\mathbb{R}$, $\sum_{l=1}^{k_j} b_{jl}(x) = 1$. This is satisfied by splines (de Boor, 2001, p. 88), and is a reasonable general requirement of a basis. We further assume that the support of the basis functions covers the range of the observed covariates, so that $b_{j1}(x_{ij}) + \cdots + b_{jk_j}(x_{ij}) = 1$ for all $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, J\}$. Inferences about $(\alpha, \beta)$ are based on maximum penalized likelihood (Section 2.2), yielding estimates $(\hat\alpha, \hat\beta)$ and hence estimates $\hat f_j$. The requirement that the $f_j$ are identifiable (Section 2.1) is replaced in model fitting by the requirement that $(\alpha, \beta)$ are identifiable (Proposition 1). This identifiability depends on the model, the observed covariates and also the penalty used in the maximum penalized likelihood algorithm.
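As a minimal illustration in R (using the base splines package; the basis dimension and simulated covariates are ours, for illustration only), a cubic B-spline basis with the intercept column retained reproduces constant functions: the basis functions sum to one at every evaluation point inside the boundary knots.

    library(splines)
    set.seed(1)
    x <- sort(runif(100))
    B <- bs(x, df = 10, intercept = TRUE)  # n x k B-spline basis matrix, k = 10
    range(rowSums(B))                      # both values equal 1: constants reproducible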
Let $S_j \in \mathbb{R}^{k_j \times k_j}$ be a symmetric, positive semi-definite penalty matrix, having $\operatorname{rank}(S_j) = r_j \le k_j - 1$ and $S_j 1_{k_j} = 0_{k_j}$, where $1_{k_j}$ and $0_{k_j}$ are $k_j$-dimensional vectors of 1s and 0s, respectively. Restricting attention to penalties with $S_j 1_{k_j} = 0_{k_j}$ is fundamentally important in the present discussion of identifiability in the presence of an intercept. It means that constant functions, that is, $f_j(x) = a$ for all $x \in \mathbb{R}$ and some $a \in \mathbb{R}$, and hence the intercept $\alpha$, are not penalized, giving $S_j$ an interpretation as a "roughness" penalty; see Green & Silverman (1994, Section 1.2.2) for a motivating discussion. The full penalty matrix is the block-diagonal matrix $S = \operatorname{blockdiag}(S_1, \ldots, S_J)$. Let $y_i \sim p(y_i; \eta_i, \phi)$ independently for $i = 1, \ldots, n$, with densities depending on the parameters of interest $\eta_i$, as well as potentially on additional shared parameters $\phi \in \mathbb{R}^s$. Define the quadratic penalty function $P_\lambda(\beta) = \frac{1}{2}\sum_{j=1}^J \lambda_j \beta_j^\top S_j \beta_j$, with $\lambda_j > 0$ and $\lambda = (\lambda_1, \ldots, \lambda_J)$. This is referred to as a ridge penalty, and is analogous to placing independent (improper) Gaussian priors on $\beta_j$ in a Bayesian context. A penalized likelihood for $(\alpha, \beta)$ is $\ell_\lambda(\alpha, \beta; \phi, \mathbf{y}) = \sum_{i=1}^n \ln p(y_i; \eta_i, \phi) - P_\lambda(\beta)$. The smoothing parameters $\lambda$ are estimated by marginal likelihood or generalized cross-validation (Wood, 2011), and the invariance of their estimation to the choice of constraint depends on the estimation method (Wood, Scheipl & Faraway, 2013). However, estimation of $\lambda$ does not affect identifiability, and we treat $\lambda$ as known for the remainder of the article.
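As a concrete sketch of such a penalty (an arbitrary basis dimension, not from the paper), a second-order difference penalty satisfies both requirements above: $S_j 1_{k_j} = 0_{k_j}$, so constants are unpenalized, and $\operatorname{rank}(S_j) = k_j - 2 \le k_j - 1$.

    k  <- 10
    D2 <- diff(diag(k), differences = 2)  # (k - 2) x k second-difference operator
    S  <- crossprod(D2)                   # penalty matrix: beta' S beta penalizes roughness
    max(abs(S %*% rep(1, k)))             # ~0: S 1 = 0, constant functions unpenalized
    qr(S)$rank                            # k - 2: S is rank deficient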
Estimation should proceed by maximizing $\ell_\lambda(\alpha, \beta; \phi, \mathbf{y})$, and deriving standard error estimates using its inverse curvature at the mode. However, Proposition 1 establishes that $\ell_\lambda(\alpha, \beta; \phi, \mathbf{y})$ does not uniquely identify $(\alpha, \beta)$.
The proof of Proposition 1 is found in the Appendix to this article. A constraint is required to make $(\alpha, \beta)$ identifiable. This constraint will change the likelihood and its curvature, and hence the standard errors of the estimates, in a manner that depends on the constraint chosen. Form the eigendecomposition $S = U \Lambda U^\top$, where $U = (U_R : U_F)$ is orthonormal and $\Lambda$ is diagonal with the positive eigenvalues of $S$ in its leading $\operatorname{rank}(S)$ positions, and define $\beta_R = U_R^\top \beta$ and $\beta_F = U_F^\top \beta$.
This decomposition splits $\beta$, and hence $f_j$, into a "penalized" part $\beta_R$ and an "unpenalized" part $\beta_F$. The latter can be interpreted as representing an underlying polynomial component of $f_j$, and the former as a deviation from this polynomial. The rank-deficiency of $S$ implies that only the deviation from the polynomial is penalized, and the decomposition makes this implication explicit; see Wood (2017, Section 6.6.1) for further details.
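A small R sketch of this split for the difference penalty constructed above (our example): the null space of $S$ is spanned by constant and linear coefficient vectors, the polynomial component, while the remaining eigenvectors carry the penalized deviation.

    k <- 10
    S <- crossprod(diff(diag(k), differences = 2))
    e <- eigen(S, symmetric = TRUE)
    r <- sum(e$values > 1e-8)               # rank(S) = k - 2
    U_R <- e$vectors[, 1:r]                 # penalized directions
    U_F <- e$vectors[, (r + 1):k]           # unpenalized (null space) directions
    qr(cbind(U_F, rep(1, k), 1:k))$rank     # 2: U_F spans the constant and linear vectors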
The vector form of the model is now $\eta = \alpha 1_n + B^R \beta_R + B^F \beta_F$, where $B^R = B U_R$ and $B^F = B U_F$, and the penalized likelihood is $\ell_\lambda(\alpha, \beta_R, \beta_F; \phi, \mathbf{y}) = \sum_{i=1}^n \ln p(y_i; \eta_i, \phi) - \frac{1}{2}\beta_R^\top \Lambda_R \beta_R$, where $\Lambda_R$ contains the positive eigenvalues of the weighted penalty. The penalized parameter $\beta_R$ is always identifiable but, analogous to an unpenalized (generalized) linear model, the components of $(\alpha, \beta_F)$ are identifiable only if $(1_n : B^F)$ has full column rank, which fails whenever $1_n$ lies in the range space of $B^F$.

Linear Constraints
Constraints are imposed on $f_j$ by imposing constraints on $\beta_j$ during estimation. Linear constraints of the form $c_j^\top \beta_j = 0$, $c_j \in \mathbb{R}^{k_j}$, are preferred because: (a) they can be implemented explicitly by transforming $B_j$, without requiring constrained optimization techniques, and hence are compatible with the substantial literature on optimization for fitting generalized additive models (Wood, 2011; Wood, Pya & Säfken, 2016); and (b) the standard errors of $\hat\beta$ estimated subject to linear constraints are as straightforward as those of the unconstrained $\hat\beta$, whereas the theory of general constrained maximum likelihood estimators is more complicated (Aitchison & Silvey, 1958), even more so when some parameters in the unconstrained model are unidentifiable (Luo et al., 2016).
Placing a constraint on a parameter reduces the complexity of the model being fit, and a constrained parameter can be expected to be estimated more precisely than an unconstrained one. In the extreme cases, an unconstrained, unidentifiable parameter has an estimate with infinite variance, while a parameter fully constrained to equal a point is estimated with zero variance. The two constraints that appear prominently in applications and software involving generalized additive models are (a) "centring", also called "sum-to-zero", constraints, $c_j = B_j^\top 1_n$, corresponding to $\sum_{i=1}^n f_j(x_{ij}) = 0$, and (b) "point" constraints, $c_j = (b_{j1}(x_0), \ldots, b_{jk_j}(x_0))^\top$, corresponding to $f_j(x_0) = 0$. It is of practical interest to ask which constraint, out of all possible constraints, leads to the smallest variance. Wood, Scheipl & Faraway (2013, Section 4) suggest that the sum-to-zero constraint should provide the narrowest confidence bands of any constraint. We show in Section 3 that: (a) for a general response, this claim depends on the parameterization of the model (Theorem 1), and (b) for a natural exponential family response, this claim is true only for the Gaussian distribution (Corollary 1).
Let $\beta_j = Z_j \beta_j^c$, where $\beta_j^c \in \mathbb{R}^{k_j - 1}$ and $Z_j \in \mathbb{R}^{k_j \times (k_j - 1)}$ is defined as the trailing $k_j - 1$ columns, $Z_j = H_{j, 2:k_j}$, of the "Householder" matrix $H_j$ that reflects $c_j$ onto the first coordinate axis. Any $\beta_j = Z_j \beta_j^c$ automatically satisfies $c_j^\top \beta_j = 0$. Therefore, the linear constraint $c_j^\top \beta_j = 0$ is implemented by replacing $\beta_j$ by $\beta_j^c$, and (hence) $B_j$ by $B_j^c = B_j Z_j$, and proceeding as described in Section 2.2. The combined constraint on all the $f_j$ is $C^\top \beta = 0$, with $\beta = Z \beta^c$, where $C$ and $Z$ are the block-diagonal matrices built from the $c_j$ and $Z_j$. This process is referred to by Wood (2017) as "absorbing the constraint into the basis".
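A minimal R sketch of this absorption (our own illustration; qr() uses Householder reflections, so the trailing columns of the complete Q factor play the role of $Z_j$):

    library(splines)
    set.seed(2)
    n <- 200
    x <- sort(runif(n))
    B <- bs(x, df = 10, intercept = TRUE)
    c_stz <- crossprod(B, rep(1, n))           # sum-to-zero constraint vector c = B' 1
    Z <- qr.Q(qr(c_stz), complete = TRUE)[, -1, drop = FALSE]
    max(abs(crossprod(c_stz, Z)))              # ~0: any beta = Z beta_c satisfies c' beta = 0
    B_c <- B %*% Z                             # constrained basis with k - 1 columns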
To avoid repetition, the superscript $c$ is used to denote a quantity defined in Section 2.2 with $B_j$ replaced by $B_j^c$. The vector form of the model is now $\eta = \alpha 1_n + B^c \beta^c$, where $B^c = (B_1^c : \cdots : B_J^c)$ and $(\alpha, \beta^{cR}, \beta^{cF}) \in \mathbb{R}^{1 + k - J}$ are the parameters to be estimated. Imposing constraints that remove the invariance of the model to the addition of arbitrary constants to each $f_j$ removes $1_n$ from the range space of each $B_j^c$. This makes $(1_n : B^{cF})$ nonsingular, and hence the parameters in the constrained model are identifiable.
The estimated smooth functions are $\hat f_j(X_j) = B_j \hat\beta_j$, $j \in \{1, \ldots, J\}$, and now depend on the constraints through $\hat\beta$. The choice of constraint affects the standard error of $\hat f_j(x_j)$. For any $x \in \mathbb{R}^m$ and $m \in \mathbb{N}$, the variance of the estimated vector of function evaluations $\hat f_j(x)$ is approximated as $\widehat\Sigma_j(c_j) = B_j(x) \widehat{\operatorname{Var}}(\hat\beta_j) B_j(x)^\top$, where $B_j(x)$ denotes the basis matrix evaluated at $x$. Pointwise variances are given by the diagonal elements. One strategy for choosing a constraint is to minimize the total variance attributed to the estimated function, $\operatorname{tr}\{\widehat\Sigma_j(c_j)\}$. Theorem 1 characterizes when centring constraints are optimal.
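The sketch below (ours, for a Gaussian response, with the smoothing parameter lam and error standard deviation sigma treated as known) computes $\operatorname{tr}\{\widehat\Sigma(c)\}$ for a given constraint vector, taking the covariance to be the inverse curvature of the penalized log-likelihood at its mode, as described in Section 2.2.

    tr_var <- function(c_vec, B, S, lam, sigma = 1) {
      Z   <- qr.Q(qr(matrix(c_vec)), complete = TRUE)[, -1, drop = FALSE]
      B_c <- B %*% Z                             # constrained basis
      S_c <- crossprod(Z, S %*% Z)               # constrained penalty
      X   <- cbind(1, B_c)                       # intercept plus constrained basis
      P   <- rbind(0, cbind(0, lam * S_c))       # no penalty on the intercept
      V   <- sigma^2 * solve(crossprod(X) + P)   # inverse curvature at the mode
      V_b <- V[-1, -1, drop = FALSE]             # block corresponding to beta_c
      sum(diag(B_c %*% V_b %*% t(B_c)))          # tr(Sigma(c)) at the observed covariates
    }

For example, with B, S and c_stz as constructed above, tr_var(c_stz, B, S, lam = 1) returns the across-the-function variance under the centring constraint.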
A referee pointed out that the dependence of Theorem 1 on $G \propto I_n$ implies that a reparameterization of the model could change the optimality of the centring constraint; the centring constraint is optimal if the log-likelihood is a quadratic function of the linked parameter $h(\mu)$, and this depends on both the choice of $h$ and the response distribution. It is of practical interest to consider more explicitly a class of common GAMs, and the implications of Theorem 1 therein.
Suppose that the $y_i$ independently follow natural exponential family distributions, as defined by Lehmann & Casella (1998, Eq. 5.2):
$$p(y_i; \eta_i, \phi) = \exp\left\{ \frac{y_i \eta_i - b(\eta_i)}{\phi} \right\} c(y_i, \phi). \qquad (1)$$
Here $\eta_i$ is the natural parameter that is modelled additively, and $\phi$ is a single fixed dispersion parameter. For simplicity, we take the statistics $T(y_i)$ defined in Lehmann & Casella (1998, Eq. 5.2) to be $T(y_i) = y_i$. Distributions having densities of the form specified in Equation (1) include the Gaussian, Poisson, binomial and gamma families. However, the only such distributions for which $G \propto I_n$ are the Gaussians.
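To illustrate, write $G$ for the curvature (negative Hessian) of the log-likelihood in the linked parameter $\eta = (\eta_1, \ldots, \eta_n)$, as in the discussion of Theorem 1 above. Under (1), the contribution of observation $i$ to the log-likelihood is $\{y_i \eta_i - b(\eta_i)\}/\phi$ plus a term not involving $\eta_i$, so $G = \operatorname{diag}\{b''(\eta_1), \ldots, b''(\eta_n)\}/\phi$. For the Gaussian, $b(\eta) = \eta^2/2$, so $b''(\eta) \equiv 1$ and $G = I_n/\phi \propto I_n$ regardless of the data; for the Poisson, $b(\eta) = e^\eta$, so $G = \operatorname{diag}(\mu_1, \ldots, \mu_n)/\phi$, which is proportional to $I_n$ only in the degenerate case of equal means.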
Lemma 1. Suppose that $y_i$, $i \in \{1, \ldots, n\}$, independently follow a distribution having density (1) for some choices of the functions $b$ and $c$ and parameter $\phi$. Then $G \propto I_n$ if and only if the $y_i$ are Gaussian.

The proofs of Theorem 1 and Lemma 1 are found in the Appendix. In view of Lemma 1, Corollary 1 specializes Theorem 1 to GAMs with a natural exponential family response, and follows immediately from Theorem 1 and Lemma 1:

Corollary 1. Suppose that $y_i$, $i \in \{1, \ldots, n\}$, independently follow a distribution having density of the form specified in Equation (1) for some choices of the functions $b$ and $c$ and parameter $\phi$. Then the following statements hold: (a) if the $y_i$ are Gaussian, centring constraints minimize the across-the-function variance of the estimated smooth functions; and (b) if the $y_i$ are not Gaussian, there exists a constraint whose across-the-function variance is smaller than that of the centring constraint. We conclude that the only natural exponential family distribution for which centring constraints are optimal is the Gaussian.

EMPIRICAL RESULTS
In this section, we report an empirical investigation of (a) the statistical properties of estimates of $f_j(x)$ and $h(\mu_i)$ under various constraints for Gaussian data, and (b) the existence of constraints leading to smaller standard errors for $\hat f_j(x)$ in the case of non-Gaussian data. We find that (a) estimates of $f_j(x)$ and $h(\mu_i)$ have close to nominal coverage irrespective of the constraint used, with centring constraints leading to the narrowest confidence bands, and (b) constraints leading to smaller standard errors for $\hat f_j(x)$ can be found. Both conclusions are predicted by Theorem 1.
In the case of (b), the empirical difference appears large for small-sample binary data (Table 2), and otherwise small. Code to reproduce all results that we report is provided at https://github.com/awstringer1/identifiability-gam-paper-code.
In this simulation study, we consider three different constraints: the centring constraint, $c_j^{\mathrm{stz}} = B_j^\top 1_n$, and two point constraints, $c_j^{\mathrm{pc}} = (b_{j1}(x_0), \ldots, b_{jk_j}(x_0))^\top$, where $x_0 = \bar{x}_j$ or $x_0 = 0.1$. These correspond to the functional constraints $\int_0^1 f_j(x)\,\mathrm{d}x = 0$ and $f_j(x_0) = 0$. To summarize the results, we report the average across-the-function root mean square error $\mathrm{RMSE}_j$, the empirical coverage $\mathrm{covr}_j$ of pointwise 95% confidence intervals, which should cover the true value of $f_j(x_i)$ in approximately 95% of the simulated cases, and the average across-the-function standard error $\mathrm{se}_j$. The average and empirical standard deviation of each of $\mathrm{RMSE}_j$, $\mathrm{covr}_j$ and $\mathrm{se}_j$, $j = 1, 2$, are reported for each of the three constraints, across the simulations. We expect that (a) $\mathrm{RMSE}_j$ and $\mathrm{covr}_j$ should be similar for each constraint, but (b) $\mathrm{se}_j$ should be smallest for $c^{\mathrm{stz}}$ and larger for the two $c^{\mathrm{pc}}$ constraints.
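The three constraint vectors can be formed from a basis matrix as in the following sketch (ours; predict() re-evaluates the B-spline basis at new points using the knots of the original basis):

    library(splines)
    n <- 500
    x <- sort(runif(n))
    B <- bs(x, df = 10, intercept = TRUE)
    c_stz  <- crossprod(B, rep(1, n))   # centring: sum_i f(x_i) = 0
    c_pc_m <- t(predict(B, mean(x)))    # point constraint at x0 = mean(x)
    c_pc_e <- t(predict(B, 0.1))        # point constraint at x0 = 0.1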
Our observed results are summarized in Table 1. For both functions, the estimated mean RMSE and corresponding coverage proportions are close for the three constraints, with overlapping 95% confidence intervals based on the results from the 1000 simulations. The average across-the-function standard errors are always smallest for the sum-to-zero constraints, and the confidence intervals for the three constraints do not overlap. These results are consistent with Theorem 1(a) and Corollary 1(a).

Alternative Constraints for Non-Gaussian Exponential Family Response
Corollary 1(b) states that, for non-Gaussian responses from natural exponential families, there always exists a constraint leading to an estimate $\hat f_j$ with lower across-the-function standard error than under the sum-to-zero constraint. It is of interest to assess how much smaller this standard error might typically be. However, the result is not constructive, and yields no method for finding such a constraint in practice. Here we obtain such constraints by numerically minimizing the across-the-function variance with respect to $c_j$ for each of 1000 simulated sets of Poisson and binary data, with the aim of assessing whether such constraints yield a practically meaningful difference in confidence interval width. We observe that the centring constraint is practically indistinguishable from the optimal ones in these cases, and conclude that the centring constraint remains a reasonable default choice in practice, even for distributions for which it is not theoretically optimal.
We generate 1000 simulated data sets from (a) $y_i \sim \mathrm{Poisson}(\mu_i)$ with $\ln \mu_i = 1 + f(x_i)$, and (b) $y_i \sim \mathrm{Bernoulli}(p_i)$ with $\ln\{p_i/(1 - p_i)\} = 0 + 3f(x_i)$, with the single smooth function $f(x) = \sin(2\pi x)$. We computed the sum-to-zero constraint $c^{\mathrm{stz}} = B^\top 1_n$ and the optimal constraint $\hat c = \operatorname{argmin}_c \operatorname{tr}\{\widehat\Sigma(c)\}$ using numerical optimization. Table 2 reports the difference in scaled across-the-function standard deviation, $\mathrm{SE}(c^{\mathrm{stz}}) - \mathrm{SE}(\hat c)$. In every simulation, a constraint producing a narrower across-the-function standard deviation than $c^{\mathrm{stz}}$ was found. The corresponding difference in the estimated standard errors only appears to be large in the case of small-sample binary data, where the use of optimal constraints may lead to more precise inferences.
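A sketch of this search for the Poisson case follows (our own implementation, not the code in the repository cited in Section 4; the smoothing parameter lam is held fixed, and the covariance is again the inverse curvature at the penalized mode):

    library(splines)
    set.seed(3)
    n <- 200
    x <- sort(runif(n))
    y <- rpois(n, exp(1 + sin(2 * pi * x)))
    B <- bs(x, df = 10, intercept = TRUE)
    S <- crossprod(diff(diag(10), differences = 2))
    lam <- 1                                      # smoothing parameter, treated as known

    tr_var_pois <- function(c_vec) {
      Z <- qr.Q(qr(matrix(c_vec)), complete = TRUE)[, -1, drop = FALSE]
      X <- cbind(1, B %*% Z)
      P <- rbind(0, cbind(0, lam * crossprod(Z, S %*% Z)))
      beta <- rep(0, ncol(X))
      for (it in 1:50) {                          # penalized IRLS, Poisson log link
        eta <- drop(X %*% beta)
        mu  <- exp(eta)
        z   <- eta + (y - mu) / mu                # working response
        H   <- crossprod(X, mu * X) + P           # penalized curvature
        beta_new <- drop(solve(H, crossprod(X, mu * z)))
        if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
        beta <- beta_new
      }
      V  <- solve(H)[-1, -1, drop = FALSE]        # inverse curvature, beta_c block
      BZ <- B %*% Z
      sum(diag(BZ %*% V %*% t(BZ)))               # tr(Sigma(c))
    }

    c_stz <- drop(crossprod(B, rep(1, n)))
    opt <- optim(c_stz, tr_var_pois)              # Nelder-Mead search over c
    c(centring = tr_var_pois(c_stz), optimal = opt$value)

Because the search starts at the centring constraint, the optimized value can be no larger than the centring value.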

PRACTICAL ILLUSTRATION
Wood (2017, Section 5.3) builds and validates a generalized additive model for n = 5,111 daily death counts as a function of air pollution and temperature in Chicago over an unspecified period. The data are originally from the National Morbidity, Mortality, and Air Pollution Study (Peng & Welty, 2004) and were obtained as the object chicago in the R package gamair (Wood, 2017). Daily death counts are modelled as Poisson random variables with a log-mean depending on smooth functions of time (number of days, centred), ozone, temperature and PM10 concentration.
We work with the final model described by Wood (2017, p. 248), with log-mean depending on $f_1(\mathrm{time})$, $f_2(\mathrm{o3}, \mathrm{tmp})$ and $f_3(\mathrm{pm10})$, and focus on the univariate smooths for time and PM10 concentration. We fitted this model with both sum-to-zero constraints $c_1^{\mathrm{stz}}, c_3^{\mathrm{stz}}$ and point constraints $c_1^{\mathrm{pc}}, c_3^{\mathrm{pc}}$ at the mean of each variable. The data were too large for finding the optimal constraint to be feasible, but the simulation study reported in Section 4.2 suggests that the result is not likely to be practically different from that of the sum-to-zero constraint. Both fits yielded identical estimates and standard errors for the log-mean at the observed data points, emphasizing that the constraints affect only the vertical scaling and uncertainty quantification of the estimated smooth functions, and not the corresponding model fitted values.
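A schematic mgcv fit of this kind is sketched below. This is a simplified stand-in rather than Wood's final specification, which additionally involves lagged aggregates of the pollution variables (Wood, 2017, p. 248); the variable names are those of the gamair chicago data frame.

    library(mgcv)
    library(gamair)
    data(chicago)
    fit <- gam(death ~ s(time, k = 200) + te(o3median, tmpd) + s(pm10median),
               data = chicago, family = poisson, method = "REML")
    plot(fit, pages = 1)   # smooths plotted under mgcv's default centring constraints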
Figure 1 shows the fitted curves, vertically scaled so that the estimates are identical, but with different standard errors, even at the large sample size of n = 5,111. Figure 1b shows the nearly linear estimate of $\hat f_3(\mathrm{pm10})$, which has almost identical standard errors under the two types of constraints. The difference in Figure 1a is more noticeable for the very wiggly estimate of $\hat f_1(\mathrm{time})$. The difference in standard errors between the sum-to-zero and point constraints is larger at the peaks of the estimated curve, where there is more uncertainty about the function's shape, than in the areas between peaks, where the function is smoothly increasing or decreasing.

DISCUSSION
We defined the $f_j$ to be univariate functions in order to keep the notation clear. Generalized additive models also allow for multivariate smooths to accommodate covariate interactions. The results in this article apply to multivariate $f_j$, as described by Wood, Scheipl & Faraway (2013).
A restriction $S_j 1_{k_j} = 0_{k_j}$ was placed on the penalty matrices $S_j$. This is a direct result of considering roughness penalties as defined by Green & Silverman (1994). The need for identifiability constraints comes from the rank-deficiency of $S_j$. If $S_j$ is full rank, identifiability constraints are not required, and the results that we have outlined in the present article do not apply.
It is plausible that the optimality of the constraint depends on the departure from log-quadraticity of the response likelihood, as measured by how far $G$ is from a multiple of the identity matrix. The Poisson would be closer than the Bernoulli in this sense, and this is reflected in the experiments in Section 4. The construction of optimal linear constraints for general response distributions remains an interesting open problem.
A concept closely related to identifiability is concurvity, the additive model analogue of collinearity in linear regression models. The definition of concurvity originally given by Hastie & Tibshirani (1990, p. 121) applies to linear smoothers of the form $\hat f = Sy$ for some matrix $S$, although their case studies suggest that the concept applies more broadly. See Amodio, Aria & D'Ambrosio (2014) for a more recent exposition. The identifiability conditions described in Sections 2.2 and 2.3 depend on the matrix of basis function weights evaluated at the observed data being nonsingular. While identifiability of the intercept was the primary focus of the present article, this condition may also be used to detect concurvity in practice, by checking the conditioning of $B^{cF}$ after constraints have been applied, as sketched below. Wood, Scheipl & Faraway (2013, Section 4) point out that, when implemented as described in this article, different constraints can lead to different likelihoods. In cases where this property is undesirable, they recommend constraining only $\beta_F$, because $\beta_R$ is always identifiable. In view of Section 2.3, this outcome can be achieved by manually modifying $B^F$ until $1_n$ is no longer in its range space. Explicit details are given in the Appendix.
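A minimal sketch of this diagnostic (the function name and interpretation are ours): compute the condition number of $(1_n : B^{cF})$ from its singular values; very large values indicate near-concurvity, and an effectively infinite value indicates an unidentifiable model.

    cond_check <- function(BcF) {
      d <- svd(cbind(1, BcF))$d   # singular values of (1 : B^cF)
      max(d) / min(d)             # condition number; huge values signal concurvity
    }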

FIGURE 1: Estimates (solid) and standard errors under the sum-to-zero (solid) and point (dotted) constraints $c^{\mathrm{stz}}$ and $c^{\mathrm{pc}}$ for the Chicago air pollution data of Section 5, scaled vertically to have identical point estimates.

TABLE 1: Mean and standard deviation over 1000 simulations of three performance measures for the sum-to-zero constraints ($c^{\mathrm{stz}}$, bold) and point constraints ($c^{\mathrm{pc}}$) at the covariate mean $x_0 = \bar{x}$ and at a point near the end of the covariate range, $x_0 = 0.1$, for the Gaussian data simulation of Section 4.1. The sum-to-zero constraints have the lowest empirical standard error out of the constraints considered.

TABLE 2: Average and standard deviation of the difference in scaled across-the-function standard deviation between the sum-to-zero ($c^{\mathrm{stz}}$) and optimal ($\hat c$) constraints, based on 100 simulated Poisson and Bernoulli data sets at each sample size, as described in Section 4.2.