### Discussion on the paper by Rigby and Stasinopoulos


Peter W. Lane (GlaxoSmithKline, Harlow)

I congratulate Robert Rigby and Mikis Stasinopoulos on their addition to the toolbox for analytical statistics. They have clearly been working towards the present generality of the generalized additive model for location, scale and shape for several years and have developed the supporting theory in conjunction with a software package in the public domain R system. The model includes many of the modelling extensions that have been introduced by researchers in the past few decades and provides a unifying framework for estimation and inference. Moreover, they have found other directions in which to extend it themselves, allowing for modelling of further parameters beyond the mean and variance and with a much wider class of distributions.

This is a very extensive paper, and it would take much longer than the time that is available today to get to grips with the many ideas and issues that are covered. Two particular aspects encourage me to go away to experiment with the new tool. One is the inclusion of facilities for smooth terms, which have much potential for practical use in handling relationships that must be adjusted for, without the need for a parametric model. I am particularly glad to see facilities for smoothing made available as an integrated part of a general model, unlike the approach that is taken in some statistical software. The other aspect is the provision for non-linear hyperparameters, which I experimented with myself in a class I called generalized non-linear models and made available in GenStat (Lane, 1996). The structure of the generalized additive model for location, scale and shape allows such parameters to be estimated by non-linear algorithms, involving the inevitable concerns over details of the search process, without having to sacrifice the benefits of not having these concerns within the main generalized additive parts of the model.

I am surprised not to see the beta distribution included in the very extensive list of available distributions. In fact, none of the distributions that are listed there are suitable for the analysis of continuous variables observed in a restricted range. In pharmaceutical trials in several therapeutic areas, responses from patients are gathered in the form of a visual analogue scale. This requires patients to mark a point on a line in the range [0,1] to represent some aspect under study, such as their perception of pain. Some of my colleagues (Wu *et al.*, 2003) have investigated the analysis of such data by using the beta distribution, and it would be useful to see how to fit this into the general scheme.

I am very pleased to see that facilities for model checking are also provided and feature prominently in the illustrative examples in this paper. These are invaluable in helping to understand the fit of a model, and in highlighting potential problems.

I would like to raise three concerns with the paper. The main one is with the use of maximum likelihood for fitting models with random effects. I am under the impression that such an approach in general leads to biased estimators, and that it is preferable to use restricted maximum likelihood. This strikes me as indicating that the generalized linear mixed model and hierarchical generalized linear model approaches are more appropriate for those problems that come within their scope.

My experience with general tools for complex regression models has given me a sceptical outlook when presented with a new one. All too often, I have found that models cannot be applied in practice without extensive knowledge of the underlying algorithms to cope with difficulties in start-up or convergence. As a result, the apparent flexibility of a tool cannot actually be used, and I have to make do with a simpler model than I would like because of difficulties that I cannot overcome. I fear that the disclaimer in Section 5 about potential problems with the likelihood approach for these very general models may signal similar difficulties here. It is noticeable that three of the illustrative examples involve large numbers of observations (over 1000) and the other two, still with over 200 observations, have few parameters.

I am also concerned by the arbitrary nature of the generalized Akaike information criterion that is suggested for comparing models. The examples use three different values, 2.0, 2.4 and 3.0, for what I can only describe as a ‘fudge factor’, and they include no comment on why these values are used rather than any others. I am aware that, with large data sets, automatic methods of model selection tend to lead to the inclusion of more model terms than are needed for a reasonable explanation; we need a better approach than is offered by these information criteria.

However, I appreciate that most of my concerns can probably be levelled at any scheme for fitting a wide class of complex models. So I am happy to conclude by proposing a vote of thanks to the authors for a stimulating paper and a new modelling tool to experiment with.

Simon Wood (University of Glasgow)

I would like to start by congratulating the authors on a very interesting paper, reporting an impressive piece of work. It is good to see sophisticated approaches to the modelling of the mean being extended to other moments.

The paper is thought provoking in many ways, but I am particularly interested in picking up on Section 3.2.3 and considering how the use of penalized regression spline terms might lead to some simplifications, and perhaps improvements, with respect to fitting algorithms and inference for at least some models in the generalized additive model for location, scale and shape class. For example, if the body mass index model of Section 7.1 is represented by using relatively low rank cubic regression spline bases for *h*_{1}–*h*_{4} then equation (8) can be rewritten as

where the **A**^{[j]} are model matrices and the *θ*^{[j]} are coefficients to be estimated. If *θ*^{′}=(*θ*^{[1]′},*θ*^{[2]′},*θ*^{[3]′},*θ*^{[4]′}), then the associated penalty on each *h*_{j} can be written as . Given smoothing parameters *λ*_{j}, model estimation can then proceed by direct maximization of the penalized likelihood of the model

by using Newton's method, for example. Following the authors, smoothing parameter estimation by the generalized Akaike information criterion (GAIC) is also straightforward (if computationally time consuming), given that the estimated degrees of freedom for each model parameter are

where **H**_{θ} is the negative Hessian of the unpenalized likelihood with respect to *θ*. Furthermore, an approximate posterior covariance matrix for the model coefficients can be derived:

Fig. 14 illustrates the results of applying this approach and should be compared with Fig. 2 of Rigby and Stasinopoulos. All computations were performed using R 2.0.0 (R Development Core Team, 2004). For this example, *h*_{1} was represented by using a rank 20 cubic regression spline whereas *h*_{3} and *h*_{4} were each represented by using rank 10 cubic regression splines (class cr smooth constructor functions from R library mgcv were used to set up the model matrices and penalty matrices). Given smoothing parameters, the penalized likelihood was maximized by Newton's method with step halving, backed up by steepest descent with line searching, if the Hessian of the penalized log-likelihood was not negative definite. Constants were used as starting values for the functions, these being obtained by fitting a model in which *h*_{1}–*h*_{4} were each assumed to be constant. Rapid convergence is facilitated by first conditioning on a moderate constant value for *h*_{4} and optimizing only *h*_{1}–*h*_{3}. The resulting *h*_{1}–*h*_{3}-estimates were used as starting values in a subsequent optimization with respect to all the functions. The two-stage optimization helps because of the flatness of the log-likelihood with respect to changes in *h*_{4} corresponding to *τ*>30. This penalized likelihood maximization was performed using ‘exact’ first and second derivatives. The smoothing parameters were estimated by GAIC minimization, with #=2. The GAIC was optimized by using a quasi-Newton method with finite differenced derivatives (R routine optim). The estimated degrees of freedom for the smooth functions were 20, 9.2, 5.9 and 7.4, which are higher than those which were obtained in Section 7.1, since I used #=2 rather than #=2.4. Fitting required around a fifth of the time of the gamlss package, and with some optimization and judicious use of compiled code a somewhat greater speed up might be expected.
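The degrees-of-freedom and covariance calculations described above can be illustrated numerically. The following is a minimal Python sketch under stated assumptions: the penalty is taken to be quadratic with penalty matrix **S** (smoothing parameters absorbed into **S**), the effective degrees of freedom of each coefficient are taken to be the diagonal elements of (**H**+**S**)^{−1}**H**, and (**H**+**S**)^{−1} is used as the approximate covariance matrix. These forms, the helper names `solve` and `effective_df`, and the toy 2×2 matrices are illustrative assumptions and are not taken from the contribution itself.

```python
def solve(a, b):
    """Solve a x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    # Augment each row of a with the matching row of b.
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def effective_df(hessian, penalty):
    """Diagonal of (H + S)^{-1} H: one effective df per coefficient (assumed form)."""
    n = len(hessian)
    h_plus_s = [[hessian[i][j] + penalty[i][j] for j in range(n)] for i in range(n)]
    f = solve(h_plus_s, hessian)
    return [f[i][i] for i in range(n)]


# Toy (hypothetical) negative Hessian of the unpenalized log-likelihood.
H = [[4.0, 1.0], [1.0, 3.0]]
no_penalty = [[0.0, 0.0], [0.0, 0.0]]
S = [[0.0, 0.0], [0.0, 3.0]]  # penalizes the second coefficient only

edf_unpenalized = effective_df(H, no_penalty)  # each coefficient counts as 1 df
edf_penalized = effective_df(H, S)             # second coefficient is shrunk
covariance = solve([[H[i][j] + S[i][j] for j in range(2)] for i in range(2)],
                   [[1.0, 0.0], [0.0, 1.0]])   # approximate (H + S)^{-1}
```

With no penalty every coefficient contributes one full degree of freedom; as the penalty grows, the contribution of the penalized coefficient falls towards 0, which is the behaviour that a GAIC-based choice of smoothing parameters trades off against the fit.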

So the direct penalized regression approach to the generalized additive model for location, scale and shape class may have the potential to offer some computational benefits, as well as making approximate inference about the uncertainty of the model components quite straightforward. Clearly, then, this is a paper which not only presents a substantial body of work but also suggests many further areas for exploration, and it is therefore a pleasure to second the vote of thanks.

The vote of thanks was passed by acclamation.

M. C. Jones (The Open University, Milton Keynes)

It is excellent to see three- and four-parameter distributional families being employed for continuous response variables in the authors’ general models. My comments on this fine and important paper focus on Section 4.2.

First, there are three-parameter symmetric families on ℜ, the third parameter, in addition to location and scale which are here set to 0 and 1, controlling the tail weight. The Student *t*-family, of course, ranges from the normal distribution to distributions with very heavy, power, tails, such as the Cauchy distribution. The power exponential family, in contrast, ranges from as light tailed a distribution as the uniform through the normal and double-exponential distributions, but retaining an exponential nature in the tails. Rider's (1958) rather overlooked distribution with density

might be considered as it ranges all the way from uniform to power tails (including the Cauchy but not the normal).

Second, there are four-parameter families on ℜ, additionally allowing skewness. Johnson's *S*_{U}-family is cited: its simple skewness device, which works fine for the inverse sinh transformation, does not always accommodate skewness attractively, because for other transformations skewness can increase and then decrease again as *ν* (now denoting the skewness parameter) increases. The symmetric *S*_{U}-distribution, which has the normal as least heavy-tailed member, can be extended much of the way towards uniformity by what I call the ‘sinh–arcsinh’ distribution:

This seamlessly unites the best aspects of the Johnson *S*_{U}- and Rieck and Nedelman (1991) sinh–normal distributions. It can readily be ‘skewed’ through

The other three-parameter distributions that were mentioned earlier can be ‘skewed’ either by special means such as Jones and Faddy (2003) for the *t*-distribution or by

(similar to Nandi and Mämpel (1995)) for the exponential power (with a natural analogue for Rider's distribution). Alternatively, general skewing methods such as Azzalini's (1985) *g*(*y*)=2*f*(*y*)*F*(*λy*) and the two-piece approach

(Fernandez and Steel, 1998) can be considered.
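Azzalini's skewing device *g*(*y*)=2*f*(*y*)*F*(*λy*) can be checked numerically. The following minimal Python sketch takes *f* and *F* to be the standard normal density and distribution function, which is an illustrative choice; the function names and the trapezoidal integrator are likewise assumptions introduced here. It confirms that *g* integrates to 1 and that a positive *λ* moves the mean to the right.

```python
import math


def normal_pdf(y):
    """Standard normal density, used here as the symmetric base density f."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def normal_cdf(y):
    """Standard normal distribution function F."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def azzalini_skew_pdf(y, lam, f=normal_pdf, F=normal_cdf):
    """Skewed density g(y) = 2 f(y) F(lam * y) for a symmetric base (f, F)."""
    return 2.0 * f(y) * F(lam * y)


def integrate(g, lower, upper, steps=20000):
    """Composite trapezoidal rule on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.5 * (g(lower) + g(upper))
    for i in range(1, steps):
        total += g(lower + i * width)
    return total * width


lam = 3.0  # illustrative skewness parameter
total_mass = integrate(lambda y: azzalini_skew_pdf(y, lam), -12.0, 12.0)
mean = integrate(lambda y: y * azzalini_skew_pdf(y, lam), -12.0, 12.0)
```

Setting `lam = 0` returns the base density unchanged, so the symmetric family is nested within the skewed one.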

The remainder of the distributions in Section 4.2 live on ℜ^{+}. Three of the four employ the much overrated Box–Cox transformation. A big disadvantage, at least for the purist, is that the Box–Cox transformation requires messy truncated distributions for *z* with the truncation point depending on the parameters of the transformation. The authors recognize this elsewhere (Rigby and Stasinopoulos, 2004a,b). A better alternative, if one must take the transformation approach, might be to take logarithms and then to employ the wider families of distributions genuinely on ℜ, such as those above.

But there are many distributions on ℜ^{+} directly including, finally, the generalized gamma family. My only comment here is that this is a well-known family with a long history before an unpublished 2000 report, e.g. Amoroso (1925), Stacy (1962) and Johnson *et al*. (1994), section 8.7.

John A. Nelder (Imperial College London)

In my view the use of penalized likelihood (PL) for estimating (*σ*,*ν*,*τ*) as well as *μ* in expression (4) cannot be justified. The use of restricted maximum likelihood (REML) shows that different functions must be maximized to estimate mean and dispersion parameters. For another misunderstanding of this point see Little and Rubin (2002), section 6.3. The generalization of REML by Lee and Nelder (1996, 2001) shows that an adjusted profile *h*-likelihood (APHL) should be used for estimation of dispersion parameters. Consider the random-effect model: for *i*=1,…,*n* and *j*=1,2

where *u*_{i}∼*N*(0,*λ*) with a known *λ* and *e*_{ij}∼*N*(0,*σ*^{2}) are independent. Here their PL estimator gives with and , *w*=*λ*/(*λ*+*σ*^{2}/2) and . This has a serious bias, e.g. when *λ*=∞ (i.e. *w*=1). Lee and Nelder (2001) showed that the use of APHL in equation (15) gives a consistent REML estimator. PL has been proposed for fitting smooth terms such as occur in generalized additive models. However, in random-effect models the number of random effects can increase with the sample size, so the use of the appropriate APHL is important. If appropriate profiling is used the algebra for fitting dispersion is fairly complicated; I predict that for fitting kurtosis it will be enormously complicated.

Lee and Nelder use extended quasi-likelihood for more general models, where no likelihood is available: for its good performance see Lee (2004). When the model allows exact likelihoods they use them in forming the *h*-likelihood; even with binary data the *h*-likelihood method often produces the least bias compared with other methods, including Markov chain Monte Carlo sampling (Noh and Lee, 2004).

Youngjo Lee (Seoul National University)

I am unsure by how much the generalized additive model for location, scale and shape class is more general than the hierarchical generalized linear model (HGLM) class of models. Recently, the latter class has been extended to allow random effects in both the mean and the dispersion (Lee and Nelder, 2004). This class enables models with various heavy-tailed distributions to be explored, some of which may be new. Various forms of skewness can also be generated. Although this approach uses a combination of interlinked generalized linear models, it does not mean that we are restricted to the variance function, and higher cumulants, of exponential families.

For example some models in Table 1 can be easily written as instances of the HGLM class; their beta–binomial distribution becomes the binomial–beta HGLM, their negative binomial distribution the Poisson–gamma HGLM, their Pareto distribution the exponential–inverse gamma HGLM etc. For further examples, consider a model

where *ɛ*_{i}=*σ*_{i}*e*_{i}, *e*_{i}∼*N*(0,1), and

For *b*_{i} there are various possible distributions. For example, if where is a random variable with the *χ*^{2}-distribution with *α* degrees of freedom, then marginally the *ɛ*_{i} follow the *t*-distribution with *α* degrees of freedom. Alternatively we may assume that

An advantage of the normality assumption for *b*_{i} is that it allows correlations between *b*_{i} and *v*_{i}, giving an asymmetric distribution; further complicated random-effect models can be considered. For a more detailed discussion of this and the use of other distributions see Lee and Nelder (2004). In this way we can generate new models which have various forms of marginal skewness and kurtosis. It is not clear, however, that *ν* and *τ* in Rigby and Stasinopoulos's equation (4) can be called the skewness and kurtosis.
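The statement that a random dispersion can generate a heavy-tailed marginal distribution can be illustrated by simulation. The following minimal Python sketch assumes the standard scale-mixture construction, in which the variance of a normal error is *α*/*k* with *k* drawn from a *χ*^{2}-distribution with *α* degrees of freedom; the exact parameterization used in the contribution is not reproduced here, so this form, the function names and the choice *α*=10 are assumptions. The simulated variance is compared with the value *α*/(*α*−2) for a *t*-distribution with *α* degrees of freedom.

```python
import math
import random


def scale_mixture_draws(alpha, size, seed=1):
    """Draw errors sigma * e with e ~ N(0,1) and sigma^2 = alpha / chi2(alpha)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        # A chi-squared variable with alpha df is Gamma(shape=alpha/2, scale=2).
        k = rng.gammavariate(alpha / 2.0, 2.0)
        sigma = math.sqrt(alpha / k)
        draws.append(sigma * rng.gauss(0.0, 1.0))
    return draws


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    centre = sum(values) / n
    return sum((v - centre) ** 2 for v in values) / (n - 1)


alpha = 10.0  # illustrative degrees of freedom
errors = scale_mixture_draws(alpha, 100000)
simulated_variance = sample_variance(errors)
t_variance = alpha / (alpha - 2.0)  # variance of a t-distribution with alpha df
```

The simulated variance exceeds 1, the variance of the conditional normal error when its scale is fixed at 1, reflecting the extra tail weight introduced by the random dispersion.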

In summary, models in this paper can generate potentially useful new models, but these will require the proper use of *h*-likelihood if they are to be useful for inferences.

Mario Cortina Borja (Institute of Child Health, London)

I congratulate the authors for a very clear exposition of the foundations of a large class of models, and especially for providing a flexible computational tool to fit and analyse these models. One of the most appealing aspects of the R library that has been written by the authors is how easy it is to incorporate new distributions. I considered the von Mises distribution with density

$$f(\theta \mid \mu, \kappa) = \frac{\exp\{\kappa \cos(\theta - \mu)\}}{2\pi I_0(\kappa)},$$

where *I*_{0} is the modified Bessel function of the first kind and order 0, 0<*θ*≤2*π*, 0<*μ*≤2*π* is the location parameter and *κ*≥0 is the scale parameter; for *κ* large the distribution has a narrow peak, whereas if *κ*=0 the distribution is the uniform distribution on (0,2*π*]. This distribution is widely used to model seasonal patterns; it is a member of the exponential family and may be modelled in the context of the generalized additive model for location, scale and shape (GAMLSS) by using the link functions *μ*=2 tan^{−1}(LP) and *κ*=exp(LP), where LP is a linear predictor.
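The von Mises density and the two link functions mentioned above can be sketched in a few lines. The following minimal Python example uses the standard form of the density, exp{*κ* cos(*θ*−*μ*)}/{2*πI*_{0}(*κ*)}, with *I*_{0} evaluated by its power series; the function names and the truncation at 60 terms are assumptions made for illustration. Note that 2 tan^{−1}(LP) takes values in (−*π*,*π*), which covers the circle once and so is equivalent to (0,2*π*] modulo 2*π*.

```python
import math


def bessel_i0(kappa, terms=60):
    """Modified Bessel function I_0 via its power series (adequate for moderate kappa)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive series terms: (kappa/2)^2 / (k+1)^2.
        term *= (kappa / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density exp{kappa cos(theta - mu)} / (2 pi I_0(kappa)) on the circle."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def mu_from_predictor(lp):
    """Link for the location: mu = 2 arctan(LP), an angle in (-pi, pi)."""
    return 2.0 * math.atan(lp)


def kappa_from_predictor(lp):
    """Log link for the concentration: kappa = exp(LP) > 0."""
    return math.exp(lp)


# Illustrative linear-predictor values (hypothetical).
mu = mu_from_predictor(0.5)
kappa = kappa_from_predictor(0.7)
peak_density = von_mises_pdf(mu, mu, kappa)
uniform_density = von_mises_pdf(1.0, mu, 0.0)  # kappa = 0 gives 1 / (2 pi)
```

Both links map an unrestricted linear predictor into the valid parameter range, which is what allows additive terms to be placed on either parameter.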

As an example of using the GAMLSS to model circular responses, I have analysed the number of cases of sudden infant death syndrome (SIDS) in the UK by month of death between 1983 and 1998; these data appear in Mooney *et al.* (2003) and were corrected to 31-day months. Though it is not easy to decide on an optimal model, one strong contender, based on the Schwarz Bayesian criterion, fits a constant mean *μ* (indicating a peak incidence in January) and a natural cubic spline with three effective degrees of freedom as a function of year of death for the scale parameter *κ*. The fitted smooth curve for this parameter (Fig. 15) may reflect the effect of the ‘back-to-sleep’ campaign that was implemented in the early 1990s which reduced the number of SIDS cases by 70% in the UK; it corresponds to a dampening of the seasonal effect on SIDS.

Non-symmetric circular distributions and zero-inflated distributions can be modelled as mixtures, and I wonder whether it would be easy to implement these in a GAMLSS.

N. T. Longford (SNTL, Leicester)

This paper competes with Lee and Nelder (1996) and their extensions, conveying the message that for any data structure and associations that we could possibly think of there are models and algorithms to fit them. But now models are introduced even for some structures that we would not have thought of…. I want to rephrase my comment on Lee and Nelder (1996), which I regard as equally applicable to this paper. The new models are top of the range mathematical Ferraris, but the model selection that is used with them is like a sequence of tollbooths at which partially sighted operators inspect drivers' licences and roadworthiness certificates.

Putting the simile aside, let the alternative models that are considered in either of the examples be 1,…,*M*, and the estimators that would be applied, if model *m* were selected, $\hat{\theta}_m$, each of them unbiased for the parameter of interest *θ*, and having sampling variance estimated without bias by $\hat{s}_m^2$, *if* model *m* is appropriate: not when it is selected, but when it is valid! Model selection, by whichever criterion and sequence of model comparisons, leads to the estimator

$$\tilde{\theta} = \sum_{m=1}^{M} I_m \hat{\theta}_m ,$$

where *I*_{m} indicates whether model *m* is selected (*I*_{m}=1) or not (*I*_{m}=0). This is a mixture of the single-model-based estimators; in all but some degenerate cases it is biased for *θ*. Further, var($\tilde{\theta}$) is conventionally estimated by

$$\tilde{s}^2 = \sum_{m=1}^{M} I_m \hat{s}_m^2 ,$$

assuming that whichever model is selected is done so with certainty. This can grossly underestimate the mean-squared error of $\tilde{\theta}$, and does so not only because $\tilde{\theta}$ is biased. The distribution of the mixture $\tilde{\theta}$ is difficult to establish because the indicators *I*_{m} are correlated with $\hat{\theta}_m$.
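The gap between the conventionally reported variance and the actual mean-squared error of a post-selection estimator can be seen in a small simulation. The following minimal Python sketch uses a deliberately simple, hypothetical setting that is not taken from the contribution: a normal mean *θ* with known unit variance, a submodel that fixes *θ*=0, a full model that uses the sample mean, and selection by a two-sided test at the 5% level. All names and numerical settings are illustrative assumptions.

```python
import math
import random


def simulate_post_selection(theta=0.2, n=25, reps=20000, seed=1):
    """Compare the reported variance with the true MSE after model selection."""
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(n)  # standard error of the sample mean (unit variance)
    sum_estimate = sum_sq_error = sum_reported = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, se)           # sample mean under the true model
        full_selected = abs(xbar) / se > 1.96  # indicator I_m for the full model
        estimate = xbar if full_selected else 0.0
        # Variance reported as if the selected model had been fixed in advance.
        reported = se ** 2 if full_selected else 0.0
        sum_estimate += estimate
        sum_sq_error += (estimate - theta) ** 2
        sum_reported += reported
    return sum_estimate / reps, sum_sq_error / reps, sum_reported / reps


mean_estimate, mse, mean_reported_variance = simulate_post_selection()
```

In this setting the selected estimator is biased towards 0 and its mean-squared error is several times the average reported variance, because the selection indicator is correlated with the estimate itself.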

A misconception underlying all attempts to find *the* model is that the maximum likelihood assuming the most parsimonious valid model is efficient. This is only asymptotically so. For some parameters (and finite samples), maximum likelihood under some invalid submodels of this model is more efficient because the squared bias that is incurred is smaller than the reduction of the variance. Proximity to asymptotics is not indicated well by the sample size because information about the parameters for the distributional tail behaviour is relatively modest in the complex models engaged.

Adrian Bowman (University of Glasgow)

I congratulate the authors on a further substantial advance in flexible modelling. The generalized linear model represented a major synthesis of regression models by allowing a wide range of types of response data and explanatory variables to be handled in a single unifying framework. The generalized additive model approach considerably extended this by allowing smooth nonparametric effects to be added to the list of available model components. The authors have gone substantially further by incorporating the rich set of tools that has been created by recent advances in mixed models and, in addition, by allowing models to describe the structure of parameters beyond the mean. The end result is an array of models of astonishing variety.

One major issue which this complexity raises is what tools can be used to navigate such an array of models? The authors rightly comment that particular applications provide contexts which can give guidance on the structure of individual components. Where the aim is one of prediction, as is the case in several of the examples of the paper, criteria such as Akaike's information criterion and the Schwarz Bayesian criterion are appropriate. However, where interest lies in more specific aspects of model components, such as the identification of whether an individual variable enters the model in a linear or nonparametric manner, or indeed may have no effect, then prediction-based methods are less appropriate. Even with the usual form of generalized additive model, likelihood ratio test statistics do not have the usual *χ*^{2} null distributions and the problem seems likely to be exacerbated in the more complex setting of a generalized additive model for location, scale and shape.

In view of this, any further guidance which the authors could provide on how to interpret the global deviance column of Table 2, or more generally on appropriate reference distributions when comparing models, would be very welcome.

The following contribution was received in writing after the meeting.

T. J. Cole (Institute of Child Health, London)

I congratulate the authors on their development of the generalized additive model for location, scale and shape (GAMLSS). Its flexible approach to the modelling of higher moments of the distribution is very powerful and works particularly well with age-related reference ranges.

In my experience with the *LMS* method (Cole and Green, 1992), which is a special case of the GAMLSS, it is difficult to choose the effective degrees of freedom (EDFs) for the cubic smoothing spline curves as there is no clear criterion for goodness of fit (see Pan and Cole (2004)). In theory the authors’ generalized Akaike information criterion GAIC(#) (Section 6.2) provides such a criterion, but in practice it can be supremely sensitive to the choice of the hyperparameter #. I am glad that the authors chose to highlight this in their first example (Section 7.1). With #=2.4 the shape parameter *τ* was modelled as a cubic smoothing spline with 6.1 EDFs (Fig. 2), whereas with #=2.5 it was modelled as a constant. The two most well-known cases of the GAIC are the AIC itself (where #=2) and the Schwarz Bayesian criterion (SBC) (where #=log(*n*)=9.9 here), so the distinction between 2.4 and 2.5 is clearly tiny on this scale. The use of the SBC in the example would have led to a much more parsimonious model than for GAIC(2.5).

The take-home message is that, although optimal GAMLSSs are simple to fit conditional on #, the choice of # is largely subjective on the scale from 2 to log(*n*) and can affect the model dramatically. In my view # should reflect the sample size in some way, so I prefer the SBC to the AIC. In addition it is good practice to reduce the EDFs as far as possible (Pan and Cole, 2004), which comes to the same thing. I also wonder whether a different GAIC might be applied to the different parameters of the distribution, so that for example an extra EDF used to model the shape parameter should be penalized more heavily than an extra EDF for the mean.

The **authors** replied later, in writing, as follows.

We thank all the discussants for their constructive comments and reply below to the issues that were raised.

#### Distributions

An important advantage of the generalized additive model for location, scale and shape (GAMLSS) is that the model allows any distribution for the response variable *y*. In reply to Dr Borja, mixture distributions (including zero-inflated distributions) are easily implemented in a GAMLSS. For example, a zero-inflated negative binomial distribution (a mixture of zero with probability *ν* and a negative binomial NB(*μ*,*σ*) distribution with probability 1−*ν*) is easily implemented as a three-parameter distribution (e.g. with log-links for *μ* and *σ* and a logit link for *ν*). The beta distribution BE(*μ*,*σ*), which was suggested by Dr Lane, has now been implemented, as has an inflated beta distribution with additional point probabilities for *y* at 0 and 1.
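The zero-inflated negative binomial mixture described above can be written down directly. The following minimal Python sketch assumes that NB(*μ*,*σ*) denotes the negative binomial distribution with mean *μ* and variance *μ*+*σμ*^{2}; this parameterization, the function names and the numerical values of the linear predictors are assumptions made for illustration. The log-links and logit link follow the example given in the text.

```python
import math


def nb_pmf(y, mu, sigma):
    """Negative binomial pmf with mean mu and variance mu + sigma * mu^2 (assumed form)."""
    r = 1.0 / sigma
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + y * math.log(sigma * mu / (1.0 + sigma * mu))
             - r * math.log(1.0 + sigma * mu))
    return math.exp(log_p)


def zinb_pmf(y, mu, sigma, nu):
    """Mixture: zero with probability nu, NB(mu, sigma) with probability 1 - nu."""
    base = (1.0 - nu) * nb_pmf(y, mu, sigma)
    return nu + base if y == 0 else base


def inverse_logit(lp):
    """Inverse of the logit link, mapping a linear predictor into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical linear-predictor values for the three distribution parameters.
mu = math.exp(math.log(3.0))         # log-link for mu
sigma = math.exp(math.log(0.5))      # log-link for sigma
nu = inverse_logit(math.log(0.25))   # logit link for nu, giving nu = 0.2

total_probability = sum(zinb_pmf(y, mu, sigma, nu) for y in range(501))
mixture_mean = sum(y * zinb_pmf(y, mu, sigma, nu) for y in range(501))
```

The mean of the mixture is (1−*ν*)*μ*, so each of the three parameters can be given its own predictor without changing the structure of the likelihood.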

The exponential family distribution that is used in generalized linear, additive and mixed models usually has at most two parameters: a mean parameter *μ* and a scale parameter *φ* (=*σ* in our notation). Having only two parameters it cannot model skewness and kurtosis. The exponential family distribution has been approximated by using extended quasi-likelihood (see McCullagh and Nelder (1989)) and used in hierarchical generalized linear models (HGLMs) by Lee and Nelder (1996, 2001). However, extended quasi-likelihood is not a proper distribution, as discussed in Section 2.3 of the paper, and suffers from the same skewness and kurtosis restrictions as the exponential family. The range of distributions that are available in HGLMs is extended via a random-effect term. However, the GAMLSS allows any distribution for *y* and is conceptually simpler because it models the distribution of *y* directly, rather than via a random-effect term. The level of generality of the double HGLM will be clearer on publication of Lee and Nelder (2004).

The four-parameter distributions on ℜ that were discussed by Professor Jones can be implemented in a GAMLSS. The Box–Cox *t*- and Box–Cox power exponential distributions in the paper are four-parameter distributions on ℜ^{+} for which there are fewer direct contenders. They are easy to fit in our experience and provide generalizations of the Box–Cox normal distribution (Cole and Green, 1992), which is widely used in centile estimation, allowing the modelling of kurtosis as well as skewness. Users are also welcome to implement other distributions.

#### Restricted maximum likelihood

Dr Lane and Professor Nelder highlight the use of restricted maximum likelihood (REML) estimation for reducing bias in parameter estimation. In the paper, the random-effects hyperparameters *λ* are estimated by REML estimation, whereas the fixed effects parameters *β* and random-effects parameters *γ* are estimated by posterior mode estimation, conditional on the estimated *λ*. If the total (*effective*) degrees of freedom for estimating the random effects *γ* and the fixed effects *β*_{1} for the distribution parameter *μ* are substantial relative to the total degrees of freedom (i.e. the sample size), then REML estimation of the hyperparameters *λ* *and* the fixed effects (*β*_{2},*β*_{3},*β*_{4}) for parameters (*σ*,*ν*,*τ*) respectively may be preferred. This is achieved in a GAMLSS by treating (*β*_{2},*β*_{3},*β*_{4}) in the same way as *λ* in Appendix A.2.3 and obtaining the approximate marginal likelihood *l*(*ζ*_{1}) for *ζ*_{1}=(*β*_{2},*β*_{3},*β*_{4},*λ*) by integrating out *ζ*_{2}=(*β*_{1},*γ*) from the joint posterior density of *ζ*=(*β*,*γ*,*λ*), giving , where and , evaluated at , the posterior mode estimate of *ζ*_{2} given *ζ*_{1}. Hence REML estimation of *ζ*_{1} is achieved by maximizing *l*(*ζ*_{1}) over *ζ*_{1}. This procedure leads to REML estimation of the scale and shape parameters and the random-effects hyperparameters.

For example, in Hodges's data from Section 7.2 of the paper, the above procedure gives the following REML estimates (with the original estimates given in parentheses): and .

Alternatively, other methods of bias reduction, e.g. bootstrapping, could be considered.

#### Model selection

Dr Lane and Professor Cole highlight the issue of the choice of penalty # in the generalized Akaike information criterion GAIC(#) that is used in the paper for model selection. The use of criterion GAIC(#) allows investigation of the sensitivity of the selected model to the choice of penalty #. This is well illustrated in the Dutch girls’ body mass index (BMI) data example from Section 7.1. The resulting optimal effective degrees of freedom that were selected for *μ*,*σ*,*ν* and *τ* and the estimated parameter *ξ* in the transformation *x*=age^{ξ} are given in Table 4 for each of the penalties #=2, 2.4, 2.5, 9.9.

Table 4. Model selected by criterion GAIC(#) with penalty #

| # (*criterion*) | *df*_{*μ*} | *df*_{*σ*} | *df*_{*ν*} | *df*_{*τ*} | *ξ* |
|---|---|---|---|---|---|
| 2 (AIC) | 16.9 | 8.7 | 5.0 | 9.5 | 0.51 |
| 2.4 (GAIC) | 16.2 | 8.5 | 4.7 | 6.1 | 0.50 |
| 2.5 (GAIC) | 16.0 | 8.0 | 4.8 | 1 | 0.52 |
| 9.9 (SBC) | 12.3 | 6.3 | 3.7 | 1 | 0.53 |

The apparent sensitivity of df to # is due to the existence of two local optima. The value #=2.4 that is used in the paper is the critical value of # above which the optimization switches from one local optimum to the other. Reducing # below 2.4, or increasing # above 2.5, changes the selected degrees of freedom smoothly. Hence there are two clearly different models for the BMI, one corresponding to #≤2.4 and the other corresponding to #≥2.5.
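The way in which a small change in the penalty can switch the selected model is easy to reproduce with two fixed candidates. The following minimal Python sketch takes GAIC(#) to be the global deviance plus # times the total effective degrees of freedom; the two candidate models, their deviances and their degrees of freedom are hypothetical values chosen only so that the switch falls between 2.4 and 2.5, and are not results from the paper.

```python
def gaic(global_deviance, df, penalty):
    """GAIC(#) = global deviance + # * total effective degrees of freedom."""
    return global_deviance + penalty * df


def switch_penalty(dev_large, df_large, dev_small, df_small):
    """Penalty at which the larger and smaller models have equal GAIC."""
    return (dev_small - dev_large) / (df_large - df_small)


def select(models, penalty):
    """Name of the candidate with the smallest GAIC at the given penalty."""
    return min(models, key=lambda name: gaic(models[name][0], models[name][1], penalty))


# Hypothetical candidates: (global deviance, total effective df).
models = {
    "flexible": (1000.00, 36.0),
    "simple": (1012.25, 31.0),
}

critical = switch_penalty(*models["flexible"], *models["simple"])
chosen_at_2_4 = select(models, 2.4)
chosen_at_2_5 = select(models, 2.5)
```

Below the critical penalty the more flexible candidate is preferred and above it the simpler one, so a selected model can change abruptly even though each GAIC value varies smoothly with #.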

A sensitivity analysis of the chosen model to outliers shows that the non-constant *τ*-function for #=2.4 in Fig. 2(d), and in particular the minima in *τ* at 0 and 4 years (with corresponding peaks in the kurtosis) are due to substantial numbers of outliers in the age ranges 0–0.5 and 3–5 years respectively. Consequently we believe that these peaks in kurtosis may be genuine, requiring physiological explanation. We therefore recommend the chosen model for #=2.4 as in the paper.

In our opinion the Schwarz Bayesian criterion (SBC) is too conservative (i.e. restrictive) in its model selection, leading to bias in the selected functions for *μ*,*σ*,*ν* and *τ* (particularly at turning-points), whereas the AIC is too liberal, leading to rough (or erratic) selected functions. Fig. 16 gives the selected parameter functions using the AIC. Compare this with Fig. 14 of Simon Wood. The standard errors in Fig. 16 are conditional on the chosen degrees of freedom and *ξ*, and on the other selected parameter functions. The final selection of model(s) should be made with the expert prior knowledge of specialists in the field.

Conditioning on a single selected model ignores model uncertainty and generally leads to an under-estimation of the uncertainty about quantities of interest, as discussed in Section 6.1. This issue was also raised by Dr Longford. Clearly it is an important issue, but not the focus of the current paper.

Where the focus is on whether an explanatory variable, say *x*, has a significant effect (rather than on prediction), then for a parametric GAMLSS this can be tested by using the generalized likelihood ratio test statistic Λ, as discussed in Section 6.2. The inadequacy of a linear function in *x* can be established by testing a linear against a polynomial function in *x* using Λ. The statistic Λ may be used as a guide to comparing a linear with a nonparametric smooth function in *x*, although, as pointed out by Professor Bowman, the asymptotic *χ*^{2}-distribution no longer applies, and so a formal test is not available.

#### Algorithm convergence

Dr Lane highlights the issue of possible convergence problems. Occasional problems with convergence may be due to one of the following reasons: using a highly inappropriate distribution for the response variable *y* (e.g. a symmetric distribution when *y* is highly skewed), using an unnecessarily complicated model (especially for *σ*, *ν* or *τ*), using extremely poor starting values (which is usually overcome by fitting a related model and using its fitted values as starting values for the current model) or overshooting in the Fisher scoring (or quasi-Newton) algorithm (which is usually overcome for parametric models by using a reduced step length). Hence any convergence problems are usually easily resolved. The possibility of multiple maxima is investigated by using different starting values.
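The remedy of a reduced step length for overshooting can be illustrated on a one-parameter problem. The following minimal Python sketch maximizes the toy function −log{cosh(*x*)}, for which a full Newton step from *x*=2 overshoots the maximum at 0 and makes the objective worse; halving the step until the objective does not decrease restores convergence. The toy objective and all function names are assumptions made for illustration, and the sketch is not the algorithm that is implemented in the gamlss package.

```python
import math


def loglik(x):
    """Toy concave objective with its maximum at x = 0."""
    return -math.log(math.cosh(x))


def grad(x):
    """First derivative of the toy objective."""
    return -math.tanh(x)


def hess(x):
    """Second derivative of the toy objective (always negative)."""
    return -1.0 / math.cosh(x) ** 2


def newton_step_halving(start, tol=1e-10, max_iter=100):
    """Newton's method in which each step is halved until the objective does not fall."""
    x = start
    for _ in range(max_iter):
        step = -grad(x) / hess(x)
        # Reduce the step length while the full step would make things worse.
        while loglik(x + step) < loglik(x) and abs(step) > tol:
            step *= 0.5
        x += step
        if abs(step) < tol:
            break
    return x


start = 2.0
full_newton_point = start - grad(start) / hess(start)  # overshoots the maximum
solution = newton_step_halving(start)
```

The same safeguard applies when several parameters are updated together: the direction is kept and only the step length is reduced.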

#### Extensions to generalized additive models for location, scale and shape

The GAMLSS has been extended to allow for non-linear parametric terms, non-normal random-effects terms, correlations between random effects for different distribution parameters and the incorporation of priors for *β* and/or *λ*.

#### Conclusion

The GAMLSS provides a very general class of models for a univariate response variable, presented in a unified and coherent framework. The GAMLSS allows any distribution for the response variable and allows modelling of all the parameters of the distribution. The GAMLSS is highly suited to flexible data analysis and provides a framework that is suitable for educational objectives.