### Discussion on the paper by Ramsay, Hooker, Campbell and Cao

**Arne Kovac** (*University of Bristol*)

Estimation of parameters of an ordinary differential equation (ODE) from noisy data is an exciting area and we have to thank the authors for bringing this challenging problem to our attention. One reason why I think that this topic is so interesting is that many applications in science and engineering employ differential equations to model relationships between variables, and one of the strengths of this paper is that it shares so many examples. Another reason is that it gives rise to a difficult optimization problem where the target function is usually not convex and can have many local minima. Finally, given how natural the desire is to determine suitable values for the parameters of an ODE, it is all the more surprising that this topic is relatively unexplored.

Although the ‘discovery’ of this problem is certainly the highlight of this paper, the particular approach that is followed by the authors and the use of regularization in this context are another interesting contribution. Traditionally used to balance smoothness and closeness to data, regularization has recently also been used to estimate monotone functions (Ramsay, 1998), to obtain simple approximations without artificial local extrema (Davies and Kovac, 2001) and to select parameters in linear regression (Tibshirani, 1996). In this paper the authors use a new penalty that penalizes departure from solving the ODE to make an otherwise difficult optimization problem much easier to solve. We have to thank the authors for not only providing an explicit algorithm but also for making their implementation publicly available.
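The penalty idea can be illustrated with a small numerical sketch. The code below is not the authors' implementation: it uses finite differences in place of their B-spline expansion, a simple exponential decay model dx/dt = −θx rather than the examples in the paper, and arbitrary constants.

```python
import numpy as np

def penalized_criterion(c, y, t, theta, lam):
    """Data misfit plus a penalty on the departure of the fitted curve c from
    the ODE dx/dt = -theta * x (finite-difference sketch of the paper's
    regularization idea; illustrative only)."""
    data_fit = np.sum((y - c) ** 2)
    dcdt = np.gradient(c, t)                 # crude derivative of the fit
    ode_penalty = np.sum((dcdt + theta * c) ** 2)
    return data_fit + lam * ode_penalty

# toy data from dx/dt = -2x, x(0) = 1, with additive noise
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 50)
x_true = np.exp(-2.0 * t)
y = x_true + 0.05 * rng.standard_normal(t.size)

# at the true trajectory the ODE penalty is near zero for theta = 2,
# so the criterion separates the true parameter from a wrong one
J_true = penalized_criterion(x_true, y, t, theta=2.0, lam=10.0)
J_wrong = penalized_criterion(x_true, y, t, theta=5.0, lam=10.0)
print(J_true < J_wrong)   # prints: True
```

Minimizing such a criterion jointly over the fitted curve and *θ*, with *λ* controlling fidelity to the ODE, is, schematically, what the generalized profiling procedure does with proper spline bases and a nested (profiled) optimization.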

One of many interesting questions is whether it is possible to assess the goodness of fit. If there were no noise at all we would just have to solve the ODE for the given set of parameters and to check whether the solution coincides with the data. With noise present and/or departures from the idealistic model this is more difficult. A set of parameters may be regarded as a good model if the residuals *r*_{i} look like noise, and one way of checking this is to look at their sums on different scales and locations,

*w*_{j,k} = (*k*−*j*+1)^{−1/2} Σ_{i=j}^{k} *r*_{i},

and to verify whether these are all sufficiently small, i.e. |*w*_{j,k}| ≤ *σ*√{2 log(*n*)}, where *σ* is some estimate of the noise level. Fig. 11 shows data from the FitzHugh–Nagumo ODE with an approximation from a slightly different model. Visual inspection shows hardly any lack of approximation; however, some *w*_{j,k} exceeded the threshold. In contrast, the corresponding solution for the true value *c*=3 would have been accepted by the multiresolution criterion.
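As a concrete (and hypothetical) rendering of this check, the sketch below scans dyadic intervals only, whereas the criterion as described uses all intervals; the threshold form *σ*√(2 log *n*) follows the text, and all other names are illustrative.

```python
import numpy as np

def multiresolution_check(r, sigma):
    """Normalized sums of the residuals r over dyadic intervals, flagged when
    they exceed sigma * sqrt(2 log n). A sketch of the multiresolution
    criterion described above; dyadic intervals stand in for the full set
    of intervals [j, k] to keep the scan fast."""
    r = np.asarray(r, dtype=float)
    n = r.size
    thresh = sigma * np.sqrt(2.0 * np.log(n))
    violations = []
    length = 1
    while length <= n:
        for start in range(0, n - length + 1, length):
            w = r[start:start + length].sum() / np.sqrt(length)
            if abs(w) > thresh:
                violations.append((start, length, w))
        length *= 2
    return violations

# residuals that are identically zero pass; a sustained local shift is flagged
flat = np.zeros(256)
bump = np.zeros(256)
bump[:64] = 0.5
print(len(multiresolution_check(flat, sigma=1.0)),
      len(multiresolution_check(bump, sigma=1.0)))   # prints: 0 1
```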

It is not quite clear to me whether we should calculate the residuals with respect to the solution to the ODE by using the parameter estimates or the approximations from the regularization problem. These functions may differ considerably if *λ* is small. How do we then interpret the parameter estimates given that the data do not follow the trajectory of the ODE? Do we estimate parameters at all?

Another challenging problem is how to deal with possible changes over time. Is there one global set of parameters that provides a good model for all of the data? And, if this is not possible, how would one estimate the parameters locally? A partial answer may be given again by the multiresolution criterion that was sketched above. We could try to devise an algorithm that aims to find parameter values such that as many of the coefficients *w*_{j,k} as possible are below the threshold. If for any set of parameters there are coefficients *w*_{j,k} which exceed the threshold, a local version needs to be determined. Fig. 12 shows another 401 data points simulated from the FitzHugh–Nagumo ODE where the parameters changed after the first half, but where the solution was calculated globally by using the true parameters from the first half. For *V* all *w*_{j,k} with *k*≤152 were below the threshold and for *R* even all *w*_{j,k} with *k*≤174. Thus the multiresolution criterion clearly indicates that the approximation is adequate for at least the first 150 data points, but that a different approximation is needed for the second half.

Further questions include statements about rates of convergence, whether there is any use in a local choice of *λ* and whether *L*_{1}-penalties would offer any improvements when the functions have discontinuities. I am convinced that this paper will stimulate plenty of research and consequently I have great pleasure in proposing the vote of thanks.

**S. Olhede** (*Imperial College London*)

I congratulate the authors on their thought provoking contribution to the estimation of parameters of ordinary differential equations (ODEs). This is an important and currently much neglected area of statistics.

The main innovation of this paper is the attempt to combine various measures of misfit into a coherent likelihood framework, so that the parameters of a system of ODEs, denoted *θ*, can be estimated. For simplicity, in this discussion I take *N*=min_{i}(*N*_{i}). As the ODEs cannot be solved numerically for each posited value of *θ*, the solutions are approximated by using *B*-spline bases (see Varah (1982)), by

*x*_{i}(*t*) ≈ Σ_{k=1}^{K_{i}} *c*_{ik} *φ*_{ik}(*t*).
I am concerned that the authors do not give an automated criterion for the selection of *K*_{i}. Once the number of measurements increases (Mendes *et al.* (2003) already have used eight coupled ODEs) choosing *K*_{i} on the basis of a qualitative assessment for each output variable of the data set will become infeasible.
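For readers wanting to experiment with this basis-expansion step, a least squares B-spline fit can be set up as below; the knot placement and counts here are arbitrary choices (precisely the kind of qualitative *K*_{i} selection questioned above), and the sine test signal is an invented stand-in.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# noisy observations of a smooth trajectory (sine curve as a stand-in)
rng = np.random.default_rng(2)
t_obs = np.linspace(0.0, 4.0, 200)
y = np.sin(2.0 * t_obs) + 0.1 * rng.standard_normal(t_obs.size)

k = 3                                  # cubic B-splines
interior = np.linspace(0.5, 3.5, 15)   # arbitrary interior knots
knots = np.r_[[t_obs[0]] * (k + 1), interior, [t_obs[-1]] * (k + 1)]
spl = make_lsq_spline(t_obs, y, knots, k=k)

# spl.c holds the coefficients c_ik; spl(t) evaluates the expansion
resid = y - spl(t_obs)
print(spl.c.shape)   # prints: (19,)
```

Varying the number of interior knots (and hence *K*_{i}) and re-examining the residuals makes the sensitivity to this choice visible.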

I would like to note that methods of combining the measures of misfit will vary in suitability dependent on the inference problem that is attacked. If we only seek to estimate *θ* then it does not matter whether the basis approximation converges as *N*→∞ but only that

becomes negligible with increasing *N*. If we are interested in *prediction* of the output variables, then this fact changes.

The most important component of the procedure is the choice of regularization parameters *λ*_{i}, and the norms that are chosen for the data and model misfits, which are denoted by *α* and *γ* respectively. For large *N* with *α*=2 some further remarks can be made. Unless the data are very strongly correlated in time, *H*(*θ*|*λ*)=*O*(*N*). I point out that *K*_{i}=*K*_{i}(*N*)=*O*(*N*) and *λ*_{i}=*λ*_{i}(*N*). To provide consistent large sample theory a condition involving some *δ*>0 must be imposed. The choice of *δ* determines the rate of convergence of the approximation to the true solution. To ensure that the approximation of **x**(*t*) becomes exact for increasing sample sizes, a condition such as

- (38)

To ensure the existence of a ‘good’ solution, we need to take *K*_{i}(*N*) sufficiently large. Arguments to confirm the existence of the solution for a specific *γ*, *λ*_{i}(*N*) and *K*_{i}(*N*) can be made. *λ*_{i}(*N*) might also be chosen to account for how informative (sensitive) *x*_{i}(*t*) is to *θ*.

The specification of *λ*_{i}(*N*) needs to be automated. For method 1 proposed on page 753,

- (a)
how do we know whether the first minimum is appropriate,

- (b)
can random variability due to the errors cause many minima and order mixing of minima and

- (c)
is there a strict theoretical justification for this procedure?

Method 2 proposed on page 754 is speculative and appears to underestimate the size of the regularization parameter. I think that, if a semiparametric model is appropriate, relevant assumptions must be made about the deviation of the derivatives of the sample paths from the ODE. Another possible approach to the problem is to combine the model misfit with the data misfit by using a Bayesian formulation of the problem; see Wahba (1978). In this case the variability of each sample path of **x**(*t*) needs to be modelled. Many alternatives to a Brownian coupled set of stochastic differential equations are available. Wahba and Wang (1990) have discussed issues with the usage of generalized cross-validation for the selection of regularization parameters. Certainly the choice of loss function should be approached with some care and should be linked to the inferential problem that is addressed. Neither proposed automated procedure was actually used for the simulated data or the nylon example.

Another issue which is glossed over by the authors is model checking. The distribution of the error terms determines *H*(*θ*|*λ*). The residuals should be checked for serial correlation, which appears to be present in the nylon data set residuals; see Fig. 13(a). Determining the second-order structure of *e*_{ij} is as important as specifying an appropriate choice of regularization parameter. A simple first-order autoregressive model seems to explain the serial correlation; see Fig. 13(b). The time sampling is not evenly spaced; hence models such as autoregressive processes may not always be appropriate. Non-parametric methods such as runs tests could be employed to test for serial correlation of the residuals; see for example Mood (1940).
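The two diagnostics mentioned here, a first-order autoregressive fit and a runs test, can be sketched as follows; the simulated residuals and all constants are illustrative, not the nylon data.

```python
import numpy as np

def lag1_autocorr(r):
    """Sample lag-1 autocorrelation of the residual series r."""
    r = np.asarray(r, dtype=float) - np.mean(r)
    return float(np.dot(r[:-1], r[1:]) / np.dot(r, r))

def runs_test_z(r):
    """Wald-Wolfowitz runs test on signs about the median: a strongly
    negative z means too few runs, i.e. positive serial correlation."""
    signs = np.asarray(r) > np.median(r)
    n1, n2 = int(signs.sum()), int((~signs).sum())
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (mu - 1.0) * (mu - 2.0) / (n1 + n2 - 1.0)
    return float((runs - mu) / np.sqrt(var))

# residuals simulated from a first-order autoregression with coefficient 0.7
rng = np.random.default_rng(3)
e = np.empty(500)
e[0] = rng.standard_normal()
for i in range(1, 500):
    e[i] = 0.7 * e[i - 1] + rng.standard_normal()

# the lag-1 estimate lands near 0.7 and z falls far below zero
print(lag1_autocorr(e), runs_test_z(e))
```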

There are issues with usage of profile likelihood: see the discussion in Berger *et al.* (1999), and note that unfortunate ‘ridge maximization’ may ensue. With the observed ripples in the likelihood such effects may lead to unfortunate properties of the procedure. Some care must therefore be taken with the profile maximization.

I have outlined some questions with regard to the performance of the methods proposed. A very adventurous step has been taken by the authors to construct a coherent likelihood framework for inference of systems of ODEs. Numerous modelling, consistency and fitting issues remain, as inevitably will be the case when boldly embarking on a new area of inference: I am very pleased to join Dr Arne Kovac in thanking the authors for their innovative and challenging paper.

The vote of thanks was passed by acclamation.

**Steven M. Boker** (*University of Virginia, Charlottesville*)

I congratulate Ramsay and his colleagues on a stimulating paper that addresses a problem that has been long considered important. Eighty years ago Hotelling (1927) wrote of the difficulty of estimating differential equations in the presence of error. When data are sampled from real systems and a model is estimated, this error can be divided into at least three parts: a part that is associated with the measurement instrument itself, a part that is associated with exogenous influences which propagate in time and a part that is associated with inadequacy of the model to account for the relationships between the time derivatives of the system. Separation of these sources of error from signal while simultaneously estimating parameters of a system and providing goodness-of-fit estimates for the chosen model allows model comparison. These goals are particularly problematic when the system is non-linear and realizations of the system may diverge exponentially.

Some widely used methods for parameter estimation of differential equations are variants of Kalman filtering or Kalman smoothing (Kalman, 1960; Molenaar and Newell, 2003) and methods from stochastic differential equations (Itô, 1951; Bergstrom, 1966; Singer, 1993). These forward prediction methods operate on the integral form and thus require analytic solutions to the chosen system of differential equations. One advantage of the method that is outlined in this paper is that it estimates the parameters of the differential equations directly and thus does not require the analytic solution, which for non-linear systems may be unknown. A second interesting feature of the method is that it allows separate cost functions for the equation and error parameters.

There are three practical problems that I see arising when using the approach of Ramsay and his colleagues. The first concerns the choice of the smoothing complexity parameter *λ*, some potential solutions to which the paper covers. The second problem is that it is unclear how the separation of time-independent and time-dependent error is to be accomplished such that solution uniqueness is obtained given that the smoothness *λ* must also be chosen. Perhaps a latent variable form of the differential equation in question could be specified if multivariate indicators were available for each variable (Boker *et al.*, 2004). The third problem arises when the model structure is unknown: by what metric are we to perform model comparison given the flexibility of this method? Some penalty for lack of parsimony might unify solutions to these three problems. I do not see these problems as insurmountable and I hope that Ramsay and colleagues will consider them in hopes of widening the applicability of their interesting work.

**Leonard Smith** (*London School of Economics and Political Science*)

The paper is an important contribution to parameter estimation in non-linear systems of ordinary differential equations. We lack a general coherent theory here, despite important applications ranging from the small scale industrial processes that are discussed in the paper to informing decision support in climate change (Stainforth *et al.*, 2005). I thank the authors for the chance to suggest links between their work and approaches from non-linear dynamics, as the geometric–dynamics view provides a complementary perspective on parameter estimation which might allow

- (a)
better estimation when the model structure is exact and the non-linearities are non-trivial,

- (b)
improvement in model structure when it is known to be imperfect and

- (c)
clarification of the role of stochastic dynamics.

Imperfections in the model structure reopen the question of which parameter values should be used when elements of the parameter vector *θ* *are* known from ‘theoretical considerations of other sources of information’. Even in artificial cases where the model structure and the observational noise model are known exactly, traditional approaches like least squares are likely to prove unsatisfactory, as even normally distributed input uncertainties yield outputs under the model which are not normally distributed (Judd, 2007; McSharry and Smith, 2004).

In practice we are never in that perfect model scenario; the goal of parameter estimation, and indeed state estimation, is not only unclear but also unlikely to have a single well-posed definition (Smith, 2000; Judd and Smith, 2004). Focusing on information from the dynamics rather than focusing on the statistics abandons one notion of optimality for the goal of improved consistency. One simply asks whether the model admits trajectories that are consistent with the observations (Judd *et al.*, 2004). The distribution of the durations of shadowing orbits allows parameter estimation, provides a structured approach to estimating Ramsay's *λ* and locates regions of the model state space where the system dynamics are systematically inconsistent with those of the model (McSharry and Smith, 2004). When shadowing trajectories cannot be found, we can examine the mismatch ‘errors’ of pseudo-orbits that are consistent with the observations. This has the dual aims of model improvement and of developing stochastic models which are more likely to yield useful trajectories (Judd and Smith, 2004). These models are not, however, the traditional form of stochastic models: the innovations reflect the geometric failings of the model flow in model state space and aim to allow for the attracting manifolds that are common in non-linear dissipative models. Ideally the innovation distribution will be state dependent and perhaps path dependent. The clear formulation of such truly non-linear stochastic models which respect the geometrical dynamics of the model and observations of the underlying system would prove of great value in refining the models that Ramsay and his colleagues now provide us with.

**Steven Gilmour** (*Queen Mary, University of London*)

Parameter estimation for differential equations is a topic of enormous importance and applicability, which requires much more attention from statisticians. I welcome this paper which addresses the problem from one particular viewpoint, which seems to work rather well. My own interests are in the design of experiments which will enable the parameters to be estimated efficiently.

At a simple level, a design could be chosen which optimizes some function of , as given in equation (24). Usually we would have to integrate over prior distributions for *θ* and *λ*, so this is a far from trivial task.

However, it is important that we get the basics correct. The classical principles of good design have a role to play in complex experiments, which is at least as important as their role in simple text-book experiments. The tank reactor experiment that is described in Section 1.2.2 and illustrated in Fig. 3 is typical of many experiments on dynamical systems. It is not obvious even how to describe it in classical terms. We need to identify

- (a)
the treatments—combinations of the levels of six factors, *F*_{in},*C*_{in},*T*_{in},*T*_{co},*F*_{co} and the base-line *T*_{co},

- (b)
the experimental units—these seem to be runs of the process of length *t*=4—and

- (c)
the responses from each experimental unit—time series of *C* and *T*.

Then the design, using a standard coding, is shown in Table 3.

Table 3. Design of the tank reactor experiment

| *Base* | *F*_{in} | *C*_{in} | *T*_{in} | *T*_{co} | *F*_{co} | *Base* | *F*_{in} | *C*_{in} | *T*_{in} | *T*_{co} | *F*_{co} |
|---|---|---|---|---|---|---|---|---|---|---|---|
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| −1 | −1 | 0 | 0 | 0 | 0 | 1 | −1 | 0 | 0 | 0 | 0 |
| −1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| −1 | 0 | −1 | 0 | 0 | 0 | 1 | 0 | −1 | 0 | 0 | 0 |
| −1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| −1 | 0 | 0 | −1 | 0 | 0 | 1 | 0 | 0 | −1 | 0 | 0 |
| −1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| −1 | 0 | 0 | 0 | −1 | 0 | 1 | 0 | 0 | 0 | −1 | 0 |
| −1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| −1 | 0 | 0 | 0 | 0 | −1 | 1 | 0 | 0 | 0 | 0 | −1 |
| −1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| −1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

This design is poor on several counts: there is no randomization, no sensibly chosen replication, no blocking (so long-term drifts will have systematic effects), no protection of experimental units (it might be sensible to exclude the first part of the time series on each unit), no use of the factorial treatment structure (so interactions cannot be estimated) and a failure to recognize multiple strata (base-line *T*_{co} is a whole-plot factor). Such a poor design would be useless even for simple responses.

Also important are the implications of the design for the analysis. The concept of experimental units is always meaningful and implies a discreteness, even in dynamical systems. Each time that the system is disturbed by changing the level of a factor a new, discrete, error is introduced (e.g. through small uncertainties in setting the levels) and so the model should contain random unit effects.

The following contributions were received in writing after the meeting.

**Caroline Bampfylde** (*University of Alberta, Edmonton*)

I thank Ramsay, Hooker, Campbell and Cao for their contribution to the practicalities of fitting dynamical models to data and estimating model parameters. This is a task which is commonly encountered by applied scientists and the rigorous solution technique that is provided by this manuscript is most welcome.

Although Ramsay and co-authors make efficient use of matrix algebra to simplify the calculation of the derivatives in Appendix A, the resulting formulae that are presented seem to be overly complicated. I am concerned that the implementation of their methods is non-trivial, especially for many applied scientists whose interest is in the results and their application rather than the details of the method. However, I do applaud the publishing of on-line materials providing open source software and numerical code to facilitate the implementation of their techniques. It appears that the Web site http://www.functionaldata.org needs to be updated to reflect the new statistical techniques and to present some examples that users can then modify to fit to their own problems.

I should like to end my discussion with thoughts about the wider applications of the authors’ techniques. The methods have thus far been applied to systems of ordinary differential equations. Would it be possible to consider the extension to discrete time dynamical systems such as systems of difference equations? In my research I have to deal with dynamical systems both continuous and discrete in nature and a consistent technique for parameter estimation would be very useful. Have the authors considered the application to partial differential equations and integrodifferential equations which are regularly used for spatial problems? Any further extensions or generality that can be derived from their methods would be a great addition to the parameter estimation toolbox.

My thanks go to the Research Section of the Royal Statistical Society, for the opportunity to contribute to the discussion of this important paper.

**Lorenz Biegler** (*Carnegie Mellon University, Pittsburgh*)

It is a pleasure to comment on this paper. I found this paper very informative and useful and my comments are mostly from an optimization algorithm perspective. The approach that is mentioned in the paper complements strategies for dynamic optimization but specializes them with interesting statistical concepts and problem formulations.

On page 750, using the total variation penalty in equation (12) may be advantageous, although it is not used in the analysis or the examples. Although the necessary smoothness conditions are absent, the finite dimensional analogue to equation (12) is actually preferred over equation (11) because only a finite value of *λ* is needed to satisfy theorem 2. Also, equation (11) has the disadvantage that *λ* must approach ∞ to force PEN(·) to zero, thus leading to severe ill conditioning in the optimization. Some discussion on these numerical aspects (and possible improvements) can be found in chapter 17 of Nocedal and Wright (2006).

Sections 3 and 4 contain excellent examples that illustrate the benefits of the approach in Section 2 and also show how they apply to real world data. In the second paragraph of page 760, it should be mentioned that a Runge–Kutta *initial* value algorithm was used. The failure for this unstable system is due to this single shooting approach. Instead, if the initial values were replaced with corresponding (dichotomous) boundary conditions, and the solver replaced by a corresponding boundary value solver (e.g. COLSYS or COLDAE; see Ascher and Petzold (1998)), the problem should also solve easily, just as the principal differential analysis method does. A method to do this along with a pathological parameter estimation problem is given in Tanartkit and Biegler (1995, 1996).

Section 5 is very useful in exploring future topics. More detail could be added in several areas. The exploration of differential algebraic equations (DAEs) has been done for some time and DAE systems have now been well studied and understood for parameter estimation. Ascher and Petzold (1998) provided a comprehensive discussion and summary of these systems. Many practical systems can be written as index 1 DAEs (or can be reformulated as index 1 DAEs). For these, much of the discussion in this paper could be extended directly.

Partial differential equation constrained optimization enjoys considerable current research attention and several approaches have been explored that are relevant to principal differential analysis. The authors might find Biegler *et al.* (2003) useful. Finally, for future work I think that the greatest potential of this approach is for stochastic systems. I look forward to further developments in this area with this approach.

**Emery N. Brown** (*Massachusetts Institute of Technology, Cambridge, and Harvard Medical School, Boston*)

It remains to be clearly established what the methods of Ramsay and colleagues add to current methods for differential equation model analyses.

The Ramsay analyses provide no comparisons with existing methods for analysing differential equation models. Therefore, the current work tells neither the dynamicist nor the statistician what if any improvements the new approach brings. For example, if a likelihood-based analysis had been applied to the nylon production problem as was done for the circadian data in Brown (1987) and Brown *et al.* (2000) what improvements would the new methods have provided? The likelihood-based analysis that was used in those references estimated non-linear dynamical systems models from observations with very strong serial dependence, computed confidence intervals for parameter estimates and demonstrated that estimation with approximate and exact solutions of the differential equation system gave similar answers.

Dynamical systems often have specific properties such as Hopf bifurcations, limit cycles and chaotic dynamics. Inferring these specific properties from experimental data is a fundamental question in dynamical systems analyses (Czeisler *et al.*, 1989, 1999; Diks, 1999). The authors give no evidence to show that their approach would allow dynamicists to determine whether particular types of dynamic behaviour can be more reliably determined from experimental data analyses by using their methods compared with current methods.

Another fundamental question in many dynamical systems analyses in neuroscience is how to model the stochastic features of a given neural system. The smoothness constraint should reflect specific hypotheses about the stochastic features of the dynamical system. The smoothness constraint in the Ramsay analyses represents an explicit (mathematically convenient) assumption about the stochastic features of the dynamical system. It does not relate to any specific hypothesis about the physical, chemical or biological origins of the stochastic features of the systems that are studied in their examples.

A dynamical system with noise in its observation process and/or its system equation falls naturally into the state space and the partially observed systems framework. The authors miss an important opportunity to relate their work to these established paradigms.

The FitzHugh–Nagumo example does not provide a true illustration of the issues that computational neuroscientists address in relating dynamical systems models to experimental data. Time courses of actual subthreshold membrane voltage potentials of single neurons (what the FitzHugh–Nagumo model is intended to characterize) are recorded by many neurophysiologists. Estimating the dynamic properties of these data is a challenging problem being investigated by many computational neuroscientists (Koch, 2001). Would the methods of Ramsay and colleagues outperform current methods in the study of this problem?

**Sy-Miin Chow and Stacey S. Tiberio** (*University of Notre Dame*)

The authors are to be congratulated for providing a comprehensive treatment of using smoothing methods to fit non-linear ordinary differential equation models. We particularly like the proposed approach's ease of use with irregularly spaced discrete time observations, and the authors’ discussion on its diagnostic utility. We believe that the method proposed can be effectively integrated with recent advances in fitting non-linear, non-Gaussian state space models. In particular, we ask the authors to consider a non-linear continuous time state space model of the form

- d*x*_{i}(*t*) = *f*_{i}{**x**(*t*), *θ*} d*t* + d*w*_{i}(*t*), (39)

- *y*_{j} = *h*{**x**(*t*_{j})} + *e*_{j}, (40)

where *f* is a non-linear drift function, *h* is a (possibly) non-linear measurement function, *w*_{i}(*t*) is a Wiener (or possibly other dynamic noise) process and *e*_{j} is a vector of measurement errors.

If basis function expansion is used to obtain smoothed estimates of equation (39), the log-likelihood function *H*(*θ*,*σ*|*λ*) can then be written as a function of the innovations. Along a similar line, *H*(*θ*,*σ*|*λ*) and the penalty function can then be used as the basis for assessing misfits stemming from the dynamic model and the measurement model (equations (39) and (40)) respectively. In addition, we do see some merits in incorporating process noise in the dynamic model in equation (39) in addition to allowing for non-Gaussian measurement processes (e.g. Poisson processes; Durbin and Koopman (2001) and Fahrmeir and Tutz (1994)) in equation (40). For instance, serially independent measurement errors can play a very different role from that of dynamic noises that do show continuity over time. More research along this line is certainly warranted. Some of the recent continuous time adaptations of Monte Carlo techniques (e.g. Beskos *et al.* (2006) and Särkkä (2006)) may also be a helpful alternative or addition to the generalized smoothing approach.
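A minimal simulation of such a continuous-time state space model, drift plus Wiener process noise observed with measurement error at irregular times, might look as follows; the Euler–Maruyama scheme, the logistic drift and all constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Hypothetical nonlinear drift (logistic growth); not from the paper."""
    return x * (1.0 - x)

# Euler-Maruyama: x_{t+dt} = x_t + f(x_t) dt + sigma sqrt(dt) N(0, 1)
dt, n_steps, sigma = 0.01, 1000, 0.02
x = np.empty(n_steps)
x[0] = 0.1
for i in range(1, n_steps):
    x[i] = x[i - 1] + f(x[i - 1]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

# observe the path at sparse, irregular times with measurement error e_j
obs_idx = np.sort(rng.choice(n_steps, size=40, replace=False))
y = x[obs_idx] + 0.05 * rng.standard_normal(obs_idx.size)
print(y.shape)   # prints: (40,)
```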

Our remaining comments are mainly questions to help to pave future extensions along this line. We wonder whether the authors can comment on the relationship between the complexity of the basis functions and the choice of the smoothing parameter *λ*. Specifically, how does the role of *λ* change if the numbers of knot points and basis functions that are used are overfitting compared with underfitting the data, especially when sample sizes are small to moderate? Furthermore, if an ordinary differential equation model is fitted to data with mild process noise, can the model misspecification be partially compensated by, for example, using more basis functions to construct the smoothed estimates of **x**(*t*)?

**Sophie Donnet and Adeline Samson** (*University Paris Descartes*)

When a biological or physical process is measured, the regression function of the statistical model describing the observed data often derives from dynamic systems based on ordinary differential equations (ODEs). The differential system often does not have any analytical solution, leaving only the combination of estimation procedures and discretization schemes to solve the ODE. As an alternative to addressing the various problems that are involved in these methods—computational time and stability—the authors suggest an original and efficient solution. Their method relies on a basis function expansion of the dynamic process and then consists of data fitting and an equation fidelity criterion combined in a penalized log-likelihood.

We shall now stress the numerous qualities of the method. First, the fact that no discretization scheme is used to solve the problem makes it possible to consider boundary or/and distributed data problems and, most of all, side-steps the instability problems that are involved with non-continuous input functions. These discontinuous functions are common in biology or physics and constitute a major limit to the use of classical discretization schemes such as Euler or Runge–Kutta schemes in estimation algorithms. Moreover, this method seems robust to the starting parameter values, which is often a concern with a non-linear least squares approach. Furthermore, the authors provide explicit expressions for the derivatives, allowing the use of an efficient Gauss–Newton algorithm. Finally, one of the major advances of this paper is the fact that it provides accurate estimations of the confidence intervals of the estimated parameters.

Obviously, this work opens many new perspectives in the active research field of the estimation in ODE models. As stressed by the authors, many extensions can be considered. Firstly, in biology, experimental studies often consist of repeated measurements of a biological criterion obtained from a population of subjects. The statistical parametric approach that is commonly used to analyse these data is mixed models. The extension of the estimation method that is proposed by Ramsay and his colleagues to mixed models would be an interesting alternative to classical methods that are based on discretization schemes. Secondly, it would be of considerable interest to develop such a method for stochastic differential equations, which are a natural extension of the models that are defined by ODEs, as it allows taking into account errors that are associated with misspecifications and approximations in the dynamic system.

**Michael Dowd** (*Dalhousie University, Halifax*)

My congratulations go to the authors for their interesting and topical study. Rigorous statistical examination of estimation problems for systems that are governed by differential equations (DEs) is important and timely. Such models are the theoretical foundation for many scientific fields and the synthesis of dynamical models and data is a pressing issue, e.g. for data assimilation (Lewis *et al.*, 2006).

This study offers a unique approach to parameter estimation for DEs by using a weak constraint formulation and exploiting the functional nature of the system state. The ‘parameter cascade’ appears an effective strategy for estimating different types of parameter. It offers a viable alternative to non-linear regression (Thompson *et al.*, 2000).

The paper also emphasizes the importance of identifying efficient and effective methods for parameter and state estimation for stochastic (and partial) DEs. An approach that supports these extensions directly is the state space model. It treats partially observable non-linear stochastic dynamics and multivariate non-Gaussian observations according to

*x*_{t}∼*p*(*x*_{t}|*x*_{t−1},*θ*),

*y*_{t}∼*p*(*y*_{t}|*x*_{t},*φ*).
The first equation describes the Markovian transition of the state *x*_{t}, with parameters *θ*. This corresponds to stochastic dynamic prediction using discretized DEs, i.e. *x*_{t}=*f*(*x*_{t−1},*n*_{t},*θ*). Observations *y*_{t} can be related to *x*_{t} through a non-linear measurement operator with *φ* being parameters of the measurement distribution. In Dowd (2006), I applied such a model for complex non-linear dynamical systems to recover state and dynamic parameters for a system which regularly transitioned across a bifurcation point.

The problem that is considered by this paper is the estimation of static parameters. Given the observation set *y*_{1:T}=(*y*_{1},…,*y*_{T}), Bayes's theorem yields the target density for the state, *p*(*x*|*y*_{1:T},*θ*,*φ*). This can be computed with sampling-based sequential Monte Carlo (MC) techniques (Künsch, 2005; Godsill *et al.*, 2004).

Unknown parameters *θ* and *φ* can then be determined by maximizing the likelihood (see Kitagawa (1996)):

where the latter approximation relies on samples from the predictive density generated via sequential MC sampling. The resultant likelihood is affected by MC sampling variability and challenges optimizers; incorporation of kernel density estimation appears useful (de Valpine, 2004).

Computationally, application of these MC approaches to higher dimensional dynamic systems is a major challenge. Ideas based on dynamical analysis (Chorin and Krause, 2004) and effective approximations, e.g. the ensemble Kalman filter (Evensen, 2003), appear promising. I have compared some of these methods for non-linear dynamic systems (Dowd, 2007). It would be interesting to compare these further with the parameter estimation method of this paper, extended to the case of stochastic DEs.
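The sequential MC likelihood evaluation that is described above can be sketched for a toy scalar state space model; the model, parameter values and function names below are illustrative assumptions, not taken from the discussion. A bootstrap particle filter propagates particles through the Markov transition, weights them by the measurement density and accumulates the likelihood factors *p*(*y*_{t}|*y*_{1:t−1}):

```python
import math
import random

def particle_loglik(y, theta, phi, n_part=2000, seed=1):
    """Bootstrap particle filter estimate of log p(y_{1:T} | theta, phi) for
    the toy model x_t = theta*x_{t-1} + n_t, y_t = x_t + e_t, with
    n_t ~ N(0, 1) and e_t ~ N(0, phi^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_part)]
    loglik = 0.0
    norm = 1.0 / (phi * math.sqrt(2.0 * math.pi))
    for yt in y:
        # propagate particles through the Markovian transition p(x_t | x_{t-1}, theta)
        x = [theta * xi + rng.gauss(0.0, 1.0) for xi in x]
        # weight by the measurement density p(y_t | x_t, phi)
        w = [norm * math.exp(-0.5 * ((yt - xi) / phi) ** 2) for xi in x]
        # the mean weight estimates the likelihood factor p(y_t | y_{1:t-1})
        loglik += math.log(sum(w) / n_part)
        # multinomial resampling keeps the cloud in high-likelihood regions
        x = rng.choices(x, weights=w, k=n_part)
    return loglik
```

Maximizing this estimated likelihood over (*θ*,*φ*) runs into exactly the MC sampling variability mentioned above, which is why smoothing of the likelihood surface can help.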

**David J. D. Earn** (*McMaster University, Hamilton*)

Estimation of parameters of non-linear differential equations from noisy, observed time series is a problem that arises frequently in applied science. Unfortunately, anyone who has tried this is likely to be familiar with serious theoretical and computational challenges. The new method of Ramsay and colleagues is very welcome, and it will be interesting to see how it fares on a wide range of problems.

Note that *I* records the *prevalence* of the disease, i.e. the number of individuals who are currently infected. We typically observe *incidence*, i.e. ∫*βSI* d*t*, where the integral is over the reporting interval (typically weekly or monthly, but sometimes daily).

For human diseases that have been present in the population for years, we typically have estimates of all the parameters from data other than time series of reported cases or deaths. Moreover, the SIR model as formulated in equation (41) has a globally asymptotically stable equilibrium, so we can easily compare the predicted equilibrium with the observed time series (without the aid of Ramsay and colleagues).

The catch is that the transmission rate *β* is rarely constant in practice. Instead, *β* often varies seasonally, either because of seasonally changing aggregation patterns (London and Yorke, 1973) or other seasonal factors that may be difficult to pin down (Dushoff *et al.*, 2004). Seasonal forcing drastically changes the dynamics of the SIR model, often leading to co-existing stable cycles (Schwartz and Smith, 1983) or chaos (Schaffer, 1985). Since the conclusions that we draw depend strongly on the estimated amplitude of seasonal forcing (Earn *et al.*, 2000; Bauch and Earn, 2003), we need a credible way of estimating time variation in *β* and we rarely have useful data to work with other than incidence time series. The method of Ramsay and colleagues is begging to be applied to this problem and I look forward to comparing the results that are inferred from it and from previous methods (Fine and Clarkson, 1982; Ellner *et al.*, 1998; Finkenstädt and Grenfell, 2000; Bjornstad *et al.*, 2002; Wallinga and Teunis, 2004).
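The effect of seasonal forcing on SIR dynamics can be illustrated numerically; the parameter values, the initial conditions and the sinusoidal form of *β*(*t*) in the sketch below are illustrative assumptions only, not quantities from the discussion:

```python
import math

def sir_incidence(beta0, alpha, gamma=26.0, mu=0.02, years=20,
                  steps_per_year=5200):
    """RK4 integration of the seasonally forced SIR model
        dS/dt = mu - beta(t)*S*I - mu*S,
        dI/dt = beta(t)*S*I - (gamma + mu)*I,
    with beta(t) = beta0*(1 + alpha*cos(2*pi*t)).  S and I are fractions of
    the population, time is in years, and the function returns the weekly
    incidence, i.e. the integral of beta*S*I over each reporting week."""
    def rhs(t, s, i):
        b = beta0 * (1.0 + alpha * math.cos(2.0 * math.pi * t))
        return mu - b * s * i - mu * s, b * s * i - (gamma + mu) * i
    h = 1.0 / steps_per_year
    per_week = steps_per_year // 52
    s, i, t = 0.06, 0.001, 0.0
    incidence, acc = [], 0.0
    for n in range(years * steps_per_year):
        k1 = rhs(t, s, i)
        k2 = rhs(t + h / 2, s + h / 2 * k1[0], i + h / 2 * k1[1])
        k3 = rhs(t + h / 2, s + h / 2 * k2[0], i + h / 2 * k2[1])
        k4 = rhs(t + h, s + h * k3[0], i + h * k3[1])
        # accumulate the incidence integral with a left-endpoint rule
        acc += beta0 * (1.0 + alpha * math.cos(2.0 * math.pi * t)) * s * i * h
        s += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        i += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        t += h
        if (n + 1) % per_week == 0:
            incidence.append(acc)
            acc = 0.0
    return incidence
```

With *α*=0 the weekly incidence settles towards a constant, whereas even a modest seasonal amplitude produces sustained within-year variation, which is why the estimated amplitude of forcing matters so much for the conclusions drawn.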

Finally, it is worth mentioning a vexing issue that has the potential to undermine parameter estimation for differential equation models of disease spread. The process of infectious disease transmission is fundamentally stochastic. Solutions of the SIR model (41) can be thought of as ensemble means of the true stochastic process (Kurtz, 1980), but any incidence time series represents only one realization of that stochastic process and may not accurately reflect the mean. In the specific context that I have highlighted—estimating a seasonal forcing function—this problem may not be serious if we have data covering many seasons, but it is worth bearing in mind.

**Stephen P. Ellner** (*Cornell University, Ithaca*)

But for practical acceptance I believe that selection of the smoothing parameter *λ* must be automated on a defensible basis. The profiling criterion immediately suggests cross-validation. Straight leave-one-out methods are computationally infeasible for end-users (though computer and algorithmic improvements may change this situation), but we can still use the principle of predicting something that was not used in fitting. Dynamic models predict the future, so we can evaluate them on the basis of forecasting accuracy. Let *φ*_{t}(*x*_{0};*θ*) be the model solution at time *t* starting from *x*(0)=*x*_{0}. A measure of prediction error at time interval *τ* is

- (42)

PE should be large if the fit undersmooths or oversmooths the data, either way throwing off parameter estimates. I tried this criterion on the FitzHugh–Nagumo system (modifying MATLAB code that was provided by Hooker), with the omitted term being an additive perturbation to d*V*/d*t* (Fig. 14(a)) that changes the period of the oscillations (Fig. 14(b) *versus* Fig. 14(c)). Fitting five artificial data sets by profiling with a range of *λ*-values, PE selects a range of *λ*-values that is good for parameter estimation (Figs 14(d) and 14(e)). Profiling with a ‘good’ *λ* performs comparably with two-step methods in which the data are smoothed without regard to the model, and the ordinary differential equation is then fitted to the smooth or its time derivative; with a ‘bad’ *λ* profiling is less successful. Profiling's big advantage over two-step methods is that it does not need data on all state variables but, as this small example indicates, success may depend on choosing *λ* well.
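The forecasting criterion can be sketched for the FitzHugh–Nagumo system of example 3.1; the fixed-step RK4 integrator and the exact form of the sum of squared forecast errors below are assumptions for illustration and may differ from expression (42):

```python
import math

def fn_rhs(v, r, a, b, c):
    """FitzHugh-Nagumo right-hand sides, as in example 3.1 of the paper."""
    return c * (v - v ** 3 / 3 + r), -(v - a + b * r) / c

def forecast(v0, r0, tau, params, h=0.01):
    """phi_tau(x_0; theta): the model solution run forward by tau from the
    state (v0, r0), computed with fixed-step RK4."""
    v, r = v0, r0
    for _ in range(int(round(tau / h))):
        k1 = fn_rhs(v, r, *params)
        k2 = fn_rhs(v + h / 2 * k1[0], r + h / 2 * k1[1], *params)
        k3 = fn_rhs(v + h / 2 * k2[0], r + h / 2 * k2[1], *params)
        k4 = fn_rhs(v + h * k3[0], r + h * k3[1], *params)
        v += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        r += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return v, r

def prediction_error(times, v_data, states, params, tau):
    """Sum of squared tau-ahead forecast errors on the observed V coordinate:
    each smoothed state x_hat(t_i) in `states` is run forward by tau and
    compared with the datum at t_i + tau."""
    dt = times[1] - times[0]
    k = int(round(tau / dt))
    pe = 0.0
    for i in range(len(times) - k):
        v_pred, _ = forecast(states[i][0], states[i][1], tau, params)
        pe += (v_data[i + k] - v_pred) ** 2
    return pe
```

On noiseless trajectories PE vanishes at the true parameters and grows when the parameter estimate is off, so minimizing PE over *λ* penalizes both undersmoothing and oversmoothing of the state estimates.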

**Chong Gu** (*Purdue University, West Lafayette*)

The authors are to be congratulated for a fine paper on a challenging problem. As shown in the paper, fitting data to models derived from ordinary differential equations (ODEs) involves numerous issues such as the numerical strategies and the methodological framework, and it is the methodological aspects that we shall comment on.

First let us attempt a crude parallel between the setting of the paper and a standard cubic spline as the minimizer of

- Σ_{i}{*y*_{i}−*x*(*u*_{i})}^{2}+*λ* ∫ (d^{2}*x*/d*u*^{2})^{2} d*u*.   (43)

Setting *λ*=∞ in expression (43) forces d^{2}*x*/d*u*^{2}=0, which characterizes a *static system*, with the solution of the form *x*(*u*)=*c*_{1}+*c*_{2}*u*, where (*c*_{1},*c*_{2}) are to be determined by the data (*y*_{i},*u*_{i}) through least squares; if precise readings of (*u*,*x*) are available from *x*(*u*)=*c*_{1}+*c*_{2}*u*, we need only two pairs of ‘initial values’ to pin down (*c*_{1},*c*_{2}). Likewise, replacing pen(*x*)=∫(d^{2}*x*/d*u*^{2})^{2} d*u* in expression (43) by pen(*x*)=∫(d^{2}*x*/d*u*^{2}+*ω*^{2}*x*)^{2} d*u* yields an *L*-spline, and setting *λ*=∞ then forces d^{2}*x*/d*u*^{2}+*ω*^{2}*x*=0 with the solution of the form *x*(*u*)=*c*_{1} sin (*ω**u*)+*c*_{2} cos (*ω**u*). Compare these with

- Σ_{i}{*y*_{i}−*x*(*t*_{i})}^{2}+*λ* ∫ {d*x*/d*t*−*f*(*x*,**u**,*t*|*θ*)}^{2} d*t*,   (44)

where for simplicity we consider only a single ODE. The main difference between expressions (43) and (44) is the time variable *t* in expression (44) and the implicit dependence of *x* on **u**. The system parameter *θ* is absent for the cubic spline and is the frequency *ω* for the *L*-spline. Setting *λ*=∞ in expression (44) forces d*x*/d*t*−*f*(*x*,**u**,*t*|*θ*)=0 with the solution of the form *x*_{u}(*t*;*θ*,**c**), say, and the parameters *θ* and **c** may be fixed via least squares as in expression (44) or through alternative ‘initial values’.

As crude as the parallel is, it sheds light on the roles of various components in the proposed setting. For data smoothing via expression (43) or the like, the stochastic structure of the data is typically well specified, whereas the roughness penalty pen(*x*) is virtually an afterthought, mainly to provide ‘stability’ to the end results, and one is more than willing to ‘warp’ the function away from the ‘null model’ characterized by pen(*x*)=0 to fit the data. For solutions to the dynamic systems, however, the roles of goodness of fit and ‘roughness penalty’ seem to be reversed, with fidelity to the ODE the major concern and the ‘error distribution’ of the data an afterthought. With such an understanding, automatic *λ*-selection via cross-validation may not be the most appropriate for expression (44); cross-validation was designed to minimize the estimation error for data smoothing, in the setting of expression (43) with *y*_{i}=*x*(*u*_{i})+*ɛ*_{i}. Instead, a manual selection of *λ* that keeps ∫{d*x*/d*t*−*f*(*x*,**u**,*t*|*θ*)}^{2} d*t*≤*ρ*, say, for some prespecified tolerance level *ρ*, might be more appropriate.
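The *λ*=∞ limit of the *L*-spline can be checked numerically with a crude finite difference discretization of the penalty; this sketch, with its grid-based quadrature and illustrative parameter values, is an assumption for demonstration rather than the basis expansion construction in the paper:

```python
import numpy as np

def l_spline(y, u, omega, lam):
    """Penalized least squares fit minimizing
        sum_i {y_i - x(u_i)}^2 + lam * int (d^2x/du^2 + omega^2 x)^2 du,
    with the penalty discretized by second differences on the (equally
    spaced) observation grid and the integral by a Riemann sum."""
    y = np.asarray(y, dtype=float)
    n = len(u)
    h = u[1] - u[0]
    # rows of L approximate x'' + omega^2 * x at the interior grid points
    L = np.zeros((n - 2, n))
    for j in range(1, n - 1):
        L[j - 1, j - 1] = 1.0 / h ** 2
        L[j - 1, j] = -2.0 / h ** 2 + omega ** 2
        L[j - 1, j + 1] = 1.0 / h ** 2
    A = np.eye(n) + lam * h * (L.T @ L)
    return np.linalg.solve(A, y)
```

As *λ* grows, the fit is forced towards the discrete null space of the operator, i.e. towards *c*_{1} sin (*ωu*)+*c*_{2} cos (*ωu*); with *λ*=0 the ‘fit’ simply interpolates the data.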

**John Guckenheimer** (*Cornell University, Ithaca*) **and****Joseph Tien** (*Fred Hutchinson Cancer Research Center, Seattle*)

A key issue in parameter estimation problems for differential equations is minimizing residual functions with optimization algorithms. As illustrated in Fig. 2, the graph of the residual as a function of the parameters may be so convoluted that smooth optimization algorithms that are based on quadratic models require initial parameter values that are very close to the optimal values. Ramsay and his colleagues smooth the residual by a spline fit, together with a penalty on discrepancies between the fitted curve and solutions of the differential equations.

Our work also introduces residual functions which involve penalties, but we focus on the relationships between qualitative properties of the differential equation solutions and the geometry of the response surface. Those relationships prompt us to propose new residual functions that incorporate geometric features of the dynamical system and simplify the landscape. Examples of these geometric features include periodic orbits, bifurcation boundaries and fast–slow decompositions of multiple-timescale solution trajectories.

This paper represents solutions of differential equations through their initial values. When these solutions depend sensitively on initial values or system parameters, the residual function has large gradients. This is evident in Fig. 2, showing a residual function for solutions to the FitzHugh–Nagumo equation fitted over approximately 2.5 periods of an oscillatory solution. Since the oscillation period varies with the system parameters, the residual is more sensitive when evaluated for longer time intervals. If the data to be fitted are at their asymptotic periodic state, we suggest fitting the periodic orbit of the model to the data instead of the solution of an initial value problem. This approach was developed by Casey (2004). Matching the period of a periodic orbit to its measured period is a step towards solving the parameter estimation problem. Furthermore, the ‘cliff’ in the response surface of Fig. 2 suggests that there is a bifurcation of the model at these parameter values. Bifurcation boundaries in the model form natural constraints of the ‘reasonable’ parameter region for fitting attractors to stationary data. We advocate using computations of bifurcation boundaries in this context.

In multiple-timescale systems, abrupt changes in solutions occur due to changes in the transitions between slow and fast segments of solutions. The geometry of fast–slow decompositions of solution trajectories can be used to define residual functions for both non-periodic and periodic solutions (Tien and Guckenheimer, 2007; Tien, 2007).

**Serge Guillas** (*Georgia Institute of Technology, Atlanta*)

I congratulate the authors for their paper. They have introduced a technique for the estimation of parameters for differential equations that is fast and precise. Unlike many smoothing situations, the large range (e.g. several orders of magnitude in the FitzHugh–Nagumo equations) of good *λ* is quite surprising. The analysis in which the authors examine the asymptotic behaviour of the estimates when *λ*→∞ is very helpful and rarely done in traditional smoothing settings. It would be interesting to study further the range of values of *λ* that give accurate estimates.

The authors mention Bayesian methods as an alternative to their method. In this framework, the numerical solution to the differential equation at each sample time point is assumed to be normally distributed, with the use of the Metropolis–Hastings algorithm. In the more general context of complex computer models, two approaches have been recently developed to take into account the functional form of the output better. For well-chosen designs for the parameters, and sufficient computing power, these methods are efficient and robust, in particular if there is no complete knowledge of the set of differential equations. Higdon *et al.* (2007) represent a functional output through a principal components analysis. Bayarri *et al.* (2007) considered a decomposition of the time series of outputs in a wavelet basis. Wavelets can easily model abrupt changes in the outputs. This could be helpful for a better understanding of certain types of solutions to differential equations. Calibration can then be directly carried out on the coefficients themselves following a traditional approach (Kennedy and O'Hagan, 2001). These formulations may improve the estimation of the parameters in the case where complicated noise and biases are present. The additional discrepancy term can accommodate biases that depend on the initial conditions. Also the Bayesian approach naturally leads to an assessment of the uncertainties. Combining Gaussian processes and information from derivatives is also possible (O'Hagan, 1992; Morris *et al.*, 1993; Mitchell *et al.*, 1994; Solak *et al.*, 2003).

**Jianhua Huang** (*Texas A&M University, College Station*) **and****Yanyuan Ma** (*Université de Neuchâtel*)

We are glad to have the opportunity to discuss this stimulating and exciting paper. We tried to approach the problem from the viewpoint of familiar *M*-estimation. To simplify the notation and to focus on the main idea, consider the case of only one equation. With the basis expansion *x*(*t*)=**c**^{T}**φ**(*t*), the penalized least squares criterion function is

Σ_{i}{*y*_{i}−**c**^{T}**φ**(*t*_{i})}^{2}+*λ* ∫ {d*x*/d*t*−*f*(*x*,*t*|*θ*)}^{2} d*t*,
minimization of which for fixed *λ* gives a joint estimation of **c** and *θ*. Potential overfitting of the data that is caused by a high dimensional parameter **c** is avoided owing to the second term in the criterion function, where a large *λ* can reduce significantly the effective dimension of **c**.

Next we report some results from an experiment on fitting the FitzHugh–Nagumo equations in example 3.1 by using the software that was kindly provided by the authors. Motivated by many real biomedical data sets where only a sparse sample is available, we considered a sparse sampling of the profiles of *V* with only 21 observations. Fig. 15(a) shows that the parameter estimate can be seriously biased. However, when we reran the program but increased the number of bases in the collocation method to 10 times the sample size, we obtained reasonable estimation, as shown in Fig. 15(c). This prompted us to believe that the number of bases that are used in collocation should be decided by the essential nature of the ordinary differential equation instead of just the number of observations. Our belief is reinforced by the results from using 21 and 201 basis functions to fit data sampled at 201 time points as given in Figs 15(b) and 15(d). Our finding indicates an important difference between pure data smoothing and smoothing in parameter estimation for ordinary differential equations.
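The interplay between the basis expansion and the penalty can be sketched for the linear test equation d*x*/d*t*=−*θx*, for which the minimization over **c** at fixed *θ* is a linear solve; the monomial basis, the Riemann-sum quadrature and all parameter values below are hypothetical choices for illustration, not the authors' collocation scheme:

```python
import numpy as np

def inner_fit(t, y, theta, lam, n_basis=8):
    """Minimize the penalized criterion over the coefficients c for the
    linear test equation dx/dt = -theta*x, with the (hypothetical) monomial
    basis phi_k(t) = t^k.  Both terms are then quadratic in c, so the
    minimizer solves a ridge-type linear system; the penalty integral is
    approximated by a Riemann sum on the observation grid."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    Phi = np.vander(t, n_basis, increasing=True)             # phi_k(t_i) = t_i^k
    dPhi = np.hstack([np.zeros((len(t), 1)),
                      Phi[:, :-1] * np.arange(1, n_basis)])  # d/dt of t^k
    R = dPhi + theta * Phi          # rows approximate dx/dt - f(x | theta)
    h = t[1] - t[0]
    A = Phi.T @ Phi + lam * h * (R.T @ R)
    c = np.linalg.solve(A, Phi.T @ y)
    return c, Phi @ c
```

With enough basis functions the fit tracks data generated by *x*(*t*)=exp (−*t*/2) almost exactly, while three basis functions cannot, echoing the observation above that the number of bases should be decided by the nature of the ODE rather than by the sample size.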

**Edward L. Ionides** (*University of Michigan, Ann Arbor*)

The authors are to be congratulated for their elegant approach to reconciling mechanistic dynamic models with time series data. Their methodology appears to be readily applicable to a range of challenging inference problems. I would like to compare and contrast the deterministic dynamic modelling approach, which was adopted by the authors, with a stochastic dynamic modelling approach. For the sake of discussion, ordinary differential equations (ODEs) can be compared with stochastic differential equations (SDEs), though similar considerations will apply to other models, such as Markov chains.

A drawback of the authors’ method is that the fitted model is not readily apparent. One may be led to interpret the fitted model as an ODE with the estimated parameter vector, but of course the trajectories that are fitted to the data do not perfectly follow this ODE. There is allowance for some deviation, which is controlled by the parameter *λ*, and this deviation may be important for both the qualitative and the quantitative behaviour of the system. The differences between stochastic dynamic models and their approximating ODEs, the latter being termed the ‘deterministic skeleton’ of the model, have been found to be relevant in ecological systems (Coulson *et al.*, 2004). One related issue is, how should trajectories be simulated from the fitted model? In the context of the tank reactor, for example, it would seem desirable if the variability between simulated trajectories were comparable with the variability between replications of the experiment. Additionally, such simulated trajectories should be available to a researcher who is aware of only the reported values of the estimated parameters and *λ*.

One way around these difficulties is to consider the equivalent SDE, which is given by the authors in Section 5.2, as the fitted model. The authors are reluctant to do this since ‘lack of fit in non-linear dynamics is due more to misspecification of the system under consideration than to stochastic inputs’. I would argue that it should be acceptable to interpret the noise as model misspecification combined with random variation; such interpretations are certainly routine in linear regression, for example. Quite general methods exist for carrying out inference in the context of partially observed non-linear SDE systems (Ionides *et al.*, 2006). However, the authors’ penalized spline approach has considerable computational advantages that should motivate future work into clarifying the relationship between the penalized splines and comparable SDE models.

**Satish Iyengar** (*University of Pittsburgh*)

I congratulate the authors for bringing to the attention of the statistics community methods of inference for differential equations.

Early in their paper, the authors mention the case ‘when only a subset of variables of a system is actually measured…’. I suspect that this case is quite common in many areas. It typically leads to the non-identifiability of parameters of the model. We encountered this problem in our studies of varying spike rates in certain monkey interneurons (Czanner, 2004; Czanner *et al.*, 2007). We fit a leaky integrate-and-fire model (Liu and Wang, 2001) for the (observed) membrane potential *V* and the (latent) intracellular calcium concentration *X*. The model has the form

where the two driving Brownian motions are independent. On firing, *V* returns to its reset potential and *X* is increased by a constant to model the resulting calcium influx. In the discretized version, there are about a dozen parameters, with the number of identifiable functions of the parameters depending on the details of the experiment. A careful study of what those identifiable functions are can be used to suggest auxiliary experiments that are needed to estimate the original parameters. However, determining the identifiable parameters can be a rather involved task. Widely applicable approaches to do that would be useful.

**Robert E. Kass** (*Carnegie Mellon University, Pittsburgh*) **and Jonathan E. Rubin and Sven Zenker** (*University of Pittsburgh*)

The fitting of differential equations to data has an illustrious history in neuroscience, but further progress requires solutions to several important problems. For example, in their pioneering work, Hodgkin and Huxley (1952) modelled action potential generation in the space-clamped squid giant axon by fitting parameters in a system coupling the voltage equation

- *C* d*V*/d*t*=*I*(*t*)−*I*_{Na}(*V*,*m*,*h*)−*I*_{K}(*V*,*n*)−*I*_{L}(*V*)   (45)

to equations for auxiliary variables *m*, *h* and *n* each of the form

- d*x*/d*t*=*α*_{x}(*V*)(1−*x*)−*β*_{x}(*V*)*x*.   (46)

Each pair (*α*_{x}(*V*),*β*_{x}(*V*)) incorporates five parameters, whereas the voltage-dependent currents in equation (45) include four (*I*_{Na}), three (*I*_{K}) and two (*I*_{L}) parameters respectively. Together, equations (45) and (46) contain 27 parameters.
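At a clamped voltage the gating equation (46) has a closed-form solution, which is useful for checking numerical schemes; the rate values in the sketch below are arbitrary illustrative numbers, not Hodgkin–Huxley estimates:

```python
import math

def gate_response(alpha, beta, x0, t):
    """Closed-form solution of the gating equation
        dx/dt = alpha*(1 - x) - beta*x
    at fixed rates: exponential relaxation to x_inf = alpha/(alpha + beta)
    with time constant tau = 1/(alpha + beta)."""
    x_inf = alpha / (alpha + beta)
    tau = 1.0 / (alpha + beta)
    return x_inf + (x0 - x_inf) * math.exp(-t / tau)

def gate_numeric(alpha, beta, x0, t, n=1000):
    """Fixed-step RK4 integration of the same equation, as a cross-check."""
    f = lambda x: alpha * (1.0 - x) - beta * x
    h = t / n
    x = x0
    for _ in range(n):
        k1 = f(x)
        k2 = f(x + h / 2 * k1)
        k3 = f(x + h / 2 * k2)
        k4 = f(x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x
```

The parameters of each (*α*_{x},*β*_{x}) pair enter the dynamics only through these voltage-dependent rates, which hints at one source of the non-identifiability discussed below.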

An immediate issue is that the parameter values in equations (45) and (46) are not uniquely determined from readily available data, i.e. a naïve statistical model will be non-identifiable. The best solution is to obtain additional data (as Hodgkin and Huxley (1952) did), but this is often impractical. Methodologically, two things are needed:

- (a)
a method for checking whether the statistical model is identifiable and,

- (b)
when it is not, a constructive method for proceeding.

Item (a) has been discussed in the optimization literature (e.g. Nocedal and Wright (2006)). Item (b) is generally more difficult. One possibility is to simplify models to reduce the number of parameters. This may be disadvantageous in situations where there is a direct correspondence between model structure and physiological interpretation, as inference about physiological parameters is often the objective, whereas non-uniqueness of parameter vectors may reflect physiological reality (Prinz *et al.*, 2004). For such scenarios, local optimization methods like that presented by Ramsay, Hooker, Campbell and Cao are of limited use. An alternative is to apply simulation-based Bayesian inference to compute a (potentially multimodal) posterior density on the parameter vector and thereby to quantify uncertainty about lower dimensional parameter subsets of interest (Zenker *et al.*, 2006).

We hope that the interesting overview by Ramsay, Hooker, Campbell and Cao will succeed in drawing attention to this important class of problems. The large body of literature on collocation methods should be considered carefully. This may be a case in which the field of statistics will advance most rapidly by incorporating results from other mathematical disciplines, via collaborative research that delves deeply into particular scientific problems.

**Stefan Körkel** (*Humboldt-Universität, Berlin*)

In this paper, the authors present an approach for the estimation of parameters in non-linear differential equation models.

For the parameterization of the differential equations, a collocation method is applied with an expansion of the state solution in terms of basis functions introducing the collocation coefficients as additional *nuisance* parameters.

The data fitting criterion, a negative log-likelihood of the observation error distribution, is augmented by a regularization term, the *equation fidelity*: a norm of the differential equation residual, numerically approximated by a quadrature formula. The two parts of this objective function are weighted by a *smoothing multiplier* *λ* to control the relative emphasis on fitting the data and solving the model equations.

The authors propose to solve the optimization problem for parameter estimation in a hierarchical way: an outer optimization with respect to the *structural* equation parameters is performed subject to an underlying inner optimization with respect to the collocation coefficients for fixed equation parameters.

The choice of the smoothing parameter *λ* is crucial for the robustness of the method. In the numerical examples presented, the authors could find suitable values by manual adjustment. Alternatively, they suggest an automatic iterative strategy based on the idea of preventing the regularization from distorting the estimate. The behaviour for *λ*→∞ is studied and shows that the approach behaves naturally in this limit.

The method that is presented by the authors exhibits robustness and flexibility. This is demonstrated for four examples. The first two are academic test problems: the FitzHugh–Nagumo equation system, which leads to a very non-convex least squares estimation problem, and the tank reactor equation system, which, for particular experimental settings, behaves close to instability. Moreover, the method is applied to two real data examples: nylon production and flare dynamics in lupus.

For all these examples, appropriate smoothing can be found and parameter estimates can be obtained from quite noisy data and in situations where not all model states can be observed. The choice of the initial guesses for the parameters is not critical at all. By comparison, for problems with such highly non-convex least squares fitting criteria, Gauss–Newton methods are often unusable because of their small convergence regions.

The hierarchical optimization approach presented requires greater computational effort than an all-at-once approach, but it provides a very robust method for the estimation of parameters in intricate non-linear situations.
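The hierarchical scheme can be illustrated on the linear test equation d*x*/d*t*=−*θx*, for which the inner optimization over the collocation coefficients reduces to a linear solve; the monomial basis, the grid search standing in for the outer iteration and all parameter values below are toy assumptions, not the authors' implementation:

```python
import numpy as np

def cascade_estimate(t, y, lam, thetas, n_basis=8):
    """Two-level estimation for the toy equation dx/dt = -theta*x with a
    monomial basis: the inner level solves for the collocation coefficients
    c at fixed theta (a linear system here), and the outer level minimizes
    the pure data fitting criterion H(theta) over a grid of theta values."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    Phi = np.vander(t, n_basis, increasing=True)
    dPhi = np.hstack([np.zeros((len(t), 1)),
                      Phi[:, :-1] * np.arange(1, n_basis)])
    h = t[1] - t[0]
    best_H, best_theta = None, None
    for theta in thetas:
        R = dPhi + theta * Phi                      # ODE residual operator
        A = Phi.T @ Phi + lam * h * (R.T @ R)
        c = np.linalg.solve(A, Phi.T @ y)           # inner optimization in c
        H = float(np.sum((y - Phi @ c) ** 2))       # outer criterion in theta
        if best_H is None or H < best_H:
            best_H, best_theta = H, theta
    return best_theta
```

The point of the sketch is that the outer criterion depends on *θ* only through the inner solution **c**(*θ*), which is what makes the hierarchical formulation both robust and more expensive than an all-at-once solve.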

**Reg Kulperger** (*University of Western Ontario, London*)

I congratulate the authors on their proposal of a very useful and practical method. Their ideas of projecting the differential equation (DE) solution onto a linear space through expression (7) and of not having to find the coefficients *c*(*θ*) explicitly in terms of *θ* are the key elements. It is impressive that their method works amazingly well, and in some cases with data on only a subset of the components. The real example in Section 4.1 shows a very good fit of the data and estimated DE solutions.

In Section 3.1 you have chosen the standard deviation to be 0.5. How stable is the estimation over different noise levels? It is reasonable to hope for good estimates with small noise, but at what level does the estimation break down?

Fig. 6, and the discussion around it, suggests a practical way of choosing *λ*, the penalty tuning parameter. In an Akaike or Bayes information criterion the penalty is a function of the sample size *n*. The penalty in this paper is more in the spirit of spline regression but does not explicitly involve the sample size *n* or the level of noise. Are these implicitly reflected in the tuning parameter *λ*?

The estimator variances that are approximated by expressions (18) or (24) are compared in a simulation study in Section 3.2. They are first-order delta method approximations, and they perform very well in these examples compared with the actual sample standard deviation in the simulation experiment for the two models in Section 1.2. How do you expect this approximation to behave in other model applications and different noise levels?

Section 5 raises some other interesting questions. Does a lagged equation model also require data at offsets *δ*, or is it possible for the data still to be irregularly spaced or at least not depending on *δ*? If the former is needed then *δ* is a number that must be known and not estimated. Equivalently, is *δ* identifiable?

The stochastic DE (SDE) that is described in Section 5.2 is considered with diffusion term *λ* d*W*(*t*) (where *λ* is not the same as the penalty parameter). In general *X*(*t*)=*E*{*x*(*t*)} does not satisfy the noise-free DE d*X*(*t*)/d*t*=*f*{*X*(*t*),*u*,*t*|*θ*}. These SDE processes have quite different dynamics from those of the regression form that is described here. Is there some analogous method for an SDE setting, or are these a different class of estimation problems?
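The point that *X*(*t*)=*E*{*x*(*t*)} need not satisfy the noise-free DE is easy to demonstrate by Monte Carlo for a bistable toy SDE; the drift, noise level and all numerical choices below are illustrative assumptions:

```python
import math
import random

def sde_mean(x0, sigma, t_end, n_paths=2000, h=0.01, seed=42):
    """Euler-Maruyama Monte Carlo estimate of E{x(t_end)} for the bistable
    toy SDE dx = (x - x^3) dt + sigma dW, started at x0."""
    rng = random.Random(seed)
    steps = int(round(t_end / h))
    sq = math.sqrt(h)
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for _ in range(steps):
            x += (x - x ** 3) * h + sigma * sq * rng.gauss(0.0, 1.0)
        total += x
    return total / n_paths

def ode_solution(x0, t_end, h=0.01):
    """RK4 solution of the corresponding noise-free DE dx/dt = x - x^3."""
    f = lambda x: x - x ** 3
    x = x0
    for _ in range(int(round(t_end / h))):
        k1 = f(x)
        k2 = f(x + h / 2 * k1)
        k3 = f(x + h / 2 * k2)
        k4 = f(x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x
```

Started at *x*_{0}=0.5, the deterministic solution settles near the equilibrium at 1, while the noise pushes a fraction of the sample paths into the basin of the equilibrium at −1, so the Monte Carlo mean falls well below the deterministic trajectory.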

**Subhash Lele** (*University of Alberta, Edmonton*)

This paper proposes a method to confront non-linear dynamical models with real data so that they provide not just pretty pictures and qualitative understanding, but also quantitative predictions and model adequacy measures.

The method that is developed in this paper is intuitive and appealing but somewhat *ad hoc*.

- (a)
How does the choice of the number and the form of basis functions affect the estimates?

- (b)
Do the standard errors and resultant confidence intervals reflect the amount of approximation that is involved in equation (7) and, if so, how?

- (c)
The method is based on estimating functions but it is unclear whether the resultant estimating functions are, in fact, zero unbiased or not. Are they information unbiased? If not, the asymptotic variances should be based on Godambe information rather than Fisher information.

- (d)
What kind of asymptotics are appropriate: infill asymptotics, increasing domain asymptotics or both (Cressie, 1991)?

- (e)
Can we use resampling techniques to obtain robust standard errors?

- (f)
In population dynamics models, there is demographic stochasticity and environmental stochasticity (Lande *et al.*, 2003). Can the methodology that is developed in this paper be useful for such models?

- (g)
With hidden layers in the model, how would you know that the parameters that you are trying to estimate are, in fact, identifiable?

Recently, extending the work of Robert and Titterington (1998), I, jointly with my colleagues, have developed a technique, which is called data cloning, to conduct likelihood inference for hierarchical models (Lele *et al.*, 2007). Data cloning is based on the simple idea that, as the sample size increases, posterior distributions converge to a Gaussian distribution with mean equal to the maximum likelihood estimate and variance equal to the inverse of the Fisher information. One can artificially increase the sample size by cloning the data several times. Then, a standard application of Markov chain Monte Carlo methods provides the maximum likelihood estimate along with its standard error. We are currently using the data cloning method to conduct inference for stochastic population dynamics models for single or multiple populations such as the Lotka–Volterra model. We are also using data cloning to conduct inference for epidemiological models such as the susceptible–infected–recovered model. One of the major advantages of the data cloning method is that it provides a simple check for the identifiability of the parameters. We have found that the initial conditions are, in general, very difficult to estimate (if identifiable). But, otherwise, the data cloning method is computationally quite fast.
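The mechanics of data cloning can be seen exactly in a conjugate-normal toy model, where the cloned posterior is available in closed form; this is an illustration of the limiting principle only, not of the Markov chain Monte Carlo implementation in Lele *et al.* (2007):

```python
def cloned_posterior(y, sigma, tau, k):
    """Exact posterior for mu in the conjugate model y_i ~ N(mu, sigma^2)
    with prior mu ~ N(0, tau^2), after the data have been cloned k times.
    The posterior precision is 1/tau^2 + k*n/sigma^2, so as k grows the
    posterior mean tends to the sample mean (the maximum likelihood
    estimate) and k * variance tends to sigma^2/n, the inverse Fisher
    information.  Returns (mean, variance)."""
    n = len(y)
    prec = 1.0 / tau ** 2 + k * n / sigma ** 2
    mean = (k * sum(y) / sigma ** 2) / prec
    return mean, 1.0 / prec
```

In practice the cloned posterior is explored by Markov chain Monte Carlo sampling rather than in closed form, but the limiting behaviour as the number of clones grows is the same.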

**Lang Li** (*Indiana University, Indianapolis*)

I congratulate the authors for their breakthrough in parameter estimation problems for differential equations. I would also like to express my appreciation for the effort of the Royal Statistical Society. This pioneering paper advocates the integration of cutting edge statistics and traditional mathematics.

As a statistician working exclusively in the pharmacology area, I can see an immediate application of this smoothing approach to pharmacokinetics models. Besides the work by Gelman (1996) that is referred to in the text, more comprehensive reviews of statistical and computational work in pharmacokinetics models can be found in Davidian and Giltinan (2003) and Pillai *et al.* (2005). It is worthwhile to mention that in Li (2002, 2004) the non-linear relationships between pharmacokinetics parameters and covariates were modelled by cubic splines. These works were probably the earliest to integrate smoothing techniques and differential equations in pharmacokinetics models. To date, all pharmacokinetics model fitting has been based on the numerical solution of a differential equation when the analytical solution is not available.

Now, the generalized smoothing approach has fundamentally changed the paradigm of parameter estimation for differential equations. It transforms a fragmented numerical procedure into a unified non-linear regression. As the authors claim in the paper, the computational stability is much improved. I think that this is a major improvement.

Computational speed is obviously a critical factor for more general usage. When not all the response variables in the ordinary differential equations are measurable, the unmeasured variables still need to be solved for from the ordinary differential equations, and they will be used in the penalty term. According to the current smoothing parameter selection strategy, the model may need to be fitted to the data multiple times. Hence, it is not clear whether or not its computational expense is lower than that of the other approaches. In all pharmacokinetics models, only blood samples can be assayed; the drug concentrations in all the other organs or peripheral compartments cannot be directly measured. Therefore, an evaluation of the speed of the various approaches is absolutely necessary.

One important application of the pharmacokinetics model is prediction. It will be interesting to see whether or not the generalized smoothing approach can improve predictions.

**Sylvain Sardy** (*Université de Genève*)

The backbone of the proposed methodology is the basis expansion representation of the output functions, so that each function and its derivative are linear forms of the same coefficients **c**_{i}. Besides solving the non-trivial optimization problem, providing variance estimates is also an achievement. The authors’ substantial work is the source of many research directions for statisticians, such as non-parametric estimation.

The authors essentially solve the least squares problem for systems of differential equations by letting *λ*_{i} become large in equation (13). In the limit, no regularization is performed: they solve the constrained problem, much as in linear regression one could solve an equality-constrained least squares problem by successively solving penalized versions of it and letting *λ*→∞. This observation leads to two points. First, the constrained optimization could be solved efficiently by handling the constraints directly. Second, if the true parametric equations are not completely known, the usual practice is to do model selection. Take the FitzHugh–Nagumo equations (2) for instance: we could start with the richer model

and estimate a sparse vector of coefficients *θ*=(*a*,*c*,*d*,*e*,*f*,*g*,*h*) while satisfying the constraints that are imposed by the differential equations. A possible model selection strategy consists of solving a lasso-type *l*_{1}-penalized least squares problem. The convex *l*_{1}-penalty on *θ* may also have the advantage of removing some of the ripples of Fig. 2. Solving the constrained *l*_{1}-penalized least squares problem is a worthy challenge for achieving model selection for systems of non-linear differential equations. Finally, by increasing the dimension of *θ* with the sample size, non-parametric estimation becomes possible.
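The lasso-type selection Sardy sketches can be illustrated, in a much simpler unconstrained setting, by fitting an *l*_{1}-penalized linear model with coordinate descent and soft thresholding. The data, penalty level and the assumption that only two of four candidate terms are active are all illustrative choices, not Sardy's constrained formulation:

```python
# Minimal lasso by coordinate descent: minimize
#   (1/2) * sum (y - X beta)^2 + lam * sum |beta_j|.
def soft(z, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso(X, y, lam, sweeps=200):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # residual with coordinate j removed from the fit
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            num = sum(X[i][j] * r[i] for i in range(n))
            den = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft(num, lam) / den
    return beta

# y is generated (almost) as 3*x1 - 2*x2; x3 and x4 are inert candidates
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0], [2, 1, 0, 1], [1, 2, 1, 1]]
y = [3.0, -2.0, 1.05, 4.0, -0.95]
beta = lasso(X, y, lam=0.5)
print(beta)   # the l1 penalty shrinks coefficients on the inert terms
```

Each coordinate update exactly minimizes the penalized objective in one coordinate, so the objective decreases monotonically from the zero start; a constrained variant would add the ODE-induced equality constraints to this problem.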

**Hulin Wu** (*University of Rochester*)

I congratulate Professor Ramsay and his colleagues on their stimulating paper that introduces the inverse problem of ordinary differential equations (ODEs) to the statistical research community. The problem of predicting the results of measurements for a given ODE model is called the *forward problem*. The *inverse problem* is to use the measurements of state variables to estimate the parameters in the ODE model. This paper reflects the important effort to promote more statistical research to address the statistical inverse problem for differential equation models. The inverse problem for ODE models is a long-standing problem in the mathematical modelling research community, but it is less familiar in the statistical research community. However, this is an area in which statisticians can make significant contributions. Mathematicians and engineers have made great progress in addressing the ODE inverse problem, but mostly from theoretical perspectives and on the basis of the standard least squares principle (Anger, 1990; Lawson and Hanson, 1995; Englezos and Kalogerakis, 2001; Tarantola, 2005; Aster *et al.*, 2005; Li *et al.*, 2005). Modern statistical techniques have not been widely used in this field.

Ramsay and his colleagues introduced an interesting smoothing-based profiling estimation procedure to estimate parameters in ODE models. This method avoids numerically solving the ODEs, which is a good feature compared with the least squares method. The proposed penalized log-likelihood and least squares criteria (14) and (15) are weighted ‘goodness-of-fit’ measures with respect to the observed data and to the ODE model. This indicates that both the observations and the ODE model have errors, and the proposed criteria are an attempt to trade off these two errors. Thus, the optimal weight (*λ*) should depend on the relative magnitudes of the observation error and the ODE model error. Hence, there is a need to introduce the model error into the specification of the ODE model, similarly to Kalman filtering in state space models (Stengel, 1994).
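The trade-off Wu describes can be sketched in a heavily simplified pure-Python form for the scalar model d*x*/d*t* = −*θx*: represent *x* by its values on a time grid (rather than the authors' B-spline basis), minimize a data-fit term plus *λ* times the squared ODE residual for each candidate *θ*, and profile the data fit over *θ*. The grid, *λ*, and the deterministic "noise" below are illustrative assumptions:

```python
import math

# Penalized fit + profiling sketch for dx/dt = -theta * x, true theta = 1.
h, n, lam = 0.2, 16, 5.0
t = [i * h for i in range(n)]
y = [math.exp(-ti) + 0.02 * math.sin(9 * ti) for ti in t]  # noisy data

def inner_fit(theta, steps=2500, lr=0.008):
    """Minimize sum (y-x)^2 + lam*h*sum ((x[i+1]-x[i])/h + theta*x[i])^2
    over the grid values x by gradient descent (quadratic, so it converges)."""
    x = y[:]                                   # start at the data
    for _ in range(steps):
        g = [2.0 * (x[j] - y[j]) for j in range(n)]
        for i in range(n - 1):
            r = (x[i + 1] - x[i]) / h + theta * x[i]   # ODE residual
            g[i] += 2.0 * lam * h * r * (theta - 1.0 / h)
            g[i + 1] += 2.0 * lam * h * r / h
        x = [x[j] - lr * g[j] for j in range(n)]
    return x

def profile_sse(theta):
    """Outer criterion: pure data fit of the inner penalized smooth."""
    x = inner_fit(theta)
    return sum((y[j] - x[j]) ** 2 for j in range(n))

grid = [0.2 + 0.1 * k for k in range(19)]      # theta in [0.2, 2.0]
theta_hat = min(grid, key=profile_sse)
print("profiled estimate:", theta_hat)         # lands near the true value 1
```

When *θ* is near its true value the penalty barely pulls the smooth away from the data, so the outer data fit is smallest there; a wrong *θ* forces a compromise between the two error sources, exactly the trade-off that the choice of *λ* governs.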

It is worthwhile to point out that there are a few publications on ODE parameter estimation in the statistical literature. For example, Li *et al.* (2002) proposed a spline-based estimation method to estimate the time varying parameters in a pharmacokinetic (ODE) model for longitudinal data, whereas Chen and Wu (2007) proposed a local kernel smoothing-based two-step estimation method to estimate time varying parameters in ODE models. Huang and Wu (2006) and Huang *et al.* (2006) employed a hierarchical Bayesian approach to estimate kinetic parameters in ODE models for longitudinal data.

The **authors** replied later, in writing, as follows.

We thank the Royal Statistical Society for providing the venue for this paper and its discussion, and the discussants for their many insightful comments from so many different backgrounds. There seems to be near universal agreement on the lack of good statistical methodology for estimation and inference in non-linear dynamics and on the need for greater involvement from the statistical community in these problems. The generation of interest may be the most important contribution of our paper. We are, of course, not the only statisticians to have worked in this field, and we thank the discussants for adding to our references to previous work. The range of ideas in the commentaries indicates the breadth of research problems that remain open, and we look forward to exciting times. From among the many issues raised, we have selected a few that especially require further comment.

#### Choosing *λ* for inference and prediction

Smoothing parameter choice is clearly the most vexing aspect of our method. We do not have ready answers, and in fact we think that interesting answers will have to wait for a tighter specification of the questions. For example, Gu points out that there are two distinct and often contradictory goals here. The smoothing objective of representing the observed trajectories well will often require somewhat smaller values of *λ* than will the problem of estimation of the parameters *θ*.

Criteria such as cross-validation, generalized or not, are too tightly tied to data smoothing to be reliable routes to optimal parameter estimation. More generally, a data smoother is only one example of a function *g*(*θ*) that may be the actual target of the experiment and subsequent data analysis, and where we judge the quality of the estimate of *θ* by the usefulness of the resulting *g*. Ellner raises the important question of *extrapolation*, either further forward in time or for new runs, given that our smooth is not a direct solution to the ordinary differential equation (ODE). We are intrigued by his ideas, and we note their resemblance to the path following techniques that are described by Smith. We look forward to seeing further developments; a particular question would be how far ahead we should look.

When the smooth and the ODE do not coincide, we suggest that it is the smooth that should be taken to represent the actual trajectory of the system. However, the discrepancy between the two can be used as a diagnostic for potentially misspecified measurement processes, such as in the autoregressive integrated moving average structure that is highlighted by Olhede.

#### Stochastic differential equations

We warmly agree with the many commentators who insist that no experiment is completely deterministic and free of external influences. Many of them have pointed out the resemblance of our methods to stochastic differential equations (SDEs), where the usual notation is

d*X*_{t}=**f**(*X*_{t},*t*) d*t*+*σ*(*X*_{t},*t*) d*B*_{t},

with d*B*_{t} being the innovation distribution and *σ*(*X*_{t},*t*) specifying its standard deviation, possibly varying with the process level *X*_{t} and otherwise with respect to time *t*.
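An SDE of this form can be simulated by the standard Euler–Maruyama scheme, which replaces d*B*_{t} by independent N(0, d*t*) increments. The Ornstein–Uhlenbeck drift −*X*_{t}, the value of *σ* and the step size below are illustrative choices, not a model from the paper:

```python
import math
import random

# Euler-Maruyama simulation of dX_t = -X_t dt + sigma dB_t.
random.seed(1)
sigma, dt, nsteps = 0.3, 0.01, 1000
x = 1.0
path = [x]
for _ in range(nsteps):
    # dB_t increment has standard deviation sqrt(dt)
    x = x + (-x) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
    path.append(x)
print("final value:", path[-1])
```

Estimating parameters from such a simulated path is precisely the setting in which the smoothing-based methods of the paper and classical SDE inference meet.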

The innovation distribution in SDEs is intended to account for random variation and unobserved influences. As Ionides points out, it may also be used to account for model misspecification, although this needs to be taken in conjunction with diagnostics for systematic lack of fit. As with serial correlation in linear regression, some care needs to be taken in evaluating the appropriateness of the innovation distribution. Here, Smith's methods offer some interesting diagnostic ideas, and we would like to see whether they could also be used to suggest some form of serial correlation as, for example, in an integrated Gaussian process. Allowing the innovation distribution to vary over state space (another intriguing idea) could be incorporated in the smoothing methods that we describe, but we are cautious about overcomplicating models without good reason.

The connection between penalized splines and Gaussian processes has long been recognized, and formalized, for example, in Wahba (1990). We are working on extending these results to non-linear penalties of the type that we use, and we thank Ionides for his encouragement. An alternative approach to estimating SDEs by using smoothing could be to represent the innovation distribution through a basis expansion **c**^{′}*φ*(*t*), where the **c** are random effects in the spirit of Ruppert *et al.* (2005). Our estimation procedure would then look like conditional inference in a non-linear mixed model. This opens the door, for example, to restricted maximum likelihood type estimates of *λ*. Such an approach reintroduces some of the numerical difficulties that we sought to avoid, but we are exploring how mixed model ideas could be translated into our methods.

#### Diagnostics

No models are perfect and Kovac, Olhede, Smith, Boker, and Chow and Tiberio have all pointed out the need for good methods to suggest model improvements. Kovac and Olhede both suggest methods for finding serial correlation that may need to be accounted for via a change in the likelihood, or for finding regime changes which would motivate a change in parameter values. Chow and Tiberio would use the penalty as a way of checking for misspecification on the *derivative* scale, which would give us direct access to where the model may be wrong structurally. We have developed this idea in Hooker (2007), including examining some identifiability issues. One problem is the vast range of model modifications that are possible in a non-linear dynamic model, and we advise particular caution and consultation with domain experts.

#### Extensions

A large range of further models to which our methods could be applied has been mentioned by commentators. We would like to point out that some of the desired functionality is already available in the publicly provided software, although it has not been directly addressed in the paper. In particular, we allow *θ* to be penalized by a twice differentiable function. This provides a way to include a Bayesian prior (Kass and Guillas), parameters that vary smoothly over time (Earn, Kovac and Smith) and mixed models over experiments (Li, and Donnet and Samson). We also allow for mixtures of derivatives, including zero order.

Unfortunately, the choice of norms that is desired by Biegler and Olhede is not available in current software, but we agree that this represents an important area of software development. The partial differential equations that are desired by Bampfylde are more problematic, both in terms of implementation and in terms of theory. Unlike ODEs, partial differential equation boundary conditions are infinite dimensional and must be constrained in some way to ensure that the problem is identifiable. This seems like a fascinating area for future work.

#### Bayesian methods and identifiability

Bayesian analysis (Dowd, Guillas, Kass and Wu) has been used in some of the most successful applications of statistical methods to non-linear dynamic systems. This is at least partly due to the ease of implementation of Markov chain Monte Carlo computation and its ability to side-step the issue of parameter identifiability. However, our own experience is that the local minima that plague non-linear least squares methods are also a problem for a Bayesian approach, so one must be cautious in concluding that the Markov chain has converged.

In recent work, Campbell (2007) has adapted our relaxed fit smoothing to a collocation tempering approach with Markov chain Monte Carlo methods. In this parallel chain Markov chain Monte Carlo algorithm, one chain uses the solution to the ODE **x**_{θ}(*t*) as the location parameter in the likelihood. The remaining parallel chains are constructed by substituting **x**_{θ}(*t*) with smooth approximations to the ODE solution **x**_{θ,λ}(*t*)=**c**(*θ*,*λ*)^{′} *φ*(*t*), where *λ* is fixed within each chain. Parameters are allowed to swap between parallel chains, similarly to parallel tempering (Geyer, 1991), leading to improved convergence and stability. Furthermore, the combination of chains using **x**_{θ}(*t*) and **x**_{θ,λ}(*t*) allows inference on *θ* and the fits to the data from the deterministic model and a relaxed smooth.

There is a substantial literature on identifiability in ODEs, as picked up by Kass and Iyengar, and it is not difficult to find systems which are unidentifiable. A simple diagnostic is to examine the Fisher information matrix at the current parameters, as do Wu *et al.* (2007) for the dynamics of human immunodeficiency virus.
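The Fisher information diagnostic can be illustrated with a deliberately unidentifiable toy model of our own choosing (not the human immunodeficiency virus model of Wu *et al.*): with mean function *a b t*, the parameters enter only through their product, so the information matrix is singular:

```python
# Fisher information diagnostic for E[y_t] = a * b * t under unit-variance
# Gaussian errors: the information matrix is J^T J for the Jacobian J of
# the mean function. Since a and b appear only as a product, it is singular.
a, b = 2.0, 1.5
t = [0.5, 1.0, 1.5, 2.0]                      # illustrative design points
# gradient of a*b*t with respect to (a, b) at each design point
J = [[b * ti, a * ti] for ti in t]
# information matrix (up to the noise variance): info = J^T J
info = [[sum(J[k][i] * J[k][j] for k in range(len(t))) for j in range(2)]
        for i in range(2)]
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
print("det(info) =", det)   # ~0: (a, b) are not separately identifiable
```

A determinant (or smallest eigenvalue) near zero at the current parameter values is the warning sign; the same computation applied to an identifiable reparametrization, such as estimating the product directly, gives a well-conditioned matrix.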

#### Bases

Huang and Ma, and Olhede note that, unlike traditional smoothing, a large number of basis functions may be required by our collocation approach. Deuflhard and Bornemann (2000) reviewed the literature in numerical analysis on the size of the basis. If we intend to let *λ*→∞, then it is sufficient to select a basis that is sufficiently rich to represent a solution to the ODE. Could stochastic differential equations require even richer bases? This may be possible, but we know of no work in the area.

The choice of quadrature technique is an issue, and we know that our implementation may not be optimal. Biegler, Kovac and Olhede argue for penalty norms that would allow the penalty to be explicitly set to zero for a finite *λ*, and Sardy observes that this is only possible if the quadrature rule contains no more points than basis functions. In fact, collocation methods are usually based on Gauss–Radau quadrature between knots with the same number of Legendre polynomial terms as quadrature points.

We reran the FitzHugh–Nagumo simulations using Huang and Ma's 201 observations and 201 knots but placed equally weighted quadrature points only at each knot. At *λ*=10^{4}, there was no observable difference over 200 simulations between this quadrature and the Simpson's rule that we initially employed in terms of parameter estimation bias and standard error. However, the new quadrature rule was about 100 times faster to provide answers.

However, we encountered a new issue when we smoothed a sample of simulated data at the higher smoothing level *λ*=10^{7} using the true parameters. Fig. 17 shows that Simpson's rule quadrature produces a substantial distortion in the initial shape of the path, whereas the simpler collocation regime remains indistinguishable from the true trajectory. The FitzHugh–Nagumo dynamics are, moreover, comparatively mild, and our experience is that the choices of bases and quadrature methods for systems with sharper dynamics and discontinuous inputs require considerable care.

#### Dynamical features

Dynamical features such as limit cycles, fixed points, bifurcations and chaos are central areas of interest in non-linear dynamics and, as both Brown, and Guckenheimer and Tien observe, they have played very little role in traditional parameter estimation techniques, including our own. Too little attention has been given to problems of inference about dynamical features. However, along with Kulperger, we note that dynamic behaviour can be quite different in stochastic differential equations and the analysis that is required to understand it is not necessarily easy.

Guckenheimer and Tien suggest only searching the parameter space where limit cycles exist. In general, dynamical features, when they can be readily analysed, can be incorporated in Bayesian priors. Using estimated features such as periods and peaks as data is also interesting, but methods for understanding uncertainty from this feature perspective remain to be developed.

#### Response surfaces

We are pleased to see so many commentaries on our Fig. 5: the nature of response surfaces that must be minimized has been one of the factors retarding progress in the area. In common with our approach, several methods have been developed over the years that rely on relaxing the solution to the differential equation, at least at intermediate steps. The idea of fitting cycles independently, as advocated by Guckenheimer and Tien, may be viewed as using different initial conditions for each cycle. This is similar to the methods in Bock (1983), in which the ODE is solved over adjoining small intervals, and where discontinuities at interval boundaries are successively reduced. Tjoa and Biegler (1991) also provided methods that do not explicitly solve the ODE until the final set of parameter estimates.

An explanation for why the approach provides better-conditioned minimization problems could be that, if the approximate trajectory is different from the trajectory that is given by the parameters, the response surface will be partly determined by the smoothing criterion, which is frequently more convex than the original likelihood criterion. We believe that firming up this conjecture may be useful for other difficult optimization problems.

#### Asymptotics

No statistical paper is complete without an asymptotic analysis, and we thank Olhede for providing ours. This is given in the context of a deterministic model in which the essential point is to ensure that *θ* continues to affect **x**_{θ,λ}. With infill asymptotics, we are now back to maximum likelihood theory. In the expanding domain case, the situation is somewhat more complicated, since we need to ensure that *λ* increases at a rate that is sufficiently fast to force the convergence of **x**_{θ,λ} to an exact solution; this again implies an *N*^{1+δ}-rate. To answer Lele's question, infill asymptotics with independent and identically distributed residuals about a deterministic system does not appear to be reasonable and suggests either a Gaussian process for the errors or a stochastic differential equation, or both. In such cases, neither infill nor expanding domain asymptotics alone may be sufficient to provide consistency.

#### Conclusion

We have been impressed and stimulated by the range of ideas and perspectives in the commentaries and thank the Royal Statistical Society for making this discussion possible. There appear to be several independent suggestions that might benefit from collaboration between our discussants. Many other comments will require a paper or more to address adequately. It is clear that we still have much work to do, both for this method and for inference in non-linear dynamics generally. We hope that this paper has demonstrated both the challenge and the interest in these problems, and that it inspires more statisticians to help us to solve them.