## 1. Challenges in dynamic systems estimation

### 1.1. Basic properties of dynamic systems

We have in mind a process that transforms a set of *m* input functions **u**(*t*) into a set of *d* output functions **x**(*t*). Dynamic systems model output change directly by linking the output derivatives to **x**(*t*) itself, as well as to inputs **u**:

d**x**(*t*)/d*t* = **f**{**x**(*t*), **u**(*t*), *t* ∣ **θ**}.  (1)

Vector **θ** contains any parameters defining the system whose values are not known from experimental data, theoretical considerations or other sources of information. Systems involving derivatives of *x* of order *n*>1 are reducible to expression (1) by defining new variables *x*_{1}=*x*, *x*_{2}=d*x*_{1}/d*t* and so on. Further generalizations of expression (1) are also candidates for the approach that is developed in this paper but will not be considered. Dependences of **f** on *t* other than through **x** and **u** arise when, for example, certain quantities defining the system are themselves time varying.

Differential equations as a rule do not define their solutions uniquely, but rather as a manifold of solutions of typical dimension *d*. For example, d^{2}*x*/d*t*^{2}=−*ω*^{2} *x*(*t*), reduced to d*x*_{1}/d*t*=*x*_{2} and d*x*_{2}/d*t*=−*ω*^{2}*x*_{1}, implies solutions of the form *x*_{1}(*t*)=*c*_{1} sin (*ωt*)+*c*_{2} cos (*ωt*), where coefficients *c*_{1} and *c*_{2} are arbitrary; and at least *d*=2 observations are required to identify the solution that best fits the data. *Initial value* problems supply **x**(0), whereas *boundary value* problems require *d* values selected from **x**(0) and **x**(*T*).
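This counting argument can be checked with a few lines of stdlib Python (a sketch under hypothetical values *ω*=2, *c*_{1}=1.5, *c*_{2}=−0.7): two noise-free observations determine the two free coefficients by solving a 2×2 linear system.

```python
import math

# General solution of d^2 x/dt^2 = -omega^2 x:  x(t) = c1 sin(omega t) + c2 cos(omega t).
# Two observations y1 = x(t1), y2 = x(t2) pin down (c1, c2) via a 2x2 linear system.
omega = 2.0
c1_true, c2_true = 1.5, -0.7          # hypothetical "unknown" coefficients

def x(t, c1, c2):
    return c1 * math.sin(omega * t) + c2 * math.cos(omega * t)

t1, t2 = 0.3, 1.1
y1, y2 = x(t1, c1_true, c2_true), x(t2, c1_true, c2_true)

# Solve [sin(w t1) cos(w t1); sin(w t2) cos(w t2)] (c1, c2)' = (y1, y2)' by Cramer's rule.
a11, a12 = math.sin(omega * t1), math.cos(omega * t1)
a21, a22 = math.sin(omega * t2), math.cos(omega * t2)
det = a11 * a22 - a12 * a21
c1 = (y1 * a22 - y2 * a12) / det
c2 = (a11 * y2 - a21 * y1) / det
print(round(c1, 6), round(c2, 6))
```

With fewer than *d*=2 observations the linear system is underdetermined, which is exactly the manifold of solutions described above.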

However, we assume more generally that only a subset of the *d* output variables **x**, indexed by ℐ⊂{1,…,*d*}, may be measured at time points *t*_{ij}, *i* ∈ ℐ, *j*=1,…,*N*_{i}, and that *y*_{ij} is a corresponding measurement that is subject to measurement error *e*_{ij}=*y*_{ij}−*x*_{i}(*t*_{ij}). We may call such a situation a *distributed partial data* problem. If either there are no observations at 0 and *T*, or the observations that are supplied are subject to measurement error, then initial or boundary values may be considered as parameters that must be included in an augmented parameter vector *θ*^{*}=(**x**(0)^{′},*θ*^{′})^{′}.

Solutions of the ordinary differential equation (ODE) system (1) given initial values **x**(0) exist and are unique over a neighbourhood of (0,**x**(0)) if *f* is continuously differentiable or, more generally, Lipschitz continuous with respect to **x**. However, most ODE systems are not solvable analytically, which typically increases the computational burden of data fitting methodology such as non-linear regression. Exceptions are linear systems with constant coefficients, where the machinery of the Laplace transform and transfer functions plays a role, and a statistical treatment of these is available in Bates and Watts (1988) and Seber and Wild (1989). Discrete versions of linear constant coefficient systems, i.e. stationary systems of difference equations for equally spaced time points, are also well treated in the classical time series autoregressive integrated moving average and state space literature, and will not be considered further in this paper.

The insolvability of most ODEs has meant that statistical science has had comparatively little effect on the fitting of dynamic systems to data. Current methods for estimating ODEs from noisy data, which are reviewed below, are often slow, uncertain to provide satisfactory results and do not lend themselves well to follow-up analyses such as interval estimation and inference. Moreover, when only a subset of variables in a system is actually measured, the remainder are effectively functional latent variables, a feature that adds further challenges to data analysis. For example, in systems describing chemical reactions, the concentrations of only some reactants are easily measurable and inference may be based on measurements of external quantities such as the temperature of the system.

This paper describes an extension of data smoothing methods along with a generalization of profiled estimation to estimate the parameters **θ** defining a system of non-linear differential equations. High dimensional basis function expansions are used to represent the outputs **x**, and our approach depends critically on considering the coefficients of these expansions as nuisance parameters. This leads to the notion of a *parameter cascade*, and the effect of nuisance parameters on the estimation of structural parameters is controlled through a multicriterion optimization process rather than the more usual marginalization procedure.

### 1.2. Two test bed problems

#### 1.2.1. FitzHugh–Nagumo equations

The FitzHugh–Nagumo equations were developed by FitzHugh (1961) and Nagumo *et al.* (1962) as simplifications of the Hodgkin and Huxley (1952) model of the behaviour of spike potentials in the giant axon of squid neurons:

d*V*/d*t* = *c*(*V* − *V*^{3}/3 + *R*),
d*R*/d*t* = −(*V* − *a* + *bR*)/*c*.

The system describes the reciprocal dependences of the voltage *V* across an axon membrane and a recovery variable *R* summarizing outward currents. Although not intended to provide a close fit to neural spike potential data, solutions to the FitzHugh–Nagumo ODEs do exhibit features that are common to elements of biological neural networks (Wilson, 1999).

The parameters are **θ**={*a*,*b*,*c*}, to which we shall assign values (0.2,0.2,3) respectively. The *R*-equation is a simple constant coefficient linear system with linear inputs *V* and *a*. However, the *V*-equation is non-linear: when *V*>0 is small, d*V*/d*t*≈*cV*, and *V* consequently exhibits nearly exponential increase; but, as *V* passes ±√3, the influence of −*V*^{3}/3 takes over and turns *V* back towards 0. Consequently, solutions corresponding to a range of initial values quickly settle down to alternate between the smooth evolution and the sharp changes in direction that are shown in Fig. 1.
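A minimal simulation sketch (plain Python with a hand-rolled fourth-order Runge–Kutta stepper; the initial values (*V*,*R*)=(−1,1) and step size 0.01 are our hypothetical choices) illustrates how solutions settle into this oscillatory regime:

```python
# FitzHugh-Nagumo right-hand side with theta = (a, b, c) = (0.2, 0.2, 3):
#   dV/dt = c (V - V^3/3 + R),  dR/dt = -(V - a + b R) / c
a, b, c = 0.2, 0.2, 3.0

def f(s):
    V, R = s
    return (c * (V - V ** 3 / 3 + R), -(V - a + b * R) / c)

def rk4_step(s, h):
    k1 = f(s)
    k2 = f(tuple(x + h / 2 * k for x, k in zip(s, k1)))
    k3 = f(tuple(x + h / 2 * k for x, k in zip(s, k2)))
    k4 = f(tuple(x + h * k for x, k in zip(s, k3)))
    return tuple(x + h / 6 * (p + 2 * q + 2 * r + w)
                 for x, p, q, r, w in zip(s, k1, k2, k3, k4))

state, h = (-1.0, 1.0), 0.01
V_path = []
for _ in range(2000):                 # integrate over t in [0, 20]
    state = rk4_step(state, h)
    V_path.append(state[0])

# in the limit cycle the voltage alternates sign repeatedly
sign_changes = sum(1 for u, v in zip(V_path, V_path[1:]) if u * v < 0)
print(sign_changes, max(abs(v) for v in V_path))
```

The voltage path stays bounded and repeatedly crosses zero, the qualitative behaviour shown in Fig. 1.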

A concern in dynamic systems modelling is the possibly complex nature of the fit surface. The existence of many local minima has been commented on in Esposito and Floudas (2000), and some computationally demanding algorithms, such as simulated annealing, have been proposed to overcome this problem. For example, Jaeger *et al.* (2004) reported using weeks of computation to compute a point estimate. Fig. 2 displays the integrated squared difference between the paths in Fig. 1 and those resulting from varying only the parameters *a* and *b*. The features of this surface include ‘ripples’ due to changes in the shape and period of the limit cycle and breaks due to bifurcations, or sharp changes in behaviour.
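The construction behind Fig. 2 can be sketched directly (a toy version with a hypothetical 3×3 grid and the same hand-rolled integrator, *c* held at 3): simulate the paths under perturbed (*a*,*b*) and accumulate the squared differences from the base paths.

```python
# Sum of squared differences between FitzHugh-Nagumo voltage paths at the base
# values (a, b) = (0.2, 0.2) and paths generated under perturbed (a, b); c = 3.
c = 3.0

def simulate_V(a, b, n=1000, h=0.01, s=(-1.0, 1.0)):
    """Voltage path of the FitzHugh-Nagumo system by an RK4 stepper."""
    def f(st):
        V, R = st
        return (c * (V - V ** 3 / 3 + R), -(V - a + b * R) / c)
    path = []
    for _ in range(n):
        k1 = f(s)
        k2 = f(tuple(x + h / 2 * k for x, k in zip(s, k1)))
        k3 = f(tuple(x + h / 2 * k for x, k in zip(s, k2)))
        k4 = f(tuple(x + h * k for x, k in zip(s, k3)))
        s = tuple(x + h / 6 * (u + 2 * v + 2 * w + z)
                  for x, u, v, w, z in zip(s, k1, k2, k3, k4))
        path.append(s[0])
    return path

base = simulate_V(0.2, 0.2)

def sse(a, b):
    return sum((x - y) ** 2 for x, y in zip(base, simulate_V(a, b)))

surface = {(a, b): sse(a, b) for a in (0.0, 0.2, 0.4) for b in (0.0, 0.2, 0.4)}
print(min(surface, key=surface.get))
```

On a fine grid this surface exhibits the ripples and breaks described above; here the minimum sits, as it must, at the generating values (0.2, 0.2).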

#### 1.2.2. Tank reactor equations

The chemical engineering concept of a continuously stirred tank reactor (CSTR) consists of a tank surrounded by a cooling jacket containing an impeller which stirs its contents. A fluid containing a reagent with concentration *C*_{in} enters the tank at a flow rate *F*_{in} and temperature *T*_{in}. A reaction produces a product that leaves the tank with concentration *C* and temperature *T*. A coolant in the cooling jacket has temperature *T*_{co} and flow rate *F*_{co}.

The differential equations that are used to model a CSTR, simplified by setting the volume of the tank to 1, are

d*C*/d*t* = −*β*_{CC}(*T*,*F*_{in})*C* + *F*_{in}*C*_{in},
d*T*/d*t* = −*β*_{TT}(*F*_{co},*F*_{in})*T* + *β*_{TC}(*T*,*F*_{in})*C* + *α*(*F*_{co})*T*_{co} + *F*_{in}*T*_{in}.

The input variables play two roles in the right-hand sides of these equations: through added terms such as *F*_{in}*C*_{in} and *F*_{in}*T*_{in}, and via the weight functions *β*_{CC},*β*_{TC},*β*_{TT} and *α* that multiply the output variables *C* and *T* and the coolant temperature *T*_{co}. These time-varying multipliers depend on four system parameters as follows:

where *T*_{ref} is a fixed reference temperature within the range of the observed temperatures, and in this case was 350 K. These functions are defined by two pairs of parameters: (*τ*,*κ*) defining coefficient *β*_{CC} and (*a*,*b*) defining coefficient *α*. The factor 10^{4} in *β*_{CC} rescales *τ* so that all four parameters are within [0.4,1.8]. These parameters are gathered in the vector ** θ** in system (1) and determine the rate of the chemical reactions that are involved, or the reaction kinetics.

The plant engineer needs to understand the dynamics of the two output variables *C* and *T* as determined by the five inputs *C*_{in},*F*_{in},*T*_{in},*T*_{co} and *F*_{co}. A typical experiment designed to reveal these dynamics is illustrated in Fig. 3, where we see each input variable stepped up from a base-line level, stepped down, and then returned to base-line. Two base-line levels are presented for the most critical input, the coolant temperature *T*_{co}.

The behaviours of output variables *C* and *T* under the two experimental regimes, given values 0.833, 0.461, 1.678 and 0.5 for parameters *τ*, *κ*, *a* and *b* respectively, are shown in Fig. 4. When the reactor runs in the cool mode, where the base-line coolant temperature is 335 K, the two outputs respond smoothly to the step changes in all inputs. However, an increase in base-line coolant temperature by 30 K generates oscillations that come close to instability when the coolant temperature decreases, something that is undesirable in an actual industrial process. These perturbations are due to the double effect of a decrease in output temperature, which increases the size of both *β*_{CC} and *β*_{TC}. Increasing *β*_{TC} raises the forcing term in the *T*-equation, thus increasing temperature. Increasing *β*_{CC} makes concentration more responsive to changes in temperature but decreases the size of the response. This push–pull process has a resonant frequency that depends on the kinetic constants and, when the ambient operating temperature reaches a certain level, the resonance appears. For coolant temperatures that are either above or below this critical zone, the oscillations disappear.

The CSTR equations present two challenges that are not an issue for the FitzHugh–Nagumo equations. The step changes in inputs induce corresponding discontinuities in the output derivatives, which complicate the estimation of solutions by numerical methods. Moreover, the engineer must estimate the reaction kinetics parameters in order to determine the range of coolant temperatures to avoid, but a key question is whether all four parameters are actually estimable from a particular data configuration. Step changes in inputs and near overparameterization are common problems in dynamic systems modelling.

### 1.3. Review of current ordinary differential equation parameter estimation strategies

Procedures for estimating the parameters defining an ODE from noisy data tend to fall into three broad classes: linearization, discretization methods for initial value problems and basis function expansion or collocation methods for boundary and distributed data problems. Linearization involves replacing non-linear structures by first-order Taylor series expansions; it tends only to be useful over short time intervals and for rather mild non-linearities, and will not be considered further. There is a large literature on numerical methods for solving constrained optimization problems, under which parameter estimation usually falls; see Biegler and Grossmann (2004) for an excellent overview.

#### 1.3.1. Data fitting by numerical approximation of an initial value problem

The numerical methods that are most often used to approximate solutions of ODEs over a range [*t*_{0},*t*_{1}] use fixed initial values **x**_{0}=**x**(*t*_{0}) and adaptive discretization techniques (Biegler *et al.*, 1986). The data fitting process, which textbooks often refer to as the *non-linear least squares* (NLS) method, works as follows. A numerical method such as the Runge–Kutta algorithm is used to approximate the solution given a trial set of parameter values and initial conditions, a procedure which is referred to by engineers as *simulation*. The fit of this approximate solution to the data is input into an optimization algorithm that updates parameter estimates. If the initial conditions **x**(0) are unavailable, they must be appended to the parameters **θ** as quantities with respect to which the fit is optimized. The optimization process can proceed without using gradients, or these may be approximated by solving the *sensitivity differential equations*

d/d*t*(∂**x**/∂**θ**) = (∂**f**/∂**x**)(∂**x**/∂**θ**) + ∂**f**/∂**θ**, with ∂**x**(0)/∂**θ** = **0**.

In the event that **x**(0)=**x**_{0} must also be estimated, the corresponding sensitivity equations are

d/d*t*(∂**x**/∂**x**_{0}) = (∂**f**/∂**x**)(∂**x**/∂**x**_{0}), with ∂**x**(0)/∂**x**_{0} = **I**.
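For a scalar illustration (a sketch on the hypothetical test equation d*x*/d*t* = −*θx*, whose exact sensitivity is ∂*x*/∂*θ* = −*t x*_{0}e^{−*θt*}), the sensitivity equation d*s*/d*t* = (∂*f*/∂*x*)*s* + ∂*f*/∂*θ* = −*θs* − *x* can be integrated alongside the state:

```python
import math

# Sensitivity equation for dx/dt = f(x | theta) = -theta * x:
#   ds/dt = (df/dx) s + df/dtheta = -theta * s - x,  s(0) = 0.
# Hypothetical test equation; exact sensitivity is -t * x0 * exp(-theta * t).
theta, x0 = 0.8, 2.0

def f(z):
    x, s = z
    return (-theta * x, -theta * s - x)   # state equation and sensitivity equation

def rk4_step(z, h):
    k1 = f(z)
    k2 = f(tuple(u + h / 2 * k for u, k in zip(z, k1)))
    k3 = f(tuple(u + h / 2 * k for u, k in zip(z, k2)))
    k4 = f(tuple(u + h * k for u, k in zip(z, k3)))
    return tuple(u + h / 6 * (p + 2 * q + 2 * r + w)
                 for u, p, q, r, w in zip(z, k1, k2, k3, k4))

z, h = (x0, 0.0), 0.01
for _ in range(100):                      # integrate to t = 1
    z = rk4_step(z, h)

exact = -1.0 * x0 * math.exp(-theta * 1.0)
print(round(z[1], 6), round(exact, 6))
```

The integrated sensitivity agrees with the analytic gradient, which is what an NLS optimizer would consume in place of finite differences.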

Systems for which solutions beginning at varying initial values tend to converge to a common trajectory are called *stiff* and require special methods that make use of the Jacobian ∂*f*/∂*x*.
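A two-line illustration of why stiffness forces special methods (a sketch on the hypothetical test equation d*x*/d*t* = −1000*x* with step size *h* = 0.01): the explicit Euler update multiplies the state by 1 − 1000*h* = −9 and explodes, while the backward Euler update, which uses the Jacobian implicitly, divides by 1 + 1000*h* and decays like the true solution.

```python
# Explicit vs implicit Euler on the stiff test equation dx/dt = -lam * x.
lam, h = 1000.0, 0.01
x_forward = x_backward = 1.0
for _ in range(20):
    x_forward = (1 - lam * h) * x_forward     # explicit Euler: factor -9, diverges
    x_backward = x_backward / (1 + lam * h)   # backward Euler: factor 1/11, decays
print(abs(x_forward), x_backward)
```

For a system, the backward Euler update requires solving equations involving the Jacobian ∂*f*/∂*x*, which is why stiff solvers make explicit use of it.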

The NLS procedure has many problems. It is computationally intensive since a numerical approximation to a possibly complex process is required for each update of parameters and initial conditions. The inaccuracy of the numerical approximation can be a problem, especially for stiff systems or for discontinuous inputs such as step functions or functions concentrating their masses at discrete points. The size of the parameter set may be increased by the set of initial conditions that are needed to solve the system, and the data may not provide much information for estimating them. NLS also produces only point estimates of parameters and, where interval estimation is needed, much more computation can be required. As a consequence of all this, Marlin (2000) warned process control engineers to expect an error level of the order of 25% in parameter estimates.
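The NLS loop itself can be sketched in a few lines (hypothetical test equation d*x*/d*t* = −*θx* with noise-free synthetic observations; each trial *θ* is "simulated" by RK4 and the bracket on *θ* is shrunk by golden-section search standing in for a general-purpose optimizer):

```python
import math

# Minimal NLS: simulate for trial theta, score against observations, update theta.
ts = [0.5 * j for j in range(1, 9)]
theta_true, x0 = 0.5, 2.0
ys = [x0 * math.exp(-theta_true * t) for t in ts]   # synthetic, noise-free data

def simulate(theta, t_end, h=0.01):
    """RK4 'simulation' of dx/dt = -theta * x from x(0) = x0 up to t_end."""
    x, t = x0, 0.0
    while t < t_end - 1e-12:
        k1 = -theta * x
        k2 = -theta * (x + h / 2 * k1)
        k3 = -theta * (x + h / 2 * k2)
        k4 = -theta * (x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

def sse(theta):
    return sum((y - simulate(theta, t)) ** 2 for t, y in zip(ts, ys))

lo, hi, g = 0.1, 1.0, (math.sqrt(5) - 1) / 2        # golden-section search on theta
while hi - lo > 1e-4:
    m1, m2 = hi - g * (hi - lo), lo + g * (hi - lo)
    if sse(m1) < sse(m2):
        hi = m2
    else:
        lo = m1
theta_hat = (lo + hi) / 2
print(round(theta_hat, 3))
```

Even in this one-parameter toy, each candidate *θ* costs a full numerical solution of the ODE, which is the computational burden described above.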

A Bayesian approach, which may escape minor ripples in the optimization surface, is outlined in Gelman *et al.* (1996). This model uses a likelihood centred on the numerical solution to the differential equation. Since the system has no closed form solution, the posterior density for **θ**∣**y** has no closed form, and inference must be based on simulation from a Metropolis–Hastings algorithm or other sampler. At each iteration of the sampler, **θ** is proposed and the numerical approximation is used to compute the likelihood. Parallels between this approach and NLS mean that they share many of the same optimization problems. To fix this, the Bayesian model often requires strong finitely bounded priors. Extensions to this method are outlined in Campbell (2007).

#### 1.3.2. Collocation methods or basis function expansions

Our own approach belongs in the family of *collocation* methods that express the approximation of *x*_{i} in terms of a basis function expansion

*x̂*_{i}(*t*) = ∑_{k=1}^{K_{i}} *c*_{ik}*φ*_{ik}(*t*) = **c**_{i}^{′}**φ**_{i}(*t*),

where the number *K*_{i} of basis functions in vector **φ**_{i} is chosen to ensure enough flexibility to capture the variation in the approximated function *x*_{i} and its derivatives. Typically, this will require substantially more flexibility than is required to fit the data, since *x̂*_{i} and its derivative must also satisfy the differential equation to an extent that is considered acceptable. Although the original collocation methods used polynomial bases, spline basis systems are now preferred because they allow control over the smoothness of the solution at specific values of *t*, including discontinuities in first or higher order derivatives that are associated with step and point changes in the inputs **u**. Using a spline basis to approximate an initial value problem is equivalent to the use of an implicit Runge–Kutta method with stepping points located at the knots defining the basis (Deuflhard and Bornemann, 2000). For solving boundary value problems, collocation tries to satisfy system (1) at a discrete set of points, resulting in a large sparse system of non-linear equations which must then be solved numerically.
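As a toy instance (a monomial basis and the hypothetical test equation d*x*/d*t* = −*x*, *x*(0) = 1, rather than the splines used in practice), collocation reduces to a small linear system for the coefficients: impose the initial condition and require the expansion to satisfy the ODE at a discrete set of points.

```python
import math

# Collocation for dx/dt = -x, x(0) = 1 on [0, 1], with basis phi_k(t) = t^k,
# k = 0..5.  We impose x_hat(0) = 1 and  D x_hat(t_j) + x_hat(t_j) = 0  at
# five collocation points, giving a 6 x 6 linear system A c = rhs.
K = 6
pts = [j / 5 for j in range(1, 6)]

A = [[1.0] + [0.0] * (K - 1)]                      # row for x_hat(0) = c_0 = 1
rhs = [1.0] + [0.0] * len(pts)
for t in pts:
    A.append([k * t ** (k - 1) + t ** k for k in range(K)])

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            fac = M[r][i] / M[i][i]
            M[r] = [u - fac * v for u, v in zip(M[r], M[i])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

coef = solve(A, rhs)
x_hat_1 = sum(coef)                                # x_hat(1) = sum of coefficients
print(round(x_hat_1, 4), round(math.exp(-1), 4))
```

With a spline basis the same construction yields a sparse, banded system, which is what makes collocation practical for large problems.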

Collocation with spline bases was applied to dynamic data fitting problems by Varah (1982), who suggested a two-stage procedure in which each *x*_{i} is first estimated by data smoothing methods without considering expression (1), followed by the minimization with respect to **θ** of a least squares measure of the fit of d*x̂*_{i}/d*t* to *f*_{i}(**x̂**,**u**,*t*∣**θ**). The method is attractive when **f** is nearly linear in **θ** but non-linear in **x**. Varah's approach worked well for the simple equations that were considered, but considerable care was required in the smoothing step to ensure a satisfactory estimate of d*x̂*_{i}/d*t*, and the technique also required that all variables in the system be measured.

Ramsay and Silverman (2005) and Poyton *et al.* (2006) took Varah's method further by iterating the two steps, replacing the roughness penalty at each smoothing step by a penalty on departure from the differential equation using the previous iteration's minimizing value of **θ**. They found that this process, *iterated principal differential analysis*, converged quickly to estimates of both **x** and **θ** that had substantially improved bias and precision. However, iterated principal differential analysis is a joint estimation procedure in the sense that it optimizes a single roughness-penalized fitting criterion with respect to both **c** and **θ**, an aspect that will be discussed further in the next section.

Several procedures have attempted to solve the parameter estimation problem at the same time as computing a numerical solution to expression (1). Tjoa and Biegler (1991) proposed to combine a numerical solution of the collocation equations with an optimization over parameters to obtain a single constrained optimization problem; see also Arora and Biegler (2004). Similar ideas can be found in Bock (1983), where the *multiple shooting method* was proposed that breaks the time domain into a series of smaller intervals, over each of which system (1) is solved.

### 1.4. Overview of the paper

Our approach to fitting differential equation models is presented in Section 2, where we develop the concepts of estimating functions and a generalization of profiled estimation. Section 3 tests the method on simulated data for the FitzHugh–Nagumo and CSTR equations, and Section 4 estimates differential equation models for data drawn from chemical engineering and medicine. Generalizations of the method are discussed in Section 5.