**1. Batch methods** |

Gradient-based method | Levenberg-Marquardt (LM) is a gradient-based method used for parameter estimation in nonlinear models; it interpolates between gradient descent and the Gauss–Newton method. It provides a numerical solution to the problem of minimizing a generally nonlinear function over a space of parameters of the function. These minimization problems arise especially in least-squares curve fitting and nonlinear programming. Gradient-based algorithms follow identified descent directions within the parameter space. | These methods are highly efficient but are not well suited to high-dimensional nonlinear models, as they may converge to local rather than global minima. LM is often combined with a quasi-Monte Carlo algorithm to search for globally optimal values (Luo *et al.* 2003). Gradient-based methods require the calculation of the sensitivity of model output to model parameters to determine posterior uncertainty (Santaren *et al.* 2007). | Levenberg (1944), Marquardt (1963) |
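As a minimal illustration (not taken from the cited papers), the damped least-squares update at the heart of LM can be sketched in Python; the exponential model, the starting values and the damping schedule are all hypothetical choices for this sketch:

```python
import numpy as np

def levenberg_marquardt(f, jac, p0, x, y, n_iter=50, lam=1e-2):
    """Minimal Levenberg-Marquardt loop for nonlinear least squares.

    f(p, x) returns model predictions; jac(p, x) returns the Jacobian
    of the model with respect to the parameters p.
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        r = y - f(p, x)                      # residuals
        J = jac(p, x)                        # sensitivity of model output to parameters
        A = J.T @ J + lam * np.eye(len(p))   # damped normal equations
        step = np.linalg.solve(A, J.T @ r)
        p_new = p + step
        if np.sum((y - f(p_new, x)) ** 2) < np.sum(r ** 2):
            p, lam = p_new, lam * 0.5        # accept step, relax damping (toward Gauss-Newton)
        else:
            lam *= 2.0                       # reject step, increase damping (toward gradient descent)
    return p

# Hypothetical example: fit y = a * exp(b * x) to noise-free synthetic data
model = lambda p, x: p[0] * np.exp(p[1] * x)
jacobian = lambda p, x: np.column_stack([np.exp(p[1] * x),
                                         p[0] * x * np.exp(p[1] * x)])
x = np.linspace(0.0, 1.0, 20)
y = model([2.0, -1.5], x)
p_hat = levenberg_marquardt(model, jacobian, [1.0, 0.0], x, y)
```

The damping parameter `lam` is what distinguishes LM from plain Gauss-Newton: large values make the step resemble gradient descent, small values recover the faster Gauss-Newton step near the minimum.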

Non-gradient-based method | Non-gradient-based methods [also called ‘global search’ methods, e.g. genetic algorithms (GA), simulated annealing (SA) or Markov chain Monte Carlo (MCMC) methods] are often based on a random number generator (Braswell *et al.* 2005; Sacks *et al.* 2006), while GA and SA are typically applied to observations from a limited number of target variables. MCMC methods are a family of techniques that use Monte Carlo sampling to generate a discrete approximation of the posterior probability distribution of the parameter(s) that is/are sought for estimation. The Metropolis-Hastings sampler (Hastings 1970) is one of the more popular MCMC sampling algorithms in use. | The major advantage of global search methods is that they are able to treat all data simultaneously and, therefore, are more likely to discover the global minimum for the cost functions that possess multiple minima compared to gradient-based methods. The primary advantage of the Metropolis algorithm is to provide complete information concerning posterior distributions of parameters that can be used to generate standard errors and confidence intervals for both individual parameters and correlations between parameters. Markov chain Monte Carlo techniques iteratively produce a sample from a Bayesian posterior distribution of parameters until certain convergence criteria are met (Wang *et al.* 2009; Williams *et al.* 2009; Appendix S3). | Metropolis *et al.* (1953), Hastings (1970), Braswell *et al.* (2005), Sacks *et al.* (2006) |
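A random-walk Metropolis sampler (a simple symmetric-proposal case of Metropolis-Hastings) can be sketched in a few lines; the standard-normal target, step size and burn-in length below are illustrative assumptions, not values from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, theta0, n_samples=5000, step=0.5):
    """Random-walk Metropolis sampler for a scalar parameter.

    log_post evaluates the unnormalized log posterior of the parameter.
    """
    theta = theta0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        # Accept with probability min(1, posterior ratio); the symmetric
        # proposal makes the Hastings correction term vanish
        if np.log(rng.random()) < log_post(proposal) - log_post(theta):
            theta = proposal
        chain[i] = theta
    return chain

# Hypothetical target: a standard normal posterior, log p(theta) = -theta^2 / 2
chain = metropolis(lambda t: -0.5 * t ** 2, theta0=0.0)
burned = chain[1000:]   # discard burn-in before summarizing the posterior
```

The retained samples approximate the posterior, so standard errors and credible intervals come directly from sample statistics of `burned`.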

Variational data assimilation (VDA) | VDA methods operate in a batch-processing manner over a given time window that contains a sequence of observational time points. In weather forecasting, depending on the spatial and temporal dimensions of the state variables, VDA methods are primarily classified into three categories: one-dimensional (1D-Var), three-dimensional (3D-Var) and four-dimensional (4D-Var). In 3D-Var, only observations available at the time of analysis are used. In 4D-Var, past observations are also included, adding the time dimension. 4D-Var uses the tangent-linear and adjoint versions of the forecast model to estimate the four-dimensional atmospheric state that best fits the assimilated observations distributed over a specified time window (Gauthier *et al.* 2007). | VDA methods are computationally much less expensive than KF and EKF methods. In light of this, they are preferable for data assimilation with realistic, complex systems (e.g. a numerical weather prediction framework). In addition, by simultaneously using all observations inside the assimilation interval, VDA methods yield better estimates within the interval than KF and EKF methods, which are optimal only at the end of the interval. However, the VDA method itself does not provide any estimate of predictive uncertainty. The adjoint method calculates exact gradient information for the objective function being optimized. Important scientific advances such as 4D-Var and improvements in error specifications, in combination with a large increase in available observations, have led to considerable improvements in overall forecasting performance (Table S1). | Daley (1991), Kalnay (2003), Gauthier *et al.* (2007), Lorenc & Payne (2007) |
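The 4D-Var idea (minimize a cost over the initial state using the adjoint to get the gradient) can be sketched for a toy scalar linear model; every value here (the model coefficient, the error variances, the observations) is an invented illustration, not a realistic configuration:

```python
import numpy as np

# Toy scalar 4D-Var: model x_{k+1} = m * x_k, observations y_k of each x_k.
m, B, R = 0.9, 1.0, 0.1                # model coefficient, background and obs error variances
x_b = 0.5                              # background (prior) estimate of the initial state
y = np.array([1.0, 0.9, 0.81, 0.73])   # observations spread across the assimilation window

def cost_and_grad(x0):
    """4D-Var cost J(x0) and its gradient via the adjoint of the model."""
    ks = np.arange(len(y))
    xs = x0 * m ** ks                  # forward model run over the window
    J = 0.5 * (x0 - x_b) ** 2 / B + 0.5 * np.sum((xs - y) ** 2) / R
    # Adjoint: each observation misfit is propagated back to t0 by m**k
    g = (x0 - x_b) / B + np.sum((xs - y) * m ** ks) / R
    return J, g

# Minimize J over the initial condition with simple gradient descent
# (the cost is quadratic here, so this converges)
x0 = x_b
for _ in range(200):
    _, g = cost_and_grad(x0)
    x0 -= 0.01 * g
```

The analysed initial condition is pulled from the background (0.5) toward the value most consistent with all four observations at once, which is the within-window advantage over sequential filters described above.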

**2. Sequential methods [Kalman Filter (KF)]** |

Extended KF | The KF is a sequential method for estimating the state of a system. The EKF is the nonlinear version of the Kalman filter, obtained by linearizing the model about the current state estimate (Evensen 1992). | Unlike its linear counterpart, the EKF is not in general an optimal estimator. Another problem with the EKF is that the estimated covariance matrix tends to underestimate the true covariance matrix and, therefore, risks becoming statistically inconsistent without the addition of ‘stabilizing noise’. The EKF can yield unstable results when the nonlinearity in a complex model is strong (Evensen 1994), and its application can result in unbounded error growth as soon as a system enters an unstable regime. In addition, the enormous computational time required is a serious disadvantage of the EKF. The Ensemble Kalman Filter (EnKF) was introduced to overcome these drawbacks (Evensen 1994). | Kalman (1960), Evensen (1992, 1994) |
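The EKF's predict/update cycle for a scalar system can be sketched as follows; the transition function, noise variances and initial conditions are all made up for illustration, and the key point is that the covariance is propagated through the *linearized* model (the Jacobian `F_jac`):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):        # illustrative nonlinear state transition
    return 0.95 * x + 0.1 * np.sin(x)

def F_jac(x):    # Jacobian of f, evaluated at the current estimate
    return 0.95 + 0.1 * np.cos(x)

Q, R = 0.01, 0.1                 # process and observation noise variances
x_true, x_est, P = 2.0, 0.0, 1.0  # true state, filter estimate, error variance

for _ in range(50):
    # Simulate the true system and a noisy observation y = x + v
    x_true = f(x_true) + np.sqrt(Q) * rng.standard_normal()
    y = x_true + np.sqrt(R) * rng.standard_normal()
    # Predict: propagate the estimate through f and the variance through F_jac
    x_pred = f(x_est)
    P_pred = F_jac(x_est) ** 2 * P + Q
    # Update: Kalman gain weights the observation misfit
    K = P_pred / (P_pred + R)
    x_est = x_pred + K * (y - x_pred)
    P = (1 - K) * P_pred
```

Because `P` is propagated through a linearization rather than the true nonlinear dynamics, it can understate the real error, which is the covariance-underestimation issue noted above.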

Ensemble KF | The EnKF is a Monte Carlo approximation of the KF that represents error statistics with an ensemble of model states, yielding near-optimal estimates for strongly nonlinear dynamical systems with Gaussian error statistics (Evensen 1994). | EnKF is suitable for problems that possess a large number of variables, such as discretizations of partial differential equations in geophysical models (Evensen 1994, 2007). One advantage of the EnKF is that advancing the probability density function in time is achieved simply by advancing each member of the ensemble. | Evensen (2003, 2007) |
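A single EnKF analysis step for a scalar state can be sketched as below (the stochastic, perturbed-observation variant); the prior, observation and ensemble size are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)

# EnKF analysis step (illustrative, scalar state observed directly).
N = 500                                 # ensemble size
ensemble = rng.normal(0.0, 1.0, N)      # forecast ensemble: prior ~ N(0, 1)
R = 0.5                                 # observation error variance
y = 1.0                                 # a single observation of the state

# The sample variance of the forecast ensemble stands in for the
# forecast error covariance that the KF would propagate explicitly
P_f = np.var(ensemble, ddof=1)
K = P_f / (P_f + R)                     # Kalman gain from ensemble statistics

# Update each member against its own perturbed copy of the observation,
# so the analysis ensemble has the correct posterior spread
perturbed_y = y + np.sqrt(R) * rng.standard_normal(N)
analysis = ensemble + K * (perturbed_y - ensemble)
```

For this linear-Gaussian toy case the analysis ensemble mean and variance should approach the exact Bayesian posterior (mean 2/3, variance 1/3) as the ensemble grows, which is the sense in which the EnKF approximates the KF.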

Particle filtering (PF) | PF, also known as the sequential Monte Carlo method or bootstrap filtering, is another commonly used data assimilation algorithm for the recursive estimation of model states (Arulampalam *et al.* 2002). It is typically used to estimate Bayesian models and is the sequential (online) analogue of Markov chain Monte Carlo (MCMC) batch methods; it is closely related to importance sampling methods. | A well-designed PF can often run much faster than MCMC. PF is typically used as an alternative to the EKF, with the advantage that, given a sufficient number of samples, it approaches the Bayesian optimal estimate and can therefore achieve greater accuracy than the EKF. PF carries out updates on particle weights instead of state variables. In addition, PF has the desirable characteristic of being applicable to any state-space model, whether linear or nonlinear, Gaussian or non-Gaussian. | Arulampalam *et al.* (2002) |
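One bootstrap-PF assimilation step (likelihood weighting followed by systematic resampling) can be sketched as follows; the Gaussian prior, observation and particle count are illustrative assumptions, and note that the update acts on the weights, not on the particle states themselves:

```python
import numpy as np

rng = np.random.default_rng(3)

# Bootstrap particle filter step: weight update + systematic resampling.
N = 1000
particles = rng.normal(0.0, 1.0, N)     # prior samples of the state
R = 0.5                                 # observation error variance
y = 1.0                                 # observation, y = x + noise

# Weight each particle by the likelihood of the observation given its state
# (log-space with max subtraction for numerical stability)
log_w = -0.5 * (y - particles) ** 2 / R
w = np.exp(log_w - log_w.max())
w /= w.sum()

# The posterior mean is the weighted average of the particles
post_mean = np.sum(w * particles)

# Systematic resampling: duplicate high-weight particles, drop low-weight
# ones, returning to an equally weighted ensemble for the next step
positions = (np.arange(N) + rng.random()) / N
resampled = particles[np.searchsorted(np.cumsum(w), positions)]
```

No Gaussian assumption was needed anywhere above: any computable likelihood works in the weighting step, which is the source of the PF's generality relative to the EKF.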