Online state and time‐varying parameter estimation using the implicit equal‐weights particle filter

A method is proposed for resilient and efficient estimation of states and time-varying parameters in nonlinear high-dimensional systems through sequential data assimilation. The importance of estimating time-varying parameters lies not only in improving prediction accuracy but also in determining when model characteristics change. We propose a particle-filter-based method that exploits the flexibility of the proposal density in particle filtering to incorporate nudging techniques inspired by optimization algorithms in machine learning. However, as the model resolution and the number of observations increase, filter degeneracy becomes the main obstacle to implementing a particle filter. The proposed method is therefore combined with the implicit equal-weights particle filter (IEWPF), in which all particle weights are equal. The method is validated using a 1000-dimensional linear model with an additive parameter and a 1000-dimensional Lorenz-96 model in which the forcing term is parameterized. The method is shown to be capable of resilient and efficient estimation of parameters that change over time in our application with a linear observation operator, which suggests that it applies to realistic geophysical, climate, and other problems.


INTRODUCTION
Online parameter estimation is the process of inferring values that are included in numerical models as unobservable quantities, using sequentially collected observations. Since such parameters are simplified representations of the modeled characteristics, parameter estimation plays an important role in obtaining accurate and reliable predictions. There are several approaches to parameter estimation, such as using an optimization algorithm with given state variables in the model, or using data assimilation (DA) techniques (Evensen et al., 2022).
DA is the procedure of incorporating observations into numerical models to obtain posteriors of the state variables, especially in high-dimensional dynamical systems. Although DA usually focuses on generating an optimal initial state and forecasting the temporal evolution of millions of time-varying state variables (Clayton et al., 2013), parameter estimation is often combined with it to calibrate the models (i.e., to estimate the appropriate model characteristics). Parameter estimation is therefore key to improving prediction accuracy, and it is as complex as state estimation because of nonlinearities, even for linear dynamical models (Evensen et al., 1998).
Further, parameters can be considered not only static but also time-variant. For example, in hydrological modeling, parameters are usually assumed constant and are calibrated against a particular data record to obtain an optimal parameter set or stationary parameter distributions. Even so, time-variant parameters are necessary to simulate state variables accurately when the calibration period contains climate conditions and hydrological regimes different from those of the simulation period (Deng et al., 2016). As another example, according to Zhu et al. (2017), state and parameter estimation plays an important role in process monitoring, online optimization, and process control. The difficulty in these applications is identifying changes in model parameters when the operating conditions of the processing system have changed or faults have occurred in it. These examples show that estimating time-varying parameters is important not only for improving prediction accuracy but also for determining when model characteristics change abruptly. The challenge, however, is to distinguish whether the cause of an inaccuracy is an incorrectly estimated state or a change in the model characteristics (i.e., the parameters).
A typical method for time-varying state and parameter estimation in high-dimensional dynamical systems is the state augmentation technique, in which the parameter vector is incorporated into the state vector. This technique is also called joint estimation. Generally, Kalman-filter-based methods are used for linear Gaussian systems, whilst particle-filter (PF) based methods can be applied to nonlinear non-Gaussian systems. Santitissadeekorn and Jones (2015) indicate that the state augmentation method may become ineffective when the impact of the parameters on the state is weak, and they propose a two-stage filter that combines a PF and an ensemble Kalman filter. This method alternately estimates the static parameters and tracks the dynamic variables. Although similar approaches using an independent dual PF (Cooper & Perez, 2018) and a nested hybrid filter (Pérez-Vieites et al., 2018) have been proposed, they are only applicable to the estimation of static parameters. Extension to time-varying parameters requires identifying whether a change in the observed states originates from the state variables or from the parameters, and its practicality depends on the cross-covariance between states and parameters. In particular, detecting abrupt changes in characteristics in high-dimensional, partially observed nonlinear systems may be problematic because of the relatively low correlation between the observed state and the parameters.
Another issue concerns nonlinearities due to the temporal evolution of the system and the augmented state vector. As in the PF example above, parameter estimation methods combined with a PF can deal with nonlinearities, but filter degeneracy can be a critical obstacle in high-dimensional systems such as geophysical and climate systems. To overcome this problem, several approaches have been proposed, including the PF hybridized with the ensemble Kalman filter (EnKF; Santitissadeekorn & Jones, 2015) mentioned above. The equivalent-weights particle filter (EWPF; e.g., Van Leeuwen, 2010; Ades & Van Leeuwen, 2015) allows the proposal density to depend on all particles at the previous time step and assigns equivalent weights to most particles to avoid filter degeneracy. Zhu et al. (2016) proposed the implicit equal-weights particle filter (IEWPF), which combines the EWPF with implicit sampling (Chorin & Tu, 2009) to eliminate the need for parameter tuning. Skauvold et al. (2019) proposed a two-stage IEWPF to correct the systematic bias in predictions caused by a gap in the proposal distribution of the IEWPF (Zhu et al., 2016). Other approaches to avoiding filter degeneracy are reviewed in Van Leeuwen et al. (2019). However, the above methods focus on estimating state variables or constant parameters.
In this article, we focus on a nonlinear time-varying system in which the dimension of the state vector is large while that of the model parameters is comparatively small, with a view to applications in geophysical, climate, and other high-dimensional contexts. We propose a new PF-based parameter estimation method and assess its capability to detect abrupt changes in characteristics by applying it to such a system. We provide a methodology and results based on the IEWPF of Zhu et al. (2016) as an example of avoiding filter degeneracy. In our application, we assume a linear observation operator and require partial derivatives with respect to the parameters, depending on the parameter dimension, although the methodology also applies to nonlinear observation operators and can work with approximate derivatives.
The remainder of the article is organized as follows. Section 2 describes the methodology for estimating time-varying parameters. First, to estimate states and parameters simultaneously, we extend the IEWPF to an augmented state-space model with a correlated covariance matrix. We then propose an IEWPF-based method that incorporates an optimization algorithm from machine learning into the parameter time-evolution model by taking advantage of the flexibility of the proposal density in particle filtering. In Section 3, the effectiveness and advantages of the proposed method are evaluated, through comparison with a method that does not incorporate an optimization technique, using the linear model and the Lorenz-96 model (Lorenz, 1996). A summary and conclusions are given in Section 4.

Correlated perturbation in augmented state-space model
A typical state-space model for a nonlinear system containing model parameters is described as

$$x^n = f(x^{n-1}, \theta^{n-1}) + \beta^n, \qquad y^n = H_x(x^n) + \epsilon^n,$$

where $x^n$ is the state vector at time step n and $y^n$ is the observation vector at time step n. f is the known, possibly nonlinear function that maps the state from time $t^{n-1}$ to $t^n$, and $H_x$ is the known nonlinear observation operator. $\theta$ is the vector of model parameters, whose true values are unknown and possibly time-varying. $\beta^n$ is a random model perturbation drawn from the model-error probability density function (pdf) $N(0, Q_\beta)$, while the observation error $\epsilon^n$ is drawn from the observation-error pdf $N(0, R)$.
To estimate time-varying parameters sequentially, the state vector is updated according to the following dynamical system, obtained by augmenting the parameters as artificial states:

$$x^n = f(x^{n-1}, \theta^{n-1}) + \beta^n, \qquad \theta^n = \theta^{n-1} + \eta^n.$$

Here, $\eta^n$ is a random parameter perturbation drawn from the pdf $N(0, Q_\eta)$, and we require that f be differentiable with respect to the parameter. The state update function f can then be approximated by a first-order Taylor series expansion about the previous parameter $\theta^{n-2}$:

$$f(x^{n-1}, \theta^{n-1}) \approx f(x^{n-1}, \theta^{n-2}) + \left.\frac{\partial f}{\partial \theta}\right|_{\theta^{n-2}} \left(\theta^{n-1} - \theta^{n-2}\right).$$

Then, by using the time-evolution model at the previous time step n − 1, $\theta^{n-1} = \theta^{n-2} + \eta^{n-1}$, we can rewrite Equation 2 as

$$z^n = \tilde{f}(z^{n-1}) + \rho^n,$$

where we introduce the augmented vector $z^n = [x^{nT}, \theta^{(n-1)T}]^T$, the augmented model $\tilde{f}$, and the augmented perturbation $\rho^n$.
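As a concrete sketch of this augmented propagation, the snippet below advances $z = [x, \theta]$ one step, with the parameter following a random walk and the state picking up the linearized parameter impact. The dynamics `f`, its parameter Jacobian `df_dtheta`, and the noise amplitudes are illustrative stand-ins, not the paper's models.

```python
import numpy as np

# Illustrative stand-ins for the dynamics f(x, theta) and its parameter
# Jacobian df/dtheta; the paper's models are different.
def f(x, theta):
    return 0.95 * x + 0.1 * np.sum(theta)

def df_dtheta(x, theta):
    return np.full((x.size, theta.size), 0.1)

def augmented_step(z, n_x, rng):
    """Propagate the augmented vector z = [x, theta] one step: the parameter
    follows a random walk, and the state picks up the linearized parameter
    impact (df/dtheta) @ eta from the first-order Taylor expansion."""
    x, theta = z[:n_x], z[n_x:]
    eta = 0.01 * rng.standard_normal(theta.size)    # parameter perturbation
    beta = 0.1 * rng.standard_normal(x.size)        # model perturbation
    x_new = f(x, theta) + df_dtheta(x, theta) @ eta + beta
    return np.concatenate([x_new, theta + eta])

rng = np.random.default_rng(0)
z_next = augmented_step(np.concatenate([np.ones(4), [2.0]]), n_x=4, rng=rng)
```

The correlation between the state and parameter blocks of the augmented perturbation arises precisely from the shared `eta` term.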
We also rewrite the observation operator $H_x$ in Equation 1 as $y^n = H(z^n) + \epsilon^n$. The augmented perturbation $\rho^n = [\beta'^{nT}, \eta^{(n-1)T}]^T$ can be drawn from the error pdf $N(0, Q^n)$, where $\beta'^n = (\partial f / \partial \theta)\, \eta^{n-1} + \beta^n$. Since the model perturbation $\beta$ and the parameter perturbation $\eta$ are independent of each other and both have zero mean, each block of the covariance matrix in Equation 7 can be calculated, and Equation 7 can then be expressed as

$$Q = \begin{pmatrix} Q_\beta + \frac{\partial f}{\partial \theta} Q_\eta \frac{\partial f}{\partial \theta}^T & \frac{\partial f}{\partial \theta} Q_\eta \\ Q_\eta \frac{\partial f}{\partial \theta}^T & Q_\eta \end{pmatrix}.$$

Note that the Taylor expansion in Equation 3 is used up to the first-order term, so the augmented perturbation $\rho$ from Q includes the linear impact of the parameters on the model evolution over one time step.

State and parameter update with IEWPF
In this section, we explain how to apply the IEWPF to the update equation (Equation 5) and how to avoid filter degeneracy. Considering a Markovian system with observational errors that are independent from one time to another, the prior pdf can be written as in Equation 12. Plugging Equation 12 into Bayes' theorem as a prior, the posterior pdf of the model state given the observations can be written as in Equation 13. Suppose we run a particle filter and the particle weight for the ensemble at the previous time step n − 1 is given by Equation 14. Plugging Equation 14 into Equation 13, we obtain Equation 15. Introducing the proposal density $q(z^n | Z^{n-1}, y^n)$, which is conditioned on all particles at time n − 1 (indicated by $Z^{n-1}$), Equation 15 can be expressed as Equation 16. The well-known problem of filter degeneracy means that the weight concentrates on only a few particles, and most particles have negligible weight after a few propagations. Snyder et al. (2015) showed that the particle filter using the optimal proposal yields minimal degeneracy and provides performance bounds. Degeneracy can be a serious obstacle to implementing the particle filter when the number of states and observations increases, that is, in a high-dimensional system. Therefore, we use the IEWPF (Zhu et al., 2016), which avoids this problem. From Equation 14, Equation 16 can be expressed as Equation 17, where $w_i$ is the weight for particle i, expressed using the proposal density of Equation 16 as in Equation 18. Instead of drawing directly from the proposal density q, we can draw $\xi$ from a standard Gaussian distributed proposal density $q(\xi)$, which is related by

$$q(z^n | Z^{n-1}, y^n) = \frac{q(\xi)}{\left\| \mathrm{d}z / \mathrm{d}\xi \right\|},$$

where $\| \mathrm{d}z / \mathrm{d}\xi \|$ denotes the absolute value of the determinant of the Jacobian matrix of the following transformation (Equation 20):

$$z_i^n = z_i^{a,n} + \alpha_i^{1/2} P^{1/2} \xi_i,$$

in which $z_i^{a,n}$ expresses the mode of $q(z^n | Z^{n-1}, y^n)$, P is a measure of the width of that pdf, and $\alpha_i$ is a scalar factor. Note that this expression is similar to the original IEWPF (Zhu et al., 2016), but $z_i^n$ here denotes the augmented vector $z^n = [x^{nT}, \theta^{(n-1)T}]^T$, which means that the transformed variable $\xi$ also has the dimension of the augmented vector. Then, Equation 18 can be expressed in terms of $\xi$ as Equation 21. In general, $z_i^{a,n}$ can be obtained via a minimization of $-\log q(z^n | Z^{n-1}, y^n)$, similar to, for example, 3DVar, and the equal weights can also be obtained numerically. In this article, we follow Zhu et al. (2016) and assume a linear observation operator, which allows an analytical solution for the equal weights.

Linear observation model and Gaussian error
Assuming the linear observation model H and Gaussian model and observation errors as shown in Equations 5 and 6, $z_i^{a,n}$ in Equation 20 can be expressed as explained in Zhu et al. (2016):

$$z_i^{a,n} = \tilde{f}(z_i^{n-1}) + Q H^T \left( H Q H^T + R \right)^{-1} \left( y^n - H \tilde{f}(z_i^{n-1}) \right),$$

and P in Equation 20 is

$$P = \left( Q^{-1} + H^T R^{-1} H \right)^{-1}.$$

Note that Q is the model-error covariance matrix described in Equation 11 and R is the observation-error covariance matrix. Therefore, from Equations 20-22, an equal-weight particle $z_i$ sampled from the posterior pdf of Equation 16 can be constructed using the scalar factor $\alpha_i$.
The factor $\alpha_i$ needs to be determined such that the weight of each particle i, represented by Equation 21, equals the same target weight for all particles. Introducing $w_i^{prev}$, which denotes the weight from previous time steps, we can express Equation 21 as Equation 25. With the above Gaussian assumption, we can write Equations 26 and 27. Taking the logarithm of Equation 25 leads to Equation 28. Substituting Equations 26 and 20 into Equation 28, we find Equation 29. Using Equation 20 and the simplified expression for the Jacobian in Zhu et al. (2016), we can rewrite this as Equation 30, where $N_x$ is the dimension of the model state. Setting the weights of all particles to the target weight is equivalent to setting all $\log w_i$ equal to a constant C, which leads to an equation for $\alpha_i$ (Equation 31). Let $c_i$ denote the log-weight offset of each particle i from the target weight C (Equation 32). In practice, this $c_i$ can be determined using the values of $\xi$ for all particles (Equation 33). Therefore, $\alpha_i$ is obtained as a solution satisfying Equation 31, with $c_i$ determined by Equation 33. Further, the factor $\alpha_i$ depends on $\xi_i^n$ only through $g_i = \xi_i^{nT} \xi_i^n$ (see Appendix in Zhu et al., 2016). For every particle to reach the target weight, $c_i \geq 0$ should be satisfied; therefore $0 < \exp(-c_i/2) \leq 1$ in Equation 34. Furthermore, since the left-hand-side function $\exp(-\alpha_i g_i/2)(\alpha_i g_i)^{N_x/2-1}$ has an extremum, following Zhu et al. (2016) Equation 34 can be integrated from $N_x/2$ to $\infty$, which yields Equation 35, where $\Gamma$ is the monotonically decreasing upper incomplete gamma function. Therefore, the solution $\alpha_i$ for every particle i that satisfies Equation 35 theoretically admits both $\alpha_i \leq 1$ and $\alpha_i \geq 1$. Although $\alpha_i \geq 1$ solutions are known to lead to a systematic bias (Zhu et al., 2016), the bias decreases as the state-space dimension $N_x$ increases, that is, in the high-dimensional case. As an alternative, Skauvold et al. (2019) proposed the two-stage IEWPF, which eliminates this bias.
In practice, the following points should be considered when generating the posterior distribution by calculating $\alpha_i$ to satisfy Equation 35. The first point is the computational cost of finding $\alpha_i$ numerically for each particle. To avoid this calculation, Zhu et al. (2016) proposed an approximation in the limiting case $N_x \to \infty$. The solution $\alpha$ can then be expressed analytically using the Lambert W function (Corless et al., 1996), which has two branches: $\alpha > 1$, which gives a large ensemble spread, and $\alpha < 1$, which gives the opposite effect. The authors proposed adjusting the ratio with which $\alpha_i$ is sampled from either branch in order to bring the shape of the distribution closer to the ideal one. The results of this $\alpha$ dependence will be shown later. The second point is the guarantee of convergence to the posterior distribution. The IEWPF can equalize the weights of all particles, but the convergence of the filter distribution to the posterior distribution was only confirmed experimentally by Zhu et al. (2016), not shown theoretically.
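To make the two-branch structure concrete, the sketch below finds $\alpha$ on both sides of the extremum of $\exp(-\alpha g_i/2)(\alpha g_i)^{N_x/2-1}$ by bracketed root-finding in log space. The target constant used here ($\log\phi(1) - c_i/2$) is an illustrative stand-in for the exact right-hand side of Equation 35, which is not reproduced in full above; the point illustrated is only the existence of a solution on each branch.

```python
import numpy as np
from scipy.optimize import brentq

def log_phi(alpha, g, n_x):
    # log of exp(-alpha*g/2) * (alpha*g)**(n_x/2 - 1), kept in log space
    # for numerical stability at large n_x
    return -alpha * g / 2 + (n_x / 2 - 1) * np.log(alpha * g)

def alpha_branches(g, c_i, n_x):
    """Solve log_phi(alpha) = log_phi(1) - c_i/2 on both sides of the
    extremum at alpha = (n_x - 2)/g. The target constant is a stand-in
    for the paper's exact equal-weight condition; brackets assume the
    roots lie in (1e-8, a_max) and (a_max, 50)."""
    target = log_phi(1.0, g, n_x) - c_i / 2
    a_max = (n_x - 2) / g                      # location of the extremum
    h = lambda a: log_phi(a, g, n_x) - target
    lo = brentq(h, 1e-8, a_max)                # alpha <= a_max branch
    hi = brentq(h, a_max, 50.0)                # alpha >= a_max branch
    return lo, hi
```

For a typical draw, $g_i = \xi_i^T\xi_i \approx N_x$, so both roots lie near 1, consistent with the small spread adjustments the branch choice produces.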

Parameter nudging with proposal density
The effectiveness of the method proposed in the previous section, which augments parameters as artificial states, depends on the cross-covariance between states and parameters. To improve the accuracy and resilience of time-varying parameter estimation, we introduce an optimization algorithm from machine learning into the parameter time-evolution model, using the flexibility of the proposal density in particle filtering. According to Equation 11, the model transition density is expressed as $p(z^n | z^{n-1}) = N(\tilde{f}(z^{n-1}), Q)$. The prior pdf expressed in Equation 12 can be both multiplied and divided by a proposal transition density q, leading to Equation 37. Drawing from $p(z^n | z^{n-1})$ corresponds to using the original model transition density (Equation 36), but we could instead draw from $q(z^n | Z^{n-1}, y^n)$, which corresponds to any other model transition that we choose. This allows us to control the transition of both states and parameters by choosing the proposal density q.
Sequential observation data can be considered as samples for a stochastic gradient descent (SGD) algorithm, based on the similarity between sequential DA and online learning or stochastic optimization: in both, the data arrive sequentially. Ideas in stochastic optimization have advanced in recent years in machine learning and deep learning with large-scale data. The basic classification of problem structures and the associated solutions are summarized in Hannah (2015), and the effectiveness of SGD for large-scale learning problems, that is, cases with large-scale data, is described in Bottou (2010). The optimization algorithm used in the proposed method is described in the next section. Assume an objective function $L_i^n(\theta)$ and consider the problem of finding the parameter $\theta$ that minimizes it. The parameter $\theta^n$ can be updated by the iteration

$$\theta^n = \theta^{n-1} - \lambda\, g^n(\theta^{n-1}),$$

where $\lambda$ is the step size, sometimes called the learning rate in machine-learning contexts, and the function $g^n$ expresses the update rule for the parameter.
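The iteration above can be sketched in a few lines; the quadratic objective is purely illustrative, not the paper's likelihood.

```python
# Minimal SGD sketch of the iteration theta^n = theta^{n-1} - lambda * g^n,
# on an illustrative quadratic objective.
def sgd(grad, theta0, step_size, n_iter):
    theta = float(theta0)
    for _ in range(n_iter):
        theta -= step_size * grad(theta)    # one update per sequential sample
    return theta

# Example: minimize L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta_hat = sgd(lambda t: 2.0 * (t - 3.0), theta0=0.0, step_size=0.1, n_iter=100)
```

In the DA analogy, each arriving observation supplies the gradient evaluation for one such step.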
Here, we introduce the above parameter-update analogy into the modification of the transition density. At the step after the last observation n, that is, n + 1, we assume that, instead of the original transition density of Equation 12, the proposal density q at time step n + 1 for the augmented state z can be described as

$$q(z^{n+1} | Z^n, y^n) = N\!\left( \tilde{f}(z^n) + \lambda\, \tilde{g}^n,\; Q^n \right),$$

where $\tilde{g}^n$ denotes the augmented nudging term. The step size $\lambda$ and the function $g(\theta_i^{n-1}, y^n)$ thus play the same role as in Equation 38 and together express the nudging term that forces the estimated model parameters towards the true values; $y^n$ is the last observed data vector.

$Q^n$ is the same augmented model-error covariance matrix as described in Equation 11, with correlated perturbation. The update of the augmented state vector after the last observation step n is then given as follows, instead of the original update expressed in Equation 5:

$$z^{n+1} = \tilde{f}(z^n) + \lambda\, \tilde{g}^n + \rho^{n+1}.$$

This corresponds only to a modification of the augmented perturbation $\rho^{n+1}$, which shifts the mean value of the parameters. Note that sampling from this proposal transition density instead of the original model is compensated by an extra weight, as described in Ades and Van Leeuwen (2015):

$$w_i^{extra} = \frac{p(z_i^{n+1} | z_i^n)}{q(z_i^{n+1} | Z^n, y^n)}.$$
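A minimal sketch of drawing from the nudged Gaussian proposal and accumulating the corresponding extra log-weight, assuming the model transition and the proposal share the covariance Q (given by its Cholesky factor); this is a simplification of the scheme, not the full algorithm.

```python
import numpy as np

def nudged_draw(mean_model, nudge, Q_chol, rng):
    """Draw from the nudged proposal N(mean_model + nudge, Q) and return the
    sample with its log weight correction log p(z | z_prev) - log q(z | ...),
    assuming transition p and proposal q share Q = Q_chol @ Q_chol.T."""
    z = mean_model + nudge + Q_chol @ rng.standard_normal(mean_model.size)
    r = np.linalg.solve(Q_chol, z - mean_model)           # whitened residual under p
    s = np.linalg.solve(Q_chol, z - mean_model - nudge)   # whitened residual under q
    return z, -0.5 * (r @ r - s @ s)

rng = np.random.default_rng(1)
L = 0.5 * np.eye(3)                    # Cholesky factor of Q = 0.25 * I
nudge = np.array([0.0, 0.0, 0.2])      # shifts only the parameter block
z, log_w_extra = nudged_draw(np.zeros(3), nudge, L, rng)
```

With a zero nudge the two densities coincide and the extra log-weight is exactly zero, as expected.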

Adam-method-based parameter nudging
As mentioned above, we introduce a nudging term for the parameters by taking advantage of the flexibility of the proposal density in particle filtering. One of the main points of this article is that we can choose any term that forces the parameters toward the true values. Therefore, our scheme is combined with a well-known gradient-descent optimization algorithm of the kind that has evolved in recent years as deep learning has progressed (Alom et al., 2018). In general, a task in machine learning and deep learning is often expressed as the problem of finding parameters that minimize (or maximize) an objective function, and the key is how quickly the optimal parameters can be found. Typical optimization formulations and algorithms are summarized in Sun et al. (2019).
Regarding gradient-based optimization algorithms, Ruder (2016) gives a classification and describes typical examples. Momentum-based algorithms accumulate a decaying sum of the previous gradients into a momentum vector and use it instead of the true gradient. This accelerates optimization along dimensions where the gradient remains relatively consistent and slows it along turbulent dimensions where the gradient oscillates significantly. Another approach is norm-based algorithms, which divide the gradient by the L2 norm of all previous gradients. This slows updates along dimensions that have already changed greatly and accelerates them along dimensions that have changed only slightly. In our method, we use adaptive moment estimation (Adam), proposed by Kingma and Ba (2014), which combines these two approaches.
Our proposed formulation of the function $g(\theta_i^{n-1}, y^n)$ for the parameter nudging term in Equation 39 is as follows. First, $\tilde{f}(z_i^{n-1})$ can be regarded as the expected value of $z_i^n$ given $z_i^{n-1}$ (Equation 43). Next, we choose the log-likelihood of $p(y^n | z_i^n)$ as the aforementioned objective function $L_i^n$ in Equation 38 (Equation 44). Equation 44 can be calculated from the likelihood with respect to the observed value $y^n$ at observation step n and ensemble member i, given $z_i^n$ (Equation 45). We then define the function $g(\theta_i^{n-1}, y^n)$ in Equation 39 by using the gradient of the objective function $L_i^n$. Following Kingma and Ba (2014), we introduce the moving averages of the gradient and the squared gradient, denoted $m_i^n$ and $v_i^n$, respectively. Their update equations in terms of the gradient of $L_i^n$ are

$$m_i^n = \beta_m m_i^{n-1} + (1 - \beta_m)\, \nabla_\theta L_i^n, \qquad v_i^n = \beta_v v_i^{n-1} + (1 - \beta_v) \left( \nabla_\theta L_i^n \right)^2,$$

where the hyperparameters $\beta_m$ and $\beta_v$ control the decay rates of these moving averages. Note that the gradient $\nabla_\theta L_i^n$ requires computing the partial derivatives of the likelihood in Equation 45 with respect to the parameters, or an approximation thereof. Since these moving averages are initialized as vectors of zeros, the moment estimates are biased toward zero, especially during the initial time steps and especially when the decay rates are low (i.e., $\beta_m$ and $\beta_v$ are chosen close to 1). Therefore, $m_i^n$ and $v_i^n$ in Equation 46 are modified as follows to cancel these biases:

$$\hat{m}_i^n = \frac{m_i^n}{1 - \beta_m^n}, \qquad \hat{v}_i^n = \frac{v_i^n}{1 - \beta_v^n}.$$

Finally, the function $g(\theta_i^{n-1}, y^n)$ in Equation 39 is given by

$$g(\theta_i^{n-1}, y^n) = \frac{\hat{m}_i^n}{\sqrt{\hat{v}_i^n} + \epsilon}.$$

Here, the factor $\sqrt{\hat{v}_i^n}$ represents the L2 norm of the past gradients, via the $v_i^{n-1}$ term and the current gradient in Equation 46, and scales the gradient. Note that $\epsilon$ is a factor to avoid division by zero and is set to $1.0 \times 10^{-8}$ in the following experiments.
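The moment updates and bias corrections above can be sketched as one reusable Adam step; the quadratic example objective is purely illustrative, standing in for the likelihood-based $L_i^n$.

```python
import numpy as np

def adam_nudge(grad, m, v, t, lam=0.001, beta_m=0.9, beta_v=0.999, eps=1e-8):
    """One Adam step for the parameter nudging term.

    beta_m and beta_v are the decay rates (beta_1, beta_2 in Kingma & Ba,
    2014); eps avoids division by zero. Returns the increment that plays
    the role of -lam * g(theta, y), plus the updated moment estimates."""
    m = beta_m * m + (1 - beta_m) * grad           # first moment (mean)
    v = beta_v * v + (1 - beta_v) * grad ** 2      # second moment (raw variance)
    m_hat = m / (1 - beta_m ** t)                  # bias correction
    v_hat = v / (1 - beta_v ** t)
    return -lam * m_hat / (np.sqrt(v_hat) + eps), m, v

# Illustrative use on L(theta) = (theta - 3)^2; the paper's objective is
# the likelihood of the observations, not this toy quadratic.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2.0 * (theta - 3.0)                        # gradient of the toy objective
    d, m, v = adam_nudge(g, m, v, t, lam=0.01)
    theta += d
```

The per-step increment is bounded by roughly the step size `lam`, which is what makes the nudging term well behaved even when individual likelihood gradients are large.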
The proposed method contains two observation-dependent procedures: (1) state and parameter update by IEWPF, together with computation of the likelihood gradient, at the observation step; and (2) parameter nudging with the proposal density between observations. The algorithm is summarized as follows:
(1) State and parameter update at the observation step: update the augmented state and parameters with the IEWPF, and compute the parameter nudging term from Equations 46-48 using the hyperparameters $\beta_m$, $\beta_v$ and the step-size factor $\lambda$.
(2) Parameter nudging at the forecast step, at time step t + 1 (the step after an observation), for each particle i:
- Generate the parameter perturbation using the computed parameter nudging term $g(\theta_i^{t-1}, y^t)$ from Equation 41.
- Compute the extra weight in Equation 42.

NUMERICAL EXPERIMENTS
The effectiveness of the proposed method is demonstrated through two synthetic test cases. The first case is a linear model with an additive parameter, in which all model states are observed directly at every time step. Although this article focuses on nonlinear systems, we use a linear model to verify that the shape of the posterior pdf is close to the true one. The second case is the Lorenz-96 model (Lorenz, 1996) with parameterized forcing, in which the model states are observed directly only at every fourth step.

Linear model with an unknown parameter
In order to compare the estimates of the proposed method with analytically calculated true values, we use the following linear model as the time evolution expressed in Equation 2:

$$x^n = F_x x^{n-1} + F_\theta \theta^{n-1} + \beta^n, \qquad \theta^n = \theta^{n-1} + \eta^n,$$

where $x \in \mathbb{R}^{N_x}$ is the model state vector with dimension $N_x$ and $\theta \in \mathbb{R}^{N_\theta}$ is the parameter vector with dimension $N_\theta$. $\beta$ and $\eta$ are random perturbations drawn from the model-error pdf $N(0, Q_\beta)$ and the parameter-error pdf $N(0, Q_\eta)$, respectively. The matrices $F_x \in \mathbb{R}^{N_x \times N_x}$ and $F_\theta \in \mathbb{R}^{N_x \times N_\theta}$ represent the linear model. Here, we define the matrices F and G as in Equation 50; then Equation 49 can be rewritten by using Equation 4 as

$$z^n = F z^{n-1} + \rho^n.$$

When the initial prior pdf is Gaussian, the true posterior pdf is also Gaussian. Assuming that the posterior pdf at time n − 1 is Gaussian with covariance matrix $P^{n-1|n-1}$, the predicted covariance matrix $P^{n|n-1}$ of the prior pdf expressed in Equation 51 can be calculated as

$$P^{n|n-1} = F P^{n-1|n-1} F^T + Q,$$

where Q is equivalent to Equation 11 when using the linear model F defined in Equations 50 and 51.
In the following experiments, we choose the dimension of the model state $N_x = 1000$ and of the parameter $N_\theta = 1$, in order to consider a simple high-dimensional system with a parameter. Setting $F_x = I$ and $F_\theta = 0.1$, the time-evolution model described in Equation 51 and the observation model become

$$x_k^n = x_k^{n-1} + 0.1\, \theta^{n-1} + \beta_k^n, \qquad y^n = H z^n + \epsilon^n,$$

where the index $k = 1, \ldots, N_x$ indicates the elements of the model state x. Here, the observation model is $H = (I\ \ 0)$, assuming that all state variables are observed, and $\epsilon$ is the observation error drawn from the observation-error pdf $N(0, R)$.
Since we assume a time-independent state transition matrix F, the covariance matrix of the linear system defined by Equation 54 converges to the steady-state matrix P such that $P^{n|n-1} = P^{n-1|n-2} \equiv P$, which satisfies the discrete-time Riccati equation (Wonham, 1968):

$$P = F \left( P - P H^T \left( H P H^T + R \right)^{-1} H P \right) F^T + Q.$$

Therefore, the shape of the true posterior pdf of Equation 54 can be obtained by solving Equation 55 numerically and compared with the distribution obtained from the proposed IEWPF.
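For reference, the steady-state predicted covariance can be approximated by fixed-point iteration on the Riccati recursion; the two-dimensional system below is an illustrative toy, whereas the paper's setup uses $N_x = 1000$.

```python
import numpy as np

def steady_state_riccati(F, H, Q, R, n_iter=500):
    """Fixed-point iteration for the steady-state predicted covariance P of
    P = F (P - P H^T (H P H^T + R)^{-1} H P) F^T + Q.
    A small illustrative solver, not the paper's 1000-dimensional setup."""
    P = np.eye(F.shape[0])
    for _ in range(n_iter):
        S = H @ P @ H.T + R
        P = F @ (P - P @ H.T @ np.linalg.solve(S, H @ P)) @ F.T + Q
    return P

# Toy case mirroring the experiment's F_x = I and full observation H = I:
F = np.eye(2)
H = np.eye(2)
Q = 0.04 * np.eye(2)
R = 0.01 * np.eye(2)
P = steady_state_riccati(F, H, Q, R)
# In this diagonal case each diagonal entry p solves p^2 = q*p + q*r
```

Repeated substitution of the prediction and update steps converges here because the system is fully observed; more general cases may need a dedicated discrete algebraic Riccati solver.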
The procedure of the comparison using synthetic data is as follows. Let us assume the initial ensemble members $z_i^0$ are sampled from the background-error pdf $N(0, B)$. First, one member of the ensemble generated under the model-error covariance matrix Q and the background-error covariance matrix B is used as the "truth". Observations are then created from this "truth" and the observation error defined by the covariance matrix R. In the following experiments, the true value of the parameter is 0, and the true model-error covariance matrix Q is chosen as a diagonal matrix with main diagonal value 0.04 for the states and 0 for the parameter. The background-error covariance matrix B is a diagonal matrix with main diagonal values of 1 for the states and 0 for the parameter. The observation-error matrix R is diagonal, with its main diagonal value set to 0.01.
Next, for the assimilation, we choose the same matrices $Q_\beta$, B (for the states), and R as when the observations were generated. The entries of $Q_\eta$ and B for the parameter are set to the same values as those for the states. The number of particles is set to N = 20 to demonstrate the validity of the estimation with few particles. Regarding observations, we consider the condition that all model state variables x are observed at every step. Note that the step size $\lambda$ in Equation 39 is set to 0 in order to evaluate the parameter-augmentation method of the IEWPF described in Section 2.2. In order to investigate the dependence of the shape of the posterior pdf on the aforementioned $\alpha_i$, we compare the variance of pdfs estimated with values sampled from the $\alpha_i \geq 1$ branch at three sampling percentages: 0%, 50%, and 100%. Note that 50% means sampling equally from both branches, $\alpha_i \geq 1$ and $\alpha_i \leq 1$, which is closest to the true pdf according to Zhu et al. (2016), whereas 0% and 100% mean sampling only from the $\alpha_i \leq 1$ branch and only from the $\alpha_i \geq 1$ branch, respectively.
Figure 1 shows histograms of the variance accumulated from the 20th to the 1000th step, comparing the sampling cases of $\alpha$ with the diagonal value of R = 0.01. The variances of both (a) the states, Var(x), and (b) the parameter, Var($\theta$), are averaged over the dimension (that is, $N_x = 1000$ and $N_\theta = 1$ for the states and the parameter, respectively) and over the number of particles $N_p$ for each dimension, as follows:

$$\mathrm{Var}(x) = \frac{1}{N_x} \sum_{k=1}^{N_x} \frac{1}{N_p - 1} \sum_{i=1}^{N_p} \left( x_{k,i}^n - \overline{x}_k^n \right)^2,$$

where the index k denotes the elements of the states x, and $\overline{x}^n$ and $\overline{\theta}^n$ are the ensemble means. Note that the dimension of the parameter $\theta$ is one. The true variances, based on the solution of Equation 55, are shown as "True". From these comparisons, both the state and parameter variances are close to the "True" value in the 50% sampling case. On the other hand, when sampling only from the $\alpha \leq 1$ branch or only from the $\alpha \geq 1$ branch, the variance becomes smaller or larger, respectively, with the same trend as in Zhu et al. (2016). Figure 2 compares the posterior pdf obtained in the 50% sampling case with the true pdf for a diagonal value of R of 0.01. Since the ensemble size is very small compared with the number of model dimensions, both estimated pdfs are shown as histograms accumulated over the time evolution from the 20th to the 1000th step, for the state and the parameter, respectively. From Figure 2a,b, we see that the obtained pdfs of the state $x_1$ and the parameter $\theta$ are close to the true pdfs.
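One reading of this dimension- and ensemble-averaged variance can be sketched as follows; the random ensemble is an illustrative stand-in for the filter output.

```python
import numpy as np

def mean_ensemble_variance(ensemble):
    """ensemble: array of shape (N_p, N_x). Per-element sample variance over
    the N_p particles, then the mean over the N_x elements."""
    return ensemble.var(axis=0, ddof=1).mean()

rng = np.random.default_rng(0)
ens = rng.standard_normal((20, 1000))   # N_p = 20 particles, N_x = 1000 states
var_x = mean_ensemble_variance(ens)     # close to 1 for unit-variance noise
```

Accumulating this quantity over assimilation steps gives the histograms compared against the Riccati-based "True" value.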
These results indicate that the extension of the IEWPF to the proposed augmented state-space model is valid: the variance and shape of the posterior pdf for the parameter are close to those of the true pdf, under the condition that the variance and shape of the posterior pdf for the state are also close to those of the true pdf.

Lorenz-96 model with parameterized forcing
F I G U R E 2
Posterior pdf represented by the particles for the 50% sampling case, compared with the true pdf (full line), for (a) state x_1 of element one and (b) the parameter, respectively.

The Lorenz 1996 model with parameterized forcing is used as the time-evolution model in Equation 1 to explore the validity of the proposed method in a nonlinear high-dimensional system. The original Lorenz-96 model (Lorenz, 1996) is the nonlinear dynamical model

dx_j/dt = (x_{j+1} − x_{j−2}) x_{j−1} − x_j + F_j,

where the index j = 1, … , N_x with cyclic indices, x_j is the state variable of the model at position j, N_x is the total dimension, and F_j is the forcing function parameterized in terms of the true values c_0, c_1, c_2 and their scale parameters θ_0, θ_1, θ_2, which have to be estimated. For the evaluation of nonlinearity, this value of F_j, which is typically chosen to be 8 or more to generate chaotic behavior, is set as follows. The values of c_0 and c_1 are set to 8 and 4, respectively, and c_2 is set equal to the dimension of the model state, N_x. The scale parameters θ_0, θ_1, θ_2 are then estimated; their true values are each 1. By introducing this parameterized forcing term F_j(θ_0, θ_1, θ_2), each state variable x_j exhibits parameter-dependent chaotic behavior. The model is solved numerically by the fourth-order Runge-Kutta scheme with a time step of Δt = 0.05. The procedure for the following experiment is the same as for the previous linear model. The true model-error covariance matrix Q_x for the states is chosen as a tridiagonal matrix, with main-diagonal value 0.10 and both sub- and superdiagonal values 0.025. The background-error covariance matrix B is a diagonal matrix with main-diagonal value 1 for the states. In the experiments below, the true observation-error matrix R is diagonal, with main-diagonal values of 0.02. For the assimilation, we choose the same matrices Q_x, B for the states and R as when the observations were generated, that is, the true ones. The matrices Q_θ and B for the parameters are diagonal matrices with main-diagonal values 5.0 × 10⁻⁶ and 0.001, respectively. The step size α for the Adam method is set to 0.001. The number of particles is set to N = 20 to demonstrate the validity of the estimation with few particles. To consider high-dimensional cases, N_x is chosen as 1000, the same as in the linear-model experiment.
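The time-evolution model described above can be sketched as follows. The standard Lorenz-96 tendency and the fourth-order Runge-Kutta step with Δt = 0.05 follow the text; the constant forcing used here is a stand-in for the paper's parameterized F_j(θ_0, θ_1, θ_2), whose explicit form is not given in this excerpt.

```python
import numpy as np

def lorenz96_tendency(x, forcing):
    """Standard Lorenz-96 tendency with cyclic indices:
    dx_j/dt = (x_{j+1} - x_{j-2}) * x_{j-1} - x_j + F_j."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, forcing, dt=0.05):
    """One fourth-order Runge-Kutta step, as used in the experiments."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Illustrative run: constant forcing F = 8 over 1000 grid points, with a
# small perturbation to trigger chaotic behavior. The paper's parameterized
# F_j(theta0, theta1, theta2) would replace the constant here.
N_x = 1000
rng = np.random.default_rng(0)
x = 8.0 + 0.01 * rng.standard_normal(N_x)
x = rk4_step(x, forcing=8.0)
```

Note that the unperturbed constant state x_j = F is an exact equilibrium of the model, which is a convenient sanity check on the tendency and the integrator.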
In contrast to the previous evaluation using the linear model and a static parameter, this experiment investigates the ability of the proposed methods to estimate time-varying (i.e., dynamic) parameters in nonlinear high-dimensional systems. Regarding observations, consider the condition that all of the model states are observed every fourth step (i.e., the assimilation interval is 4). Moreover, this 1000-dimensional evaluation with only 20 particles can validate the feasibility of applying the method to realistic geophysical, climate, and other problems. First, we compare the methods outlined in Section 2 in terms of the RMSE and the ensemble spread (spread). Next, we compare the impact of the parameter error covariance Q_θ and the step-size factor α on the ensemble. The performance indicator of parameter estimation is not only the RMSE but also the ratio of the RMSE to the ensemble spread; it is preferable that this ratio be one for Gaussian variables. Note that, for non-Gaussian variables, this is only true for the forecast ensemble (Fortin et al., 2014).
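The two diagnostics above can be sketched as follows; a minimal illustration, assuming the usual definitions of ensemble-mean RMSE and spread. When the truth and the N members are drawn from the same distribution, the expected ratio is sqrt((N + 1)/N), i.e., close to one.

```python
import numpy as np

def rmse(ensemble, truth):
    """RMSE of the ensemble mean with respect to the truth, averaged over elements."""
    return np.sqrt(np.mean((ensemble.mean(axis=0) - truth) ** 2))

def spread(ensemble):
    """Ensemble spread: square root of the element-mean ensemble variance."""
    return np.sqrt(np.mean(ensemble.var(axis=0, ddof=1)))

# Calibrated-ensemble illustration: truth and 20 members from the same pdf,
# so the RMSE/spread ratio should be near sqrt(21/20) ~ 1.02.
rng = np.random.default_rng(1)
truth = rng.standard_normal(1000)        # a 1000-dimensional "truth"
ens = rng.standard_normal((20, 1000))    # 20 particles from the same pdf
ratio = rmse(ens, truth) / spread(ens)
```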

3.2.1
Comparison of the methods

Figure 3 compares the true values and particle trajectories of the three methods mentioned above for the state x_1 and the three scale parameters θ_0, θ_1, θ_2. All variables are observed every four steps, setting the main-diagonal value of the matrix R to 0.02. Each true parameter is increased by 30% at the 200th step, as the dashed red line shows. The figure shows the difference in tracking performance of the three methods for abrupt parameter changes and the advantage of the proposed method. The method shown in Figure 3a (MH1) is the conventional augmented method expressed in Equation 2. There are some steps where the ensemble trajectories deviate from the true state trajectory, and the ensemble spreads out greatly and cannot track abrupt changes in any of the three parameters. Both of the methods shown in Figure 3b (MH2) and Figure 3c (MH3) are based on the proposed state-space model expressed in Equation 5 with the covariance matrix Q. The method shown in Figure 3c (MH3) further applies the Adam-method-based nudging described in Section 2.5 with step-size factor α = 0.001.
The results for the state show that the ensemble trajectories are close to the true trajectory. Although both methods tend to approach the true values of θ_0 and θ_2, the Adam-method-based nudging is more accurate and responsive to abrupt changes, especially for θ_1. Figure 4a,b shows the time-series RMSE comparisons for the states and parameters, respectively. The horizontal axis indicates the time steps in the 100th-600th steps, where the difference between the methods in Figure 3 is significant. For the state, since the assimilation interval is four, each value is the average over all elements (i.e., 1000) at the third step after filtering, which has the largest prediction error; for the parameters, averages over all elements (i.e., 3) are shown for every step. The results show that the parameter estimation error after the abrupt parameter change (200th step) increases the error in the forecast step of the model states, and that the estimation error of the proposed method (MH3) decreases the fastest for both states and parameters.
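The Adam-based nudging applied in MH3 can be sketched as a standard Adam update on each particle's parameter vector. This is a minimal sketch: the gradient passed in here is a hypothetical placeholder, since the paper's actual nudging gradient (defined in Section 2.5 from the observation misfit) is not reproduced in this excerpt.

```python
import numpy as np

def adam_nudge(theta, grad, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied to a particle's parameter vector.
    `state` holds the moment estimates (m, v) and step counter t; `grad` is
    assumed to be a gradient of the observation misfit (hypothetical here)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# One nudging step on the three scale parameters with an illustrative gradient.
theta = np.ones(3)                              # theta0, theta1, theta2
state = (np.zeros(3), np.zeros(3), 0)
theta, state = adam_nudge(theta, grad=np.array([0.1, -0.2, 0.0]), state=state)
```

On the first step the bias-corrected update reduces to roughly −α·sign(grad), which is what makes the nudging respond quickly yet boundedly to abrupt parameter changes.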
Figure 5a,b shows the RMSE and spread comparisons for the states and parameters, respectively. Each box plot shows the time-averaged RMSE and spread at the forecast and filtering steps in the 100th-1500th steps shown in Figure 3, including the abrupt change (at the 200th step). The interquartile range (IQR) of each box plot therefore indicates the dispersion across the dimensions of the model states (1000) and parameters (3). Note that outliers are not plotted, to exclude estimation errors immediately after the abrupt change at the 200th step. The result for the states in Figure 5a shows that the proposed methods (i.e., MH2 and MH3) have smaller RMSE values and dispersion than the conventional method (i.e., MH1), especially in the forecast step. The result for the parameters in Figure 5b clearly shows that both the RMSE values and their dispersion for MH3 (i.e., with nudging) are smaller than for the others, and the spread is also smaller. The fact that the RMSE dispersion of MH3 is smaller than that of MH2 means that the difference in RMSE between the three parameters is small. Thus, the proposed nudging method reduces the differences in estimation accuracy between parameters, which demonstrates the effectiveness of combining the IEWPF with Adam.

3.2.2
Dependence of parameter error covariance and step-size factor

In the following, we investigate the impact of the parameter error covariance Q_θ and the step-size factor α on the estimation accuracy (RMSE) and the ensemble spread (spread). Figure 6 shows the true values and the particle trajectories of the scale parameter θ_0 under combinations of different values of Q_θ and α. Note that Q_θ is chosen as a diagonal matrix, which we denote Q_θ = σ²_θ I. The graph shown in Figure 6 as exp2 is the reference condition with σ²_θ = 5.0 × 10⁻⁶ and α = 0.001, and is the same graph shown for scale parameter θ_0 in Figure 3c. The other graphs exp1, exp3, and exp4 in Figure 6 show the cases where σ²_θ is 1.0 × 10⁻⁶, 1.0 × 10⁻⁵, and 5.0 × 10⁻⁵, respectively, under the same value of α = 0.001. These graphs show that the larger the parameter covariance, the larger the ensemble spread and the smaller the overshoot after the abrupt parameter change.
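The parameter part of the augmented state-space model, with covariance Q_θ = σ²_θ I, can be read as a persistence (random-walk) evolution of the parameters between assimilation steps. A minimal sketch under that assumption, with an illustrative function name:

```python
import numpy as np

def propagate_parameters(theta_particles, sigma2_theta, rng):
    """Random-walk parameter evolution theta_k = theta_{k-1} + eta, with
    eta ~ N(0, Q_theta) and Q_theta = sigma2_theta * I (the diagonal form
    used in the experiments)."""
    noise = np.sqrt(sigma2_theta) * rng.standard_normal(theta_particles.shape)
    return theta_particles + noise

# 20 particles, 3 scale parameters, reference value sigma2_theta = 5.0e-6.
rng = np.random.default_rng(2)
theta = np.ones((20, 3))
theta = propagate_parameters(theta, sigma2_theta=5.0e-6, rng=rng)
```

The magnitude of σ²_θ controls how far the particles can wander between assimilation steps, which is exactly the spread-versus-overshoot trade-off that exp1-exp4 explore.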
Next, we quantitatively evaluate the impact of the parameter error covariance Q_θ on the ensemble. Figure 7 shows the dependence of the RMSE and spread on the parameter error covariance Q_θ for (a) states and (b) parameters, respectively. Each box plot shows the time-averaged RMSE and spread at the forecast and filtering steps in the 100th-1500th steps. The forecast RMSE and spread include three cycles of forecast steps, since the filtering interval is four. The four values of σ²_θ shown on the horizontal axis correspond to exp1, exp2, exp3, and exp4 in Figure 6. Note that outliers are not plotted, to exclude estimation errors immediately after the abrupt change at the 200th step.
For the states, we can see from Figure 7a that neither the RMSE nor the spread depends on the diagonal value of the parameter error covariance Q_θ. In addition, the forecast RMSE and spread are close in value, that is, their ratio is close to one. For the parameters, on the other hand, Figure 7b shows that as the diagonal values σ²_θ increase, the spread increases and the RMSE decreases. In particular, for σ²_θ = 5.0 × 10⁻⁵ the forecast RMSE and spread are close, that is, their ratio is close to one.
Figure 8 shows the true values and the particle trajectories, as in Figure 6. The graph of exp2 is the same as exp2 in Figure 6, the reference condition with σ²_θ = 5.0 × 10⁻⁶ and α = 0.001. The graphs exp5, exp6, and exp7 in Figure 8 show the cases where α is 0.0005, 0.002, and 0.004, respectively, under the same value of σ²_θ = 5.0 × 10⁻⁶. These graphs show that the larger the step-size factor, the faster the estimate approaches the true value after the abrupt change, but the more likely it is to overshoot.
Figure 9 shows the dependence of the RMSE and spread on the step-size factor α for (a) states and (b) parameters, respectively. Each box plot shows the time-averaged RMSE and spread at the forecast and filtering steps during the 100th-1500th steps, and the forecast RMSE and spread include three cycles of forecast steps, as in Figure 7. The four values of α shown on the horizontal axis correspond to exp5, exp2, exp6, and exp7 in Figure 8. Note that outliers are not plotted, as in Figure 7. Similarly to the trend shown in Figure 7, the RMSE and spread for the states show almost no dependence on the step-size factor α. For the parameters, the spread does not increase as the step-size factor α increases, but the RMSE decreases, that is, the ratio of the forecast RMSE to the spread approaches one.

Dependence of observation error and number of observations
To evaluate the dependence on the observation error and the number of observations, we compare the large step-size condition α = 0.004 (exp7) with two additional experiments (exp8 and exp9). The first (exp8) is the case where the main-diagonal value of the matrix R is large; in the following, the value is set to 0.08. Note that this experiment (exp8) uses observation data generated with R = 0.08, so R has the same value for data generation and assimilation. The second (exp9) is the case where the state is observed at every other grid point, so that the observation operator H selects half of the state variables. In both additional experiments, the step size and the diagonal value of the parameter error covariance are the same as for exp7, that is, α = 0.004 and σ²_θ = 5.0 × 10⁻⁶. Figure 10 shows a comparison of the RMSE and spread for the different observation conditions for (a) states and (b) parameters. The description of the box plot is the same as in Figure 9. Figure 10 exp7 shows the results of the reference condition, that is, R = 0.02 with all model states observed. Comparing exp7 and exp8 for the states in Figure 10a, the change in R from 0.02 to 0.08 increases both RMSE and spread, the spread somewhat more markedly. For the parameters in Figure 10b, the RMSE values and their dispersion tend to increase more than the spread. Comparing exp7 and exp9 for the states in Figure 10a, because the number of observed variables is halved, both RMSE and spread increase, except for the filtered values of the observed variables. As for the parameters, both RMSE and spread show a small increase in their median values, but an increase in dispersion. The results indicate that increasing the observation error and decreasing the observation density increase the differences in estimation accuracy between parameters. In other words, the decrease in observed information reduces the estimation accuracy of parameters with little impact (i.e., low sensitivity) on the model state. This could potentially be mitigated by adjusting the step size and the parameter error covariance.
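The exp9 observing condition (every other grid point) corresponds to a linear observation operator H that selects alternate state components. A sketch, with `build_obs_operator` as an illustrative helper:

```python
import numpy as np

def build_obs_operator(n_x, every=2):
    """Linear observation operator H that selects every `every`-th grid point
    (every=1 observes all states; every=2 is the exp9 half-density case)."""
    idx = np.arange(0, n_x, every)
    H = np.zeros((idx.size, n_x))
    H[np.arange(idx.size), idx] = 1.0
    return H

H_full = build_obs_operator(1000, every=1)   # reference: all states observed
H_half = build_obs_operator(1000, every=2)   # exp9: every other grid point
R_half = 0.02 * np.eye(H_half.shape[0])      # diagonal R with value 0.02
```

In practice one would apply H by indexing (`x[idx]`) rather than by matrix multiplication; the explicit matrix is shown only to make the linear operator concrete.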

CONCLUSION
This article proposed a resilient and efficient state and time-varying parameter estimation method for nonlinear high-dimensional systems through a sequential DA process. First, we introduced an extension of the IEWPF to an augmented state-space model with a correlated covariance matrix. We then proposed an IEWPF-based method that incorporates a nudging technique, inspired by optimization algorithms in machine learning, into the parameter time-evolution model by exploiting the flexibility of the proposal density in particle filtering.
The performance of the method was examined in the 1000-dimensional linear model and the nonlinear Lorenz-96 model. Experiments using the linear model with a static parameter indicate that the impact of the scalar factor α on the variance of the parameter is similar to that on the variance of the state. Numerically, under the condition that the variance and shape of the posterior pdf for the states are close to the true ones, those for the parameter are also close to the true ones.
The experimental results of the nonlinear Lorenz-96 model with time-varying parameters show the following. First, the proposed state-augmentation method successfully estimates states and parameters simultaneously, even when the number of ensemble members is much smaller than the model dimension. This indicates that filter degeneracy is avoided when extending to an augmented state-space model. Second, the proposed parameter nudging method inspired by optimization algorithms accelerates the tracking of abrupt parameter changes and reduces the differences in estimation accuracy between parameters. This suggests the effectiveness of combining the IEWPF with Adam, one such optimization algorithm. Third, the evaluation of the impact of the parameter error covariance and the step-size factor on the time-averaged RMSE and ensemble spread (spread) shows that the former increases the spread and decreases the RMSE, while the latter decreases the RMSE. Properly determining these values so that the ratio of the RMSE to the spread approaches one will allow for good ensemble generation; a systematic method for doing so will be a subject of future research. Finally, the evaluation of the dependence on the observation error and the number of observations shows that a decrease in observed information reduces the estimation accuracy of parameters with little impact (i.e., low sensitivity) on the model state. This could potentially be mitigated by adjusting the step-size factor and the parameter error covariance. Alternatively, it may be beneficial to narrow the parameters to be estimated to those with high sensitivity through a preliminary sensitivity analysis.
In the numerical experiments in this article, the Lorenz-96 model with parameterized forcing was used mainly to evaluate the nonlinearity of the time evolution of the model states, but further investigation of nonlinearity in the parameters is needed. Adam is a first-order gradient-based optimization method widely used to learn the weights of deep neural networks, that is, nonlinear functions. Thus, our Adam-based nudging term can in principle work in nonlinear problems. However, even for nonlinear convex problems, there are conditions and limits on convergence, and new methods have been proposed (Reddi et al., 2018). Furthermore, convergence for nonconvex problems is still an open question, though Chen et al. (2019) developed an analysis framework and a set of sufficient conditions that guarantee convergence. Therefore, the applicability of the proposed method to various nonlinear problems in data assimilation needs to be investigated and is a topic for future research.
In this article, we applied the proposed online parameter estimation scheme to the IEWPF as an example of a PF that avoids filter degeneracy. The method is shown to be capable of resilient and efficient estimation of time-varying parameters. The results lead to the conjecture that the proposed method is applicable to realistic geophysical, climate, and other problems. Since several other approaches to avoiding filter degeneracy have been proposed (e.g., Skauvold et al., 2019), the evaluation of other combinations will be a subject of future research.

F I G U R E
Histogram of cumulative variance comparing the diagonal values of R = 0.01 for (a) states and (b) the parameter, respectively. Three sampling percentages from the α ≤ 1 branch (100%, 50%, and 0%) are compared with the true variance (dashed line).

F I G U R E 3
Comparison of estimated state and parameter trajectories between (a) the conventional augmented method (MH1), (b) without nudging: α = 0 (MH2), and (c) with nudging: α = 0.001 (MH3). The solid lines show each of the 20 ensemble members, and the dashed lines show the true parameter value. Only the 1350th-1500th steps are shown for the state, and each true parameter is increased by 30% at the 200th step.

F I G U R E 4
Comparison of the time-series RMSE after the abrupt parameter change (200th step) between the augmented method (MH1), without nudging: α = 0 (MH2), and with nudging: α = 0.001 (MH3), as per Figure 3. The third step after filtering is shown for the (a) state, and all steps for the (b) parameters. Each value is averaged over all elements.

F I G U R E 5
Box plot showing the comparisons of RMSE and spread for the forecast and filtered ensembles between the augmented method (MH1), without nudging: α = 0 (MH2), and with nudging: α = 0.001 (MH3), as per Figure 3. Each IQR indicates the dispersion of the (a) state and (b) parameter elements averaged over the forecast and filtering steps in the 100th-1500th steps, respectively. Outliers are not plotted.

F I G U R E 6
Comparison of estimated parameter trajectories between different values of σ²_θ: 1.0 × 10⁻⁶ (exp1), 5.0 × 10⁻⁶ (exp2), 1.0 × 10⁻⁵ (exp3), and 5.0 × 10⁻⁵ (exp4) under the same value of α = 0.001. The solid lines show each of the 20 ensemble members, and the dashed lines show the true parameter value. Each true parameter is increased by 30% at the 200th step.

F I G U R E 7
Box plot showing the comparison of RMSE and spread for each of the forecast and filtered ensembles between different values of σ²_θ = 1.0 × 10⁻⁶, 5.0 × 10⁻⁶, 1.0 × 10⁻⁵, and 5.0 × 10⁻⁵, as per Figure 6. Each IQR indicates the dispersion of the (a) state and (b) parameter elements averaged over the forecast and filtering steps in the 100th-1500th steps, respectively. Outliers are not plotted.

F I G U R E 8
Comparison of estimated parameter trajectories between different values of α: 0.0005 (exp5), 0.001 (exp2), 0.002 (exp6), and 0.004 (exp7) under the same value of σ²_θ = 5.0 × 10⁻⁶. The solid lines show each of the 20 ensemble members, and the dashed lines show the true parameter value. Each true parameter is increased by 30% at the 200th step.

F I G U R E 9
Box plot showing the comparison of RMSE and spread for each of the forecast and filtered ensembles between different values of α = 0.0005, 0.001, 0.002, and 0.004, as per Figure 8. Each IQR indicates the dispersion of the (a) state and (b) parameter elements averaged over the forecast and filtering steps in the 100th-1500th steps, respectively. Outliers are not plotted.

F I G U R E 10
Box plot showing the comparisons of RMSE and spread for the forecast and filtered ensembles between the large step-size condition (exp7), large observation error: R = 0.08 (exp8), and partial observation (exp9). Each IQR indicates the dispersion of the (a) state and (b) parameter elements averaged over the forecast and filtering steps in the 100th-1500th steps, respectively. Outliers are not plotted. "Ob" and "Uo" denote observed and unobserved states.

• Sample initial particles for the state x_i^0 and parameter θ_i^0, i = 1, … , N.
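The initialization step above can be sketched as Gaussian draws around background values, assuming the diagonal background covariances used in the experiments (B with main-diagonal value 1 for states and 0.001 for parameters). The function name and the background values themselves are illustrative.

```python
import numpy as np

def sample_initial_particles(x_b, theta_b, b_x, b_theta, n_particles, rng):
    """Draw initial particles x^0_i ~ N(x_b, b_x I) and theta^0_i ~ N(theta_b, b_theta I),
    with diagonal background covariances as in the experiments."""
    x0 = x_b + np.sqrt(b_x) * rng.standard_normal((n_particles, x_b.size))
    th0 = theta_b + np.sqrt(b_theta) * rng.standard_normal((n_particles, theta_b.size))
    return x0, th0

# 20 particles for a 1000-dimensional state and 3 scale parameters.
rng = np.random.default_rng(3)
x0, th0 = sample_initial_particles(np.zeros(1000), np.ones(3), 1.0, 0.001, 20, rng)
```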