Data Assimilation Networks

Data assimilation (DA) aims at forecasting the state of a dynamical system by combining a mathematical representation of the system with noisy observations, taking into account their uncertainties. State-of-the-art methods are based on Gaussian error statistics and on linearizations of the non-linear dynamics, which may lead to sub-optimal methods. In this respect, there are still open questions on how to improve these methods. In this paper, we propose a fully data-driven deep learning architecture that generalizes recurrent Elman networks and data assimilation algorithms, and that approximates a sequence of prior and posterior densities conditioned on noisy observations. By construction, our approach can be used for general nonlinear dynamics and non-Gaussian densities. In numerical experiments based on the well-known Lorenz-95 system and with Gaussian error statistics, our architecture achieves performance comparable to EnKF on both the analysis and the propagation of probability density functions of the system state at a given time, without using any explicit regularization technique.


Context
In Data Assimilation (DA), the time dependent state of a system is estimated using two models that are the observational model, which relates the state to physical observations, and the dynamical model, that is used to propagate the state along the time dimension (Asch et al., 2016). These models can be written as a Hidden Markov Model (HMM).
Observational and dynamical models are described using random variables that account for observation and state errors. Hence DA algorithms are grounded in a Bayesian approach in which observation realizations are combined with the above statistical models to obtain state predictive and posterior density sequences. This estimation is done in two recursive steps: the analysis updates a predictive density into a posterior one with an incoming observation; and the propagation updates a posterior density into the next cycle's predictive (or prior) density.
DA methods use additional assumptions or approximations to obtain closed expressions for the densities so that they can be handled by computers. Historically, in the Kalman filter (KF) approach, statistical models are assumed to be Gaussian and the physical dynamics are assumed to be linear (Kalman, 1960). Hence, the propagation and analysis steps consist in updating the mean and covariance matrix of Gaussian densities. A correct estimation of the covariance matrices is crucial since they determine to what extent the predictive density will be corrected to match observations. In the Ensemble Kalman Filter (EnKF) approach, the covariance matrices are represented by a set of sampled vectors to reduce the computational cost of the filter (Evensen, 2009). When the EnKF is used with a small ensemble, the covariance matrix estimate becomes low-rank. This causes spurious correlations in the covariance matrix, which are filtered out by using regularization techniques such as localization and inflation (Hamill et al., 2001; Houtekamer & Mitchell, 2001; Asch et al., 2016). The EnKF can be used for nonlinear dynamics; however, due to the truncation of the statistics at second order, in the limit of large ensembles the EnKF solution differs from the solution of the Bayesian filter (Le Gland et al., 2011), except for linear dynamics and Gaussian statistics. Hence, when using these methods in non-linear and non-Gaussian settings, there are still open questions on achieving an optimal prediction error in the Bayesian setting.
In this paper, we propose a general supervised learning framework based on Recurrent Neural Networks (RNNs) for Bayesian DA to approximate a sequence of prior and posterior densities conditioned on noisy observations. Section 2 explains the sequential Bayesian DA framework, with an emphasis on the time invariant structure of Bayesian DA, which is the key property for RNNs. The proposed approach, the Data Assimilation Network (DAN), is then detailed in Section 3; it generalizes both the Elman Neural Network and the Kalman Filter. DAN approximates the prior and posterior densities by minimizing a log-likelihood cost function based on the information loss, related to the cross-entropy. The details of the cost function and the theoretical results for its optimal solution are presented in Section 3.4. The practical aspects of DAN, including the architecture and a computationally efficient training algorithm, are given in Section 4. We then evaluate the performance of DAN on the fully and partially observed Lorenz-95 system with Gaussian prior and posterior densities in Section 5. The Lorenz-95 system is non-linear and is often used as a first step in meteorology to investigate potential applications of a proposed method to high-dimensional chaotic systems. We compare the performance of DAN with the state-of-the-art EnKF methods IEnKF-Q and LETKF in terms of root mean square errors, and we also provide a stability analysis with respect to the initial condition and to forecast time intervals beyond the training range. Finally, we provide conclusions in Section 6.

Related work
With the advances in machine learning and deep learning, there has been a significant increase in research on using ML to forecast the evolution of physical systems in a data-driven way (Brunton et al., 2016; Rudy et al., 2017; Raissi et al., 2017a;b; 2019; Li et al., 2020; Jia et al., 2021). Recently, this research has had a significant impact on the design of advanced DA algorithms. We next outline three main directions related to our research on the hybridization of DA and ML approaches.
In a first direction, one addresses the traditional DA problem, where the goal is to estimate the distribution of a state sequence x_t conditioned on an observation sequence y_t by explicitly using an underlying dynamical model M. Harter and de Campos Velho (2012) propose to use an Elman Neural Network to learn the analysis equation of a KF-type algorithm where the dynamics are nonlinear; their main aim is to reduce the computational complexity without affecting the accuracy. McCabe and Brown (2021) focus on learning the analysis equation within an EnKF framework. They propose the Amortized Ensemble Filter, which aims to improve existing EnKF algorithms by replacing the EnKF analysis equations with a parameterized function in the form of a neural network.
In a second direction, one aims to learn an unknown dynamical model M from noisy observations y_t. This direction is more ambitious than the first one, as the dynamics to be learnt can be non-linear or even chaotic. Bocquet et al. (2019) propose to use the Bayesian data assimilation framework to learn a parametric M from sequences of observations y_t. The dynamical model is represented by a surrogate model, formalized as a neural network under locality and homogeneity assumptions. Bocquet et al. (2020) extend this framework to the joint estimation of the state x_t and the dynamical model M, with a model error represented by a covariance matrix. They estimate the ensembles of the state by using a traditional Ensemble Kalman Smoother based on a Gaussian assumption, and then, given the posterior ensemble, they minimize over the dynamical model and its error statistics. Similarly, Brajard et al. (2020) propose an iterative algorithm to learn a neural-network parametric model of M: with M fixed, it estimates the state x_t using the observations y_t, and then uses the estimated state to optimize the parameters of M. A related work is Krishnan et al. (2015), which introduces a deep KF to estimate the mean and the error covariance matrix in a KF to model medical data, based on the variational autoencoder (Girin et al., 2021).
A third direction, which is the one we consider in the present paper, is to estimate the distribution of a state sequence x_t conditioned on an observation sequence y_t without explicitly using the underlying dynamical model M in the propagation. This direction often uses training data in the supervised form (x_t, y_t). For instance, Fablet et al. (2021) propose a joint learning of the NN representation of the model dynamics and of the analysis equation, albeit within a traditional variational data assimilation framework. A related work on learning a surrogate model is Revach et al. (2022), which proposes a parametric KF to handle partially known model dynamics, replacing explicit covariance matrices by a parametric NN to estimate the model error. Penny et al. (2022) also learn a surrogate model, based on recurrent neural networks, using only state sequences (x_t), which is then used in a deterministic EnKF framework.
All these approaches improve DA methodologies by building on an existing DA algorithm. In this work, we propose a fully data-driven approach to Bayesian data assimilation that does not rely on any prior DA algorithm, since such algorithms can be sub-optimal in the case of non-Gaussian error statistics and non-linear dynamics.

Notation
We denote a state random variable at time t as x_t, taking its values in some space X = R^n of dimension n. An observation random variable at time t is denoted by y_t, taking its values in some space Y of dimension d (often R^d). We write a sequence of random variables x_1, · · · , x_t as x_{1:t}. A joint probability density of two sequences of random variables x_{1:t} and y_{1:t} with respect to the Lebesgue measure on the finite-dimensional Euclidean space X^t × Y^t is written as p(x_{1:t}, y_{1:t}) = p_{x_{1:t},y_{1:t}}(x_{1:t}, y_{1:t}). We denote the value (realization) of a random variable x as x. The set of pdfs over X is denoted by P_X. A conditional pdf for x_t conditioned on the value of y_t, i.e. y_t = y_t, is written as p_{x_t|y_t}(·|y_t) ∈ P_X. Given a function f(x) on a measurable space X with measure p, we say f(x) = 0 p-almost everywhere (p-a.e. x in short) when there exists a measurable set A with p(A) = 0 such that f(x) = 0 for all x ∉ A.

Sequential Bayesian Data Assimilation
In this section, we review the Bayesian optimal solution of sequential Bayesian data assimilation for an observed dynamical system and use its repetitive time-invariant structure to motivate the introduction of the DAN framework.

Sequential Bayesian Data Assimilation
Data assimilation aims to estimate the state of a dynamical process which is modeled by a discrete-time stochastic equation and observed via available instruments which can be modeled by another stochastic equation (Asch et al., 2016). These equations are given by the following system:

x_{t+1} = M(x_t) + η_t, (1a)
y_t = H(x_t) + ε_t, (1b)

where M(·) is the nonlinear propagation operator that acts on the model state random variable x_t ∈ X at time t and returns the model state x_{t+1} ∈ X, and H(·) is the nonlinear observation operator that acts on the state random variable x_t and approximately returns the observation random variable y_t ∈ Y at time t. Both of these steps may involve errors, represented by an additive model error η_t and an additive observation error ε_t. For example, the observation operator may involve spatial interpolations, physical unit transformations and so on, resulting in measurement errors. We assume that these stochastic errors are distributed according to the pdfs p_η and p_ε, that they are i.i.d. along time, and that they are independent of the initial state x_1. Under these assumptions, the DA problem can be interpreted as a Hidden Markov Model (Carrassi et al., 2018).
Given such a dynamical model, sequential Bayesian DA aims at quantifying the uncertainty over the system state each time an observation sample becomes available. Such an analysis starts by rewriting, under suitable mathematical assumptions, the DA system in terms of the conditional probability density functions p_{x_t|x_{t−1}}(·|x_{t−1}) ∈ P_X, which represents (1a), and p_{y_t|x_t}(·|x_t) ∈ P_Y, which represents (1b). Using these densities, we can quantify the uncertainty of the state as a function of the observations. This can be done sequentially in two steps using the Bayesian framework: the analysis step and the propagation (forecast) step. Let p^b_t := p_{x_t|y_{1:t−1}} be the prior distribution of x_t given y_{1:t−1}, and p^a_t := p_{x_t|y_{1:t}} be the posterior distribution of x_t given y_{1:t}. The analysis step computes p^a_t(·|y_{1:t}) ∈ P_X from p^b_t(·|y_{1:t−1}) ∈ P_X based on Bayes' rule,

p^a_t(·|y_{1:t}) = p_{y_t|x_t}(y_t|·) p^b_t(·|y_{1:t−1}) / p_{y_t|y_{1:t−1}}(y_t|y_{1:t−1}). (2)

Here, p_{y_t|x_t}(y_t|·) is considered as a likelihood function of x_t, and p_{y_t|y_{1:t−1}} is the marginal distribution of the current observation conditioned on the past observations. Similarly, the propagation step computes p^b_{t+1}(·|y_{1:t}) from p^a_t(·|y_{1:t}),

p^b_{t+1}(x'|y_{1:t}) = ∫_X p_{x_{t+1}|x_t}(x'|x) p^a_t(x|y_{1:t}) dx. (3)

The analysis and forecast steps are then repeated for a given number of cycles (time interval), in which the forecast step provides the prior density for the next cycle.
Performing the analysis and propagation steps in (2) and (3) with linear dynamics for the propagation operator M(·) and the observation operator H(·), and with a Gaussian assumption for the densities p_ε and p_η, reduces to the well-known Kalman filter (KF) (Kalman, 1960). The challenge is that the calculation of the pdfs becomes intractable with nonlinear operators or non-Gaussian error pdfs. When the dynamics are nonlinear, ensemble-type KFs such as the Ensemble KF (Evensen, 2009) are widely used alternatives, but when used with a limited number of ensemble members, they require additional remedies (see Section 3.3 for further discussion).

Time-invariant structure in the Bayesian Data Assimilation
We review the invariant structure of the Bayesian Data Assimilation (BDA) for the Hidden Markov Model (HMM) defined in Section 2.1, which is the key property motivating the DAN framework. Following the i.i.d. assumptions that we have made on the errors in (1a) and (1b), the conditional pdfs p_{x_{t+1}|x_t} and p_{y_t|x_t} are time invariant, in the sense that for t = 1, 2, . . .

p_{x_{t+1}|x_t}(v|u) = p_{x_2|x_1}(v|u), p_{y_t|x_t}(y|u) = p_{y_1|x_1}(y|u),

for all u, v ∈ X and y ∈ Y.
As a result, the conditional pdfs representing the HMM are time invariant in the following sense. The analysis step (2) can be considered as a time invariant function, a_BDA, which operates on the prior cpdf p^b_t(·|y_{1:t−1}) ∈ P_X and a current observation y_t ∈ Y, and returns a posterior cpdf p^a_t(·|y_{1:t}) ∈ P_X:

a_BDA(p^b_t(·|y_{1:t−1}), y_t) = p^a_t(·|y_{1:t}).

Similarly, according to (3), the propagation can be considered as a time invariant function, b_BDA, that transforms a posterior pdf into a prior pdf,

b_BDA(p^a_t(·|y_{1:t})) = p^b_{t+1}(·|y_{1:t}).

This presentation of sequential BDA allows us to see the DA cycle as the composition of two time invariant transformations a_BDA and b_BDA, i.e. each density is produced using the same update rule applied to the previous densities. Exploiting this repetitive time invariant structure, corresponding to a chain of events, leads to a general framework, named DAN, based on recurrent neural networks (RNNs). We detail the ingredients of DAN in Section 3 and Section 4.

Data Assimilation Networks (DAN)
In Section 3.1 we present DAN, a general framework for DA which generalizes traditional data assimilation algorithms such as the Kalman filter and the EnKF, detailed in Sections 3.2 and 3.3. Thanks to the repetitive structure of BDA, we propose in Section 3.4 a log-likelihood cost function, based on the information loss, to approximate conditional pdfs. Instead of calculating the posterior pdfs analytically, DAN learns these pdfs from sequences of (x_t, y_t) generated from the HMM. We show theoretically that this framework can handle nonlinear model dynamics and non-Gaussian error distributions, where the Bayesian conditional pdfs are not necessarily Gaussian.

DAN framework
For a given set S, a DAN is defined as a triplet of transformations (a, b, c) such that

a : S × Y → S (analyzer), b : S → S (propagator), c : S → P_X (procoder).

The term "procoder" is a contraction of "probability coder", as the function c transforms an internal representation into an actual pdf over X. A representation of a DAN is given in Figure 1. When S = P_X and c is the identity, this framework encompasses the transformations a_BDA and b_BDA of the BDA as a special case. However, it also includes other DA algorithms such as the Kalman Filter and the Ensemble Kalman Filter. These connections are detailed in Sections 3.2 and 3.3.
One important ingredient of DAN as a general framework for cycled DA algorithms is the use of memory to transform prior and posterior densities from one cycle to the next. In this respect, S can be interpreted as a memory space, which is a vector space within the DAN framework. Considering DAN as an RNN with memory naturally makes the link with the well-known Elman network. This connection is detailed in Section 4.1.
As a recurrent neural network, we can unroll a DAN into a sequence of transformations. Given an initial memory s^a_0 ∈ S and an observation trajectory y_{1:T} ∈ Y^T, a DAN recursively outputs a predictive and a posterior sequence such that for 1 ≤ t ≤ T,

s^b_t = b(s^a_{t−1}), q^b_t(·|y_{1:t−1}) = c(s^b_t),
s^a_t = a(s^b_t, y_t), q^a_t(·|y_{1:t}) = c(s^a_t).

This recursive application is represented in Figure 1. Note that {q^b_t}_{t=1}^T and {q^a_t}_{t=1}^T are candidate conditional densities. This means that for a given sequence of observations y_{1:t} = (y_1, · · · , y_t), we have q^b_t(·|y_{1:t−1}) ∈ P_X and q^a_t(·|y_{1:t}) ∈ P_X. However, these candidate conditional densities are not required by construction to be compatible with a joint distribution over X^T × Y^T. As a consequence, we do not assume that there is some joint distribution q(x_{1:T}, y_{1:T}) which induces the q^b_t(·|y_{1:t−1}) and q^a_t(·|y_{1:t}). However, as we shall see in Section 4, the construction of DAN using recurrent neural networks implicitly imposes some relationships between these candidate conditional densities.
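This recursion can be sketched generically in Python; the transformations a, b, c and the memory representation below are placeholders for illustration, not the paper's actual implementation.

```python
def unroll_dan(a, b, c, s_a0, observations):
    """Unroll a DAN over an observation trajectory.

    a: analyzer,   (memory, y) -> memory   (analysis)
    b: propagator,  memory -> memory       (propagation)
    c: procoder,    memory -> pdf          (decoding)
    Returns the predictive (prior) and posterior pdf sequences.
    """
    s_a = s_a0
    priors, posteriors = [], []
    for y in observations:
        s_b = b(s_a)               # propagate the memory to the next cycle
        priors.append(c(s_b))      # prior q_b_t, conditioned on y_{1:t-1}
        s_a = a(s_b, y)            # analysis with the incoming observation
        posteriors.append(c(s_a))  # posterior q_a_t, conditioned on y_{1:t}
    return priors, posteriors
```

With S = P_X and c the identity this is the BDA cycle itself; with (mean, covariance) memories it reduces to the Kalman filter recursion.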

The Kalman Filter as a DAN
In the original Kalman filter (KF) (Kalman, 1960), both the propagation operator M and the observation operator H are assumed to be affine. In this case, the analysis and propagation transformations preserve Gaussian pdfs, which are easily characterized by their mean and covariance matrix. The analysis and propagation transformations then simplify to algebraic expressions on these pairs, as we shall see in this section.
Suppose that the internal representation of a Gaussian pdf is formalized by the injective transformation

c_KF : Z_X → G_X, c_KF(s) = N(µ, Σ),

where s := (µ, Σ), µ and Σ being the mean and covariance matrix respectively, Z_X is the set of mean and covariance matrix pairs over X, and G_X is the set of Gaussian pdfs over X. The KF analysis transformation is the function that transforms such a prior pair in Z_X and an observation y ∈ Y into the posterior pair in Z_X, i.e. a_KF : Z_X × Y → Z_X, given by

a_KF(µ^b, Σ^b, y) = (µ^a, Σ^a),

with

Σ^a = ((Σ^b)^{−1} + H^T R^{−1} H)^{−1}, µ^a = µ^b + Σ^a H^T R^{−1} (y − H µ^b),

where H is the linear part of the observation operator and R the observation error covariance matrix. When the dimension of the observation y_t is less than or equal to the dimension of the state x_t, we can alternatively obtain

K = Σ^b H^T (H Σ^b H^T + R)^{−1}, µ^a = µ^b + K (y − H µ^b), Σ^a = (I − K H) Σ^b.

The mapping diagram for the analysis step of the KF is given in Figure 2, which is a commutative diagram. We recall that a diagram is said to commute if any two paths between the same nodes compose to give the same map (Barr & Wells, 1991).
Likewise, the KF propagation transformation is the function that transforms a posterior pair in Z_X into the next cycle prior pair in Z_X, i.e. b_KF : Z_X → Z_X, given by

b_KF(µ^a, Σ^a) = (M µ^a, M Σ^a M^T + Q),

where M is the linear part of the propagation operator and Q the model error covariance matrix. The mapping diagram for the propagation step of the KF is also given in Figure 2, and is a commutative diagram as well.
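For concreteness, the two KF transformations can be written directly as operations on (µ, Σ) pairs. The sketch below uses the gain form with linear M and H as NumPy matrices (a minimal illustration, not the paper's code):

```python
import numpy as np

def a_kf(mu_b, Sig_b, y, H, R):
    """KF analysis: prior pair (mu_b, Sig_b) + observation y -> posterior pair."""
    S = H @ Sig_b @ H.T + R             # innovation covariance (d x d)
    K = Sig_b @ H.T @ np.linalg.inv(S)  # Kalman gain
    mu_a = mu_b + K @ (y - H @ mu_b)
    Sig_a = (np.eye(len(mu_b)) - K @ H) @ Sig_b
    return mu_a, Sig_a

def b_kf(mu_a, Sig_a, M, Q):
    """KF propagation: posterior pair -> next-cycle prior pair."""
    return M @ mu_a, M @ Sig_a @ M.T + Q
```

Chaining b_kf, then a_kf, and decoding each pair with c_KF reproduces the DAN cycle of Figure 1 in the Gaussian-linear case.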
Unfortunately, the linearity of M and H is rarely met in practice, and covariance matrices may not be easy to store and manipulate in large scale problems. A popular reduced-rank approach is the ensemble Kalman filter, which has proven effective in several large scale applications.

The Ensemble Kalman Filter as a DAN
In the Ensemble Kalman Filter (EnKF) (Evensen, 2009), the statistics (µ, Σ) ∈ Z_X are estimated from an ensemble matrix X ∈ X^m = R^{n×m} having m columns, with the empirical estimators

µ = (1/m) X 1_m, Σ = X U X^T, U = (1/(m−1)) (I_m − (1/m) 1_m 1_m^T) ∈ R^{m×m},

where 1_m ∈ R^m is the vector of ones and I_m ∈ R^{m×m} is the identity matrix (Fillion et al., 2020). Thus, the algebra over mean and covariance matrix pairs can be represented by operators on ensembles. In this approach, nonlinear operators can be evaluated columnwise on ensembles, and ensembles with few columns may produce low-rank approximations of large scale covariance matrices. Hence ensembles are an internal representation for the pdfs, transformed into a Gaussian pdf by the function

c_EnKF : X^m → G_X, c_EnKF(X) = N((1/m) X 1_m, X U X^T),

when the error covariance matrix X U X^T is full-rank, for instance when m ≥ n. When m < n, the error covariance matrix becomes rank-deficient, resulting in spurious correlations. In this rank-deficient case, we must select a different base measure on which the Gaussian distribution is supported, by using a generalized inverse of X U X^T (Rao, 1973).
The EnKF analysis transformation is the function that transforms such a prior ensemble X^b ∈ X^m and an observation y ∈ Y into the posterior ensemble X^a ∈ X^m, a_EnKF : X^m × Y → X^m. In the stochastic (perturbed-observations) variant it is given by

X^a = X^b + K (y 1_m^T + E − H X^b), K = Σ^b H^T (H Σ^b H^T + R)^{−1}, Σ^b = X^b U (X^b)^T,

where E ∈ R^{d×m} has columns drawn i.i.d. from N(0_d, R). Likewise, the EnKF propagation transformation is the function that transforms a posterior ensemble X^a ∈ X^m into the next cycle prior ensemble,

b_EnKF(X^a) = M(X^a) + W,

where M is applied columnwise and W ∈ X^m is a column matrix consisting of m samples distributed according to the Gaussian pdf N(0_n, Q).
In the EnKF, as explained above, the mean and covariance matrix of the Gaussian pdf are computed from the ensemble, and propagation is performed on the ensemble using the nonlinear dynamics. For large-scale nonlinear systems, when one can only use a limited number of ensemble members, the error covariance matrix becomes rank-deficient. This leads to sub-optimal performance (Asch et al., 2016) and may introduce errors during the propagation: for instance, spurious correlations may appear or the ensemble may collapse. As a result, for a stable EnKF, regularization techniques like localization and inflation need to be applied (Hamill et al., 2001; Houtekamer & Mitchell, 2001; Gharamti, 2018). Localization consists in filtering out the long-distance spurious correlations in the error covariance matrix. It is not straightforward to find the optimal parameters for the localization, so some tuning is required. Even after filtering out these spurious correlations, so that the analysis is updated by the local observations, the use of a limited ensemble along the propagation can still be a problem. The resulting small errors may become problematic when they accumulate through the cycles, which can still lead to filter divergence. A common solution is to inflate the error covariance matrix by an empirical factor slightly greater than one. This multiplicative inflation compensates for errors due to the small ensemble size and the approximate Gaussian assumption on the error statistics (Bocquet, 2011).
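A single cycle of the stochastic (perturbed-observations) EnKF can be sketched as follows; the explicit-gain analysis form is one common variant, shown for illustration, and is not necessarily the exact form used by the baselines in Section 5.

```python
import numpy as np

def enkf_cycle(X_b, y, H, R, M, Q, rng):
    """One EnKF cycle on an n x m ensemble X_b (one member per column)."""
    n, m = X_b.shape
    U = (np.eye(m) - np.ones((m, m)) / m) / (m - 1)  # centering/scaling matrix
    Sig_b = X_b @ U @ X_b.T                          # empirical covariance X U X^T
    K = Sig_b @ H.T @ np.linalg.inv(H @ Sig_b @ H.T + R)          # gain
    E = rng.multivariate_normal(np.zeros(len(y)), R, size=m).T    # perturbed obs
    X_a = X_b + K @ (y[:, None] + E - H @ X_b)                    # analysis ensemble
    W = rng.multivariate_normal(np.zeros(n), Q, size=m).T         # model error draws
    return X_a, M(X_a) + W                           # columnwise propagation
```

In the DAN vocabulary, the ensemble matrix is the memory, the two lines producing X_a and M(X_a) + W are a_EnKF and b_EnKF, and c_EnKF decodes the ensemble into a Gaussian pdf.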

DAN log-likelihood cost function
In this section, we introduce a cost function which allows one to optimize the candidate conditional densities q^a_t and q^b_t based on samples of x_{1:T} and y_{1:T}. The discrepancy between the target conditional densities p^b_t and p^a_t and the candidate conditional densities q^b_t and q^a_t is minimized in the sense of the information loss, related to the cross-entropy (Cover & Thomas, 2005), which amounts to minimizing the expected negative log-likelihood

J(q) = Σ_{t=1}^T E[ −log q^b_t(x_t|y_{1:t−1}) − log q^a_t(x_t|y_{1:t}) ],

where the expectation is taken over the joint distribution of (x_{1:T}, y_{1:T}).
Proof. See Appendix A.
Theorem 1 shows that minimizing the objective function of DAN recovers the Bayesian prior and posterior cpdfs when the candidate pdfs belong to a general functional class (i.e. q ∈ P). However, the loss function J(q) cannot be numerically computed without making the functional class more specific. As a common special case, we next consider Gaussian candidate conditional pdfs.
Let G_X be the set of Gaussian pdfs over X. For q in Definition 1 to be well-defined, it is necessary to assume that the target prior and posterior distributions p^b_t(·|y_{1:t−1}) and p^a_t(·|y_{1:t}) have first-order and second-order moments. Under these assumptions, Theorem 2 shows that, using Gaussian pdfs, one can match the correct mean and covariance of the target prior and posterior cpdfs.
Proof. See Appendix B.
Theorem 2 indicates that DAN has the capacity to optimally capture non-linear dynamics in terms of first- and second-order statistics. Note that here optimality is defined with respect to the cost function (12). Contrary to KF-based approaches, DAN never uses Gaussian approximations in its internal computations; it only fits the output of the recurrent neural network with Gaussian pdfs.

DAN construction and training algorithm
Having specified the cost function in the previous section, we now discuss how to construct the components a, b, c of a DAN in order to fit training data samples. To motivate the DAN construction, we first review its connection with the classical Elman network in Section 4.1. We then specify the construction of a DAN using recurrent neural networks in Section 4.2. Sections 4.3 and 4.4 describe how to efficiently train the network.

Connection with Elman network
DAN can be interpreted as an extension of an Elman network (EN) (Elman, 1990), a basic recurrent network structure. An Elman network is a three-layer network (input, hidden and output layers) with the addition of a set of context units. These context units provide memory to the network. Both the input units and the context units activate the hidden units; the hidden units then feed forward to activate the output units (Elman, 1990). A representation of an EN is given in Figure 3.
The context units make the Elman network able to process variable-length sequences of inputs to produce sequences of outputs, as shown in Figure 3. Indeed, given a new input y_t ∈ Y in the input sequence, the function a updates a context memory ℓ_{t−1} ∈ C into a hidden state memory s_t = a(ℓ_{t−1}, y_t) ∈ S, and the function c decodes the hidden state memory into an output w_t = c(s_t) ∈ W in the output sequence. The updated hidden state memory is transferred to the context unit via a function b. In a way, the context memory of an Elman network is expected to gather relevant information from the past inputs to perform satisfactory predictions. The training process learns from data how to manipulate this memory.
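A minimal Elman step in NumPy makes this flow concrete; the layer sizes and the tanh activation are illustrative choices, not tied to DAN:

```python
import numpy as np

class ElmanCell:
    """Input y_t and context l_{t-1} activate the hidden units; the hidden
    state feeds the output layer and is copied back into the context units."""
    def __init__(self, n_in, n_hid, n_out, rng):
        self.W_in = rng.normal(scale=0.1, size=(n_hid, n_in))    # input weights
        self.W_ctx = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context weights
        self.W_out = rng.normal(scale=0.1, size=(n_out, n_hid))  # output weights
        self.bias = np.zeros(n_hid)

    def step(self, l_prev, y):
        s = np.tanh(self.W_in @ y + self.W_ctx @ l_prev + self.bias)  # hidden state
        w = self.W_out @ s                                            # output
        return s, w  # s is also the next context memory l_t
```

Iterating step over an input sequence, feeding each returned hidden state back as the next context, processes a variable-length sequence exactly as described above.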
The similarity between DAN and EN can be made explicit with the analogy that the hidden layer is connected to the context units by the function b, which, for DAN, includes time propagation. In DAN, the hidden unit memory S is the same set as the context unit memory C, and the function c decodes both the hidden and the context unit memory into a probability density function.
The EN cannot perform DA operations in full generality. First, the EN cannot make predictions without observations, that is, estimate strictly future states from past observations. This is because the function a performs both the propagation and the analysis at once: in a way, the EN only produces posterior outputs and no prior outputs, while the DAN produces prior or posterior outputs by applying the procoder c before or after the propagator b (see Figure 1 and Figure 3). The DAN can also produce strictly future predictions without observations by applying the propagator b multiple times before applying the procoder c. Second, the DAN provides a probabilistic representation of the state, i.e. an element of P_X instead of an element of X. Also, note that the composition of b and c makes a generalized propagation operator, as it propagates in time probabilistic representations of the state rather than punctual realizations.

Construct DAN using Recurrent Neural networks (RNN)
We propose to use neural networks to construct a parameterized family of DANs. Let θ denote all the weights of the neural networks, and let the memory space S be a finite-dimensional Euclidean space. The parametric analyzers and propagators are L-layer fully connected neural networks with residual connections, based on the LeakyReLU activation function (Bing et al., 2015) to improve trainability when L is large. For layer ℓ, the input v_{ℓ−1} is transformed into

v_ℓ = v_{ℓ−1} + LeakyReLU_{α_ℓ}(W_ℓ v_{ℓ−1} + β_ℓ), 1 ≤ ℓ ≤ L. (14)

Taking a vector v as its input, the LeakyReLU function outputs a vector w of the same size. For the i-th element of w,

w_i = v_i if v_i ≥ 0, and w_i = α v_i otherwise.

An extra linear layer is then applied to the output v_L in order to compute a memory state as the output of a_θ. The trainable parameters of a_θ are (α_ℓ, W_ℓ, β_ℓ)_{ℓ≤L} and the weight and bias of the final linear layer. As illustrated in Figure 1, the input of a_θ at time t is the concatenation of s^b_t and y_t, i.e. v_0 = (s^b_t, y_t). Similarly, b_θ is constructed from the same L fully-connected layers as in (14), using a different set of trainable parameters. The input of b_θ at time t is set to s^a_t.
The procoder c_θ is specified with respect to the choice of candidate conditional densities. For instance, for the Gaussian case studied in Theorem 2, c_θ can be defined as

c_θ(s) = N(µ(s), Σ(s)), (15)

which is a linear layer from S to R^{n+n(n+1)/2}, followed by a function that transforms the (n + n(n+1)/2)-dimensional vector into the mean and the covariance matrix of a Gaussian distribution. This transformation is detailed in Appendix C.
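The residual construction in (14) can be sketched as follows; square weight matrices are assumed so that the residual addition is well defined, and the parameter values are illustrative:

```python
import numpy as np

def leaky_relu(v, alpha=0.2):
    """Elementwise LeakyReLU: v_i if v_i >= 0, else alpha * v_i."""
    return np.where(v >= 0.0, v, alpha * v)

def residual_mlp(v0, layers, alpha=0.2):
    """L fully connected layers with residual connections:
    v_l = v_{l-1} + LeakyReLU(W_l @ v_{l-1} + b_l)."""
    v = v0
    for W, b in layers:
        v = v + leaky_relu(W @ v + b, alpha)
    return v
```

In a_θ and b_θ, a final linear layer (not shown) maps v_L to the memory state; for c_θ it maps to the n + n(n+1)/2 parameters of the Gaussian.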

Training and test loss from unrolled RNN
In order to train a DAN, we unroll the RNN defined by (a_θ, b_θ, c_θ) so as to define the training loss computed from I i.i.d. trajectories of (x_{1:T}, y_{1:T}). We also define the test loss used to evaluate the performance of training.
To make explicit how the states s^a_t and s^b_t depend on a_θ, b_θ and a given trajectory y_{1:t}, we denote the state (memory) at time t, informed by the data up to time t_1 and generated using a θ-parametric function, as s^{b,θ}_{t|t_1}. Then we can rewrite s^b_t and s^a_t more explicitly as

s^{b,θ}_{t|t−1} = b_θ(s^{a,θ}_{t−1|t−1}), s^{a,θ}_{t|t} = a_θ(s^{b,θ}_{t|t−1}, y_t),

where s^{a,θ}_{0|0} = s_0 is an initial memory of the RNN, independent of θ. The procoder c_θ outputs the pdfs q^{b,θ}_{t|t−1}(·|y_{1:t−1}) = c_θ(s^{b,θ}_{t|t−1}) and q^{a,θ}_{t|t}(·|y_{1:t}) = c_θ(s^{a,θ}_{t|t}).
To define the training loss computed from the I trajectories, we introduce a trajectory-dependent loss function which will be needed to define our online training strategy. Let (x^{(i)}_{1:T}, y^{(i)}_{1:T}) be the i-th trajectory; we write the loss function for the i-th trajectory as

J_i(θ) = (1/T) Σ_{t=1}^T [ −log q^{b,θ}_{t|t−1}(x^{(i)}_t | y^{(i)}_{1:t−1}) − log q^{a,θ}_{t|t}(x^{(i)}_t | y^{(i)}_{1:t}) ]. (17)

The training loss is defined accordingly as a function of θ,

J(θ) = (1/I) Σ_{i=1}^I J_i(θ). (18)

We define the test loss, as in (18), by using another I independent trajectories of (x_{1:T}, y_{1:T}). It allows one to evaluate how well a DAN learns the underlying dynamics of the HMM beyond the training trajectories.
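With Gaussian procoder outputs, each term of the loss above is a Gaussian negative log-likelihood, which can be computed as follows (a sketch; the paper's own parameterization of the covariance is the one detailed in Appendix C):

```python
import numpy as np

def gaussian_nll(x, mu, Sigma):
    """-log N(x; mu, Sigma), one term of the DAN training loss."""
    n = len(x)
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable log-determinant
    return 0.5 * (n * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))
```

Summing such terms over the prior and posterior outputs of the unrolled network, then averaging over t and over the I trajectories, gives the empirical training loss.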

Online training algorithm: TBPTT
Direct optimization of the training loss in (18) is impractical for large-scale problems, since computing the gradient of the loss with back-propagation through time requires a large computational graph that consumes a lot of memory (Jaeger, 2002). This limits the training data size T·I, which, in turn, might lead to overfitting due to limited data. A workaround is to resort to gradient descent with truncated backpropagation through time (TBPTT) (Williams & Peng, 1990; Williams & Zipser, 1995), commonly used in the machine learning community to train recurrent neural networks (Tang & Glass, 2018; Aicher et al., 2020).
Starting from θ_0, TBPTT is an online method which generates a sequence of model parameters θ_k for k = 1, 2, · · · , T. Instead of computing the gradient of the loss (18) with respect to θ, which involves all times from 1 to T, the idea of TBPTT is to truncate the computation at each iteration k by considering only the part of the gradient between times k and k + 1. More precisely, given the initial memories {s^{(i)}_0}_{i≤I} and θ_0, at iteration k we update the memories by one assimilation cycle with the current parameters,

s̄^{(i)}_{k+1} = a_{θ_k}(b_{θ_k}(s̄^{(i)}_k), y^{(i)}_{k+1}),

and then we perform the gradient update

θ_{k+1} = θ_k − η_k g_k, (19)

where g_k is the gradient with respect to θ of the one-cycle loss at time k + 1, averaged over the I training trajectories and computed with the carried memories s̄^{(i)}_k held fixed, and η_k is the learning rate (also called the step size in optimization). As a result, the optimization is no longer limited in time by computer memory constraints.
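The effect of the truncation can be illustrated on a toy scalar recurrence (a hypothetical example, not the DAN setting): the carried memory is treated as a constant at each step, so only the last cycle contributes to the gradient.

```python
import numpy as np

def tbptt_scalar(theta, xs, ys, lr=0.05):
    """TBPTT (truncation length 1) for s_k = theta * s_{k-1} + y_k with
    per-cycle loss (s_k - x_k)^2. Because s_{k-1} is held fixed
    ("detached"), the per-step gradient is 2 * (s_k - x_k) * s_{k-1}."""
    s_prev = 0.0
    for x, y in zip(xs, ys):
        s = theta * s_prev + y         # one cycle with current parameters
        grad = 2.0 * (s - x) * s_prev  # truncated gradient (s_prev fixed)
        theta -= lr * grad             # online parameter update
        s_prev = s                     # carry the memory to the next cycle
    return theta
```

Each update only ever differentiates through one cycle, which is why the memory footprint stays constant in T; in a deep-learning framework the same effect is obtained by detaching the carried state between cycles.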
To adjust the learning rate η k adaptively, we apply the Adam optimizer (Kingma & Ba, 2014) to the gradient in (19). This simultaneously adjusts the updates of θ k based on an average gradient computed from the gradients at previous steps.

Numerical experiments
In this section, we present results of DAN on the Lorenz-95 system (Lorenz, 1995) using the Gaussian conditional posteriors presented in Theorem 2. We first explain the Lorenz dynamics in Section 5.1 and provide experimental details in Section 5.2. Then, Section 5.3 evaluates the effectiveness of the online training method TBPTT. Section 5.4 compares the standard rmse performance of DAN to the state-of-the-art DA methods IEnKF-Q and LETKF using a limited ensemble memory. We further study the robustness of DAN in terms of its performance on future sequences beyond the horizon T of the training sequences, as well as its sensitivity to the initial distribution of each trajectory.

The Lorenz-95 system
The Lorenz-95 system introduced by Lorenz (1995) contains n variables x_i, i = 1, ..., n, governed by the n equations
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F, \qquad i = 1, \dots, n. \tag{20}$$
In Eq. (20) the quadratic terms represent the advection that conserves the total energy, the linear term represents the damping through which the energy decreases, and the constant term represents external forcing keeping the total energy away from zero. The n variables may be thought of as values of some atmospheric quantity in n sectors of a latitude circle.
In this study, we take n = 40 and F = 8, which results in chaotic behaviour. The boundary conditions are set to be periodic, i.e. x_0 = x_40, x_{−1} = x_39 and x_41 = x_1. The equations are solved using the fourth-order Runge-Kutta scheme with ∆t = 0.05 (a 6-hour time step).
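The dynamics and integration scheme above can be sketched compactly with NumPy; the periodic boundaries are handled by `np.roll`, and the perturbed initial condition is only an illustrative way to trigger the chaotic regime.

```python
import numpy as np

def lorenz95_tendency(x, F=8.0):
    """Tendency dx/dt of Eq. (20) with periodic boundary conditions:
    (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, indices taken modulo n."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.05, F=8.0):
    """One fourth-order Runge-Kutta step (dt = 0.05, i.e. a 6-hour time step)."""
    k1 = lorenz95_tendency(x, F)
    k2 = lorenz95_tendency(x + 0.5 * dt * k1, F)
    k3 = lorenz95_tendency(x + 0.5 * dt * k2, F)
    k4 = lorenz95_tendency(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# n = 40, F = 8: a small perturbation of the fixed point x_i = F is enough
# to enter the chaotic regime after some integration time.
x = np.full(40, 8.0)
x[19] += 0.01
for _ in range(1000):
    x = rk4_step(x)
```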

Experiment setup
We study the performance of DAN when trained to map to Gaussian posteriors, i.e. the procoder function c is given by (15). This is compared to two state-of-the-art EnKF baselines: the Iterative EnKF with additive model error (IEnKF-Q) (Sakov et al., 2018) and the Local Ensemble Transform Kalman Filter (LETKF) (Hunt et al., 2007).
A batch of I trajectories of x ∈ ℝ^40 is simulated from the resolvent (propagation operator) M: ℝ^40 → ℝ^40 of the 40-dimensional Lorenz-95 system. To start from a stable regime, we use a burning phase which propagates an initial batch of states $\{x^{(i)}_{\mathrm{init}}\}_{i\leq I}$ for a fixed number of cycles. The initial states are drawn independently from N(3 × 1_40, I_40). The operator M is then applied 10^3 times (the burning time) to the given initial batch of states (Sakov et al., 2018). The resulting states are taken as the initial states $x^{(i)}_1$.
After the burning phase, Gaussian propagation errors $\{\eta^{(i)}_t\}$, sampled independently from N(0_40, 0.01 × I_40), are added to each subsequent propagation to get the state trajectories. Gaussian observation errors ε_t are then added to the observations Hx_t to generate the observation trajectories. In the numerical experiments we consider two cases for the observation network: (1) fully observed, i.e. H is taken to be the identity operator I, and (2) partially observed, i.e. H is taken as a uniform selection operator H_0. For any 40-dimensional vector x, the vector H_0 x preserves half of the grid of x by removing its even-indexed elements. It is left as future work to study cases where H is a nonlinear operator.
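The data-generation pipeline (burning phase, then noisy propagation and observation) can be sketched as follows. The batch size, horizon, and the observation-error standard deviation are illustrative assumptions (the source does not state the observation-error variance); the propagation-error variance 0.01 and the burning time 10^3 follow the text.

```python
import numpy as np

def M(X, dt=0.05, F=8.0):
    """Resolvent of Lorenz-95: one RK4 step applied to a batch X of shape (I, n)."""
    f = lambda x: (np.roll(x, -1, -1) - np.roll(x, 2, -1)) * np.roll(x, 1, -1) - x + F
    k1 = f(X); k2 = f(X + 0.5*dt*k1); k3 = f(X + 0.5*dt*k2); k4 = f(X + dt*k3)
    return X + dt / 6.0 * (k1 + 2*k2 + 2*k3 + k4)

rng = np.random.default_rng(0)
I, n, T, burn = 8, 40, 50, 1000   # much smaller I, T than the paper's 1024 / 6e5

X = rng.normal(3.0, 1.0, size=(I, n))        # x_init ~ N(3 * 1_40, I_40)
for _ in range(burn):                        # burning phase of 10^3 cycles
    X = M(X)

H = np.eye(n)[::2]                           # H_0: keep every other component
xs, ys = [], []
for t in range(T):
    X = M(X) + rng.normal(0.0, 0.1, size=(I, n))   # propagation error N(0, 0.01*I)
    xs.append(X.copy())
    ys.append(X @ H.T + rng.normal(0.0, 1.0, size=(I, n // 2)))  # obs noise std assumed 1
```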

Setup of Baseline
The baseline methods, IEnKF-Q and LETKF, are implemented with explicit inflation or localization regularization in order to obtain a good estimate of the covariance matrix of the Gaussian densities. Such regularization is often critical to the final performance of EnKF methods, and it typically requires re-tuning the hyper-parameters whenever the ensemble size m changes (Asch et al., 2016).
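As a minimal illustration (not the baselines' implementations), multiplicative inflation rescales the ensemble anomalies about the ensemble mean by a factor ρ > 1, counteracting the underestimation of spread caused by sampling error:

```python
import numpy as np

def inflate(E, rho):
    """Multiplicative inflation: scale ensemble anomalies about the mean by rho.
    E has shape (m, n): m ensemble members of an n-dimensional state."""
    xbar = E.mean(axis=0, keepdims=True)
    return xbar + rho * (E - xbar)

E = np.random.default_rng(1).normal(size=(20, 40))  # m = 20 members, n = 40
E_inf = inflate(E, 1.1)  # rho = 1.1, the near-optimal value found for IEnKF-Q
```

The mean is unchanged while the ensemble spread grows by exactly the factor ρ.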
To illustrate the sensitivity to hyper-parameter tuning, we provide two sets of experiments for LETKF: (1) (with case-by-case tuning) the filter for each ensemble size is run with the best-performing values given in Table 1, found by a 2D grid search for each m; we name this LETKF*. (2) (without case-by-case tuning) the filter for each ensemble size is run with the best values obtained at m = 20, i.e. the grid search is only run for this m and the resulting optimal hyper-parameters are used for all m. We name these experiments simply LETKF.
We implemented IEnKF-Q with inflation regularization only. This allows one to measure the effect of using both inflation and localization regularization in LETKF. We present results without case-by-case tuning across the ensemble sizes m ∈ {5, 10, 20, 30}. As we do not have localization in IEnKF-Q, we fine-tune the inflation hyper-parameter of this method at m = 20 using grid search. We find that in both the fully observed and partially observed cases, a common inflation parameter of 1.1 is close to optimal among the values {1.0, 1.02, 1.03, 1.04, 1.07, 1.08, 1.09, 1.1, ...} of the grid search. For LETKF, the localization radius is chosen from the set {1, 2, 4} and the inflation hyper-parameter from the set {1.02, 1.03, 1.04, 1.07, 1.1}. We also use rotation after the analysis step, which has been shown to improve the performance of LETKF (Sakov & Oke, 2008). The inflation and localization radius hyper-parameter values that provide the best performance according to the time-averaged posterior (filtering) rmses are given in Table 1.

Setup of DAN
To make DAN comparable to EnKF in terms of memory use, we set the memory space S = ℝ^{m×n}. As in the LETKF and IEnKF-Q results without case-by-case tuning, the hyper-parameters of DAN are only tuned at m = 20, and then fixed across all m.
Across m ∈ {5, 10, 20, 30}, DAN is trained with a batch of I = 1024 training samples for T = 6 × 10^5 cycles. The initial learning rate η_0 for TBPTT is set to 10^{-4}. The initial memory s_0 of the RNN is set to zero, while the initial parameter θ_0 of the RNN is mostly set at random. More precisely, we use the standard random initialization for the weights (W, b) of each linear layer as implemented in PyTorch. To train a neural network with a large number of layers L, we use the ReZero trick (Bachlechner et al., 2020), which sets the initial weight α_ℓ in (14) to zero for each ℓ. The functions a and b in the cost function of DAN are constructed from L = 20 fully connected layers with residual connections (as detailed in Section 4).
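A residual fully connected block with the ReZero trick can be sketched in PyTorch as follows; the width 128 and the ReLU activation are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Fully connected residual block with the ReZero trick: the residual
    branch is scaled by a learnable alpha initialized to zero, so each
    block starts as the identity map (Bachlechner et al., 2020)."""
    def __init__(self, width):
        super().__init__()
        self.linear = nn.Linear(width, width)
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha_l = 0 at initialization

    def forward(self, h):
        return h + self.alpha * torch.relu(self.linear(h))

# e.g. the map a (or b) built from L = 20 stacked residual blocks
net = nn.Sequential(*[ReZeroBlock(128) for _ in range(20)])
h = torch.randn(4, 128)
out = net(h)   # at initialization the whole stack is the identity
```

Starting from the identity makes gradients well behaved even at L = 20, which is what allows the deep a and b networks to be trained without warm-up schedules.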

Training performance of TBPTT
To show the effectiveness of the training method TBPTT specified in (19), we evaluate the test loss J(θ) (defined in Section 4.3) using I = 1024 i.i.d. samples, on a sub-sequence of θ_k. This allows one to assess whether the online method is effective at minimizing the total loss J(q) in (12). The training time of DAN grows with T but is not sensitive to the choice of I, because our current implementation runs on GPUs, which allows the computation over the I training samples to be carried out in parallel. The sequential computation of TBPTT, however, cannot be parallelized. One potential improvement of the running time is to use a modified version of TBPTT with a better convergence rate, as suggested in (Chen et al., 2022, Algorithm 4.2).
The evolution of the test loss J(θ_k) over iterations k is displayed in Figure 4. We observe that the minimal loss decreases as m increases, suggesting that the performance of DAN improves with the memory size. Moreover, the test loss decreases during training, which shows that TBPTT implicitly minimizes the test loss J(θ). In theory, we expect this to happen for a suitably large memory size m, because m is proportional to the capacity of the neural networks used in DAN: a larger m implies a better approximation of the posterior distributions due to the universal approximation property of neural networks. The trade-off is that too large an m may lead to over-fitting (i.e. a large gap between the training loss and the test loss), as we only use I finite trajectories of (x_t, y_t) in the training algorithm.

Performance of DAN
After DAN is trained, new observation trajectories y_t are generated from a new, unknown state trajectory x_t. These test observations, together with a null initial memory vector, are then given as input to the trained DAN in a test phase, and its outputs are compared with the unknown states x_t.
To evaluate the accuracy of the trained DAN (k = T), we compute the accuracy of the mean $\mu^a_t$ (resp. $\mu^b_t$) of $q^{a,\theta_T}_{t|t}(\cdot|y_{1:t})$ (resp. $q^{b,\theta_T}_{t|t-1}(\cdot|y_{1:t-1})$), evaluated on a test sequence $(x_{1:T}, y_{1:T})$. A standard evaluation in DA is to compute rmses, i.e. for 1 ≤ t ≤ T we compute the normalized posterior and prior rmses $R^a_t = \|\mu^a_t - x_t\| / \sqrt{n}$ and $R^b_t = \|\mu^b_t - x_t\| / \sqrt{n}$. In Figures 5 and 6, we compare the averaged rmses of DAN with those of IEnKF-Q and LETKF when the ensemble size m is smaller than the dimension n of the state x_t. For DAN, we report an rmse averaged over t, computed at the parameter θ_T from the last step of training. These rmses are compared to the two baseline methods, IEnKF-Q and LETKF, over the same range of t. Recall that we use the same size m to define the memory space S = ℝ^{m×n} in DAN.
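The per-cycle score can be sketched as follows; the normalization by √n is the assumed definition, since the paper's displayed equation is elided here.

```python
import numpy as np

def normalized_rmse(mu, x):
    """Normalized rmse between an estimated mean mu and the true state x:
    R_t = ||mu - x||_2 / sqrt(n)  (assumed definition of the elided equation)."""
    return np.linalg.norm(mu - x) / np.sqrt(x.size)

# Toy check: a constant error of 0.5 per component gives an rmse of 0.5.
x = np.zeros(40)
mu = np.full(40, 0.5)
r = normalized_rmse(mu, x)  # -> 0.5
```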
Let us first analyze the numerical results for the fully observed case. When m is small, IEnKF-Q performs worse than DAN, due to sampling errors. Note that with the choice F = 8 in Eq. (20), the model has 13 positive and one neutral Lyapunov exponents, i.e. the dimension of the unstable-neutral subspace is 14 (Trevisan et al., 2010; Bocquet & Carrassi, 2017; Sakov et al., 2018; Carrassi et al., 2022). Therefore, when the model is propagated through time, small perturbations grow along these directions (Carrassi et al., 2022). This explains why IEnKF-Q does not perform well when m ≤ 14; as a result, localization and inflation techniques are needed to reduce these sampling errors. As expected, LETKF and LETKF*, in which localization and inflation are applied with tuned parameter values, perform much better than IEnKF-Q. DAN performs similarly; when m = 5, it is slightly better than LETKF* in the fully observed case. LETKF performs much worse than LETKF* and DAN, showing how sensitive the method is to the tuning of the inflation and localization hyper-parameters. When m becomes closer to n (e.g. m = 20, 30), the posterior and prior rmses of DAN, IEnKF-Q, LETKF and LETKF* are similar, with better results for LETKF*. This tendency of the rmses as a function of the ensemble size m is strongly correlated with the smallest test loss achieved by DAN in Figure 4. For the partially observed case, the conclusions are similar. These experiments clearly show that DAN can achieve comparable performance without using any EnKF-type regularization techniques.

Predictive performance and sensitivity to initialization
As DAN is trained on the time interval t ≤ T, it remains important to evaluate its predictive performance by considering how well it performs for t > T. Such performance can be measured by the average rmses over T + 1 ≤ t ≤ 2T instead of over 1 ≤ t ≤ T, evaluated with the trained model parameter (θ = θ_T). The posterior rmses for the fully observed case are provided in Table 2. We find that the rmses over T + 1 ≤ t ≤ 2T are close to those over 1 ≤ t ≤ T. This suggests that DAN has learnt the dynamics of the Lorenz system well enough to perform well on future trajectories. All the earlier results concern the performance of DAN under a fixed burning time. Keeping this burning time for the training of DAN, we further evaluate the rmses on test sequences generated with different burning times. This allows us to indirectly assess how well the recurrent structures inherited from the HMM are learnt. The results for the ensemble size m = 20 are given in Table 3. They show that the performance of DAN is not sensitive to the distribution of the test sample x_1, initialized over a wide range of burning times.
We remark that in all the simulations there is a relatively large error in $R^a_t$ and $R^b_t$ for small t, which then decreases very quickly (e.g. for m = 20 and a burning time of 1000, both $R^a_t$ and $R^b_t$ settle close to a constant level when t ≥ 20). This transition is needed for DAN to enter a stable regime, because the initial memory of the RNN is set to zero.

Conclusions
Based on the key observation that the analysis and propagation steps of DA consist in applying time-invariant transformations a and b that update the pdfs using incoming observations, we propose a general framework, DAN, which encompasses well-known state-of-the-art methods as special cases. We have shown that, by optimizing suitable likelihood-based objective functions, the posterior densities represented by these transformations have the capacity to approximate the optimal posterior densities of BDA. By representing a and b as neural networks, the estimation problem takes the form of minimizing a loss with respect to the parameters of an extended Elman recurrent neural network. As a result, this general framework can be used for nonlinear dynamics and non-Gaussian error statistics.
In practice, we need to define the pdfs used in the calculation of the loss function. As a first step, and to be able to compare the performance of DAN with state-of-the-art ensemble methods, we perform numerical experiments with a procoder c which outputs a Gaussian pdf. Our numerical results on the 40-dimensional chaotic Lorenz-95 system show that when the ensemble size is small, DAN performs similarly to LETKF, which includes regularization techniques such as localization and inflation. For large ensemble sizes, DAN has performance similar to IEnKF-Q and LETKF. This indicates that the DAN framework has the advantage of avoiding some problem-dependent numerical tuning techniques. We also find that DAN is robust in terms of its predictive performance and its initialization.
Although we use a Gaussian approximation of the posterior densities in the procoder c, the memory space S may still encode non-Gaussian information about the posterior distributions. Analyzing why DAN can handle problems with nonlinear dynamics (including other nonlinear dynamical systems) is left for future study. From a practical point of view, DAN in its current form is not scalable to DA problems in which the dimensionality n is very large (e.g. of the order of 10^9). To make DAN scalable, different training strategies (Penny et al., 2022) will be considered in the future.