Uncertainty Quantification When Learning Dynamical Models and Solvers With Variational Methods

In geosciences, data assimilation (DA) addresses the reconstruction of a hidden dynamical process given some observation data. DA is at the core of operational systems such as weather forecasting, operational oceanography and climate studies. Beyond the reconstruction of the mean or most likely state, the inference of the state posterior distribution remains a key challenge, that is, quantify uncertainties as well as to inform intrinsical stochastic variabilities. Indeed, DA schemes, such as variational DA and Kalman methods, can have difficulty in dealing with complex non‐linear processes. A growing literature investigates the cross‐fertilization of DA and machine learning. This study proposes an end‐to‐end neural scheme based on a variational Bayes inference formulation to jointly address DA and uncertainty quantification. It combines an Evidence Lower BOund variational cost to a trainable gradient‐based solver to infer the state posterior probability distribution function given observation data. The inference of the posterior and the trainable solver are learnt jointly. We demonstrate the relevance of the proposed scheme for a Gaussian parameterization of the posterior and different case‐study experiments, including Lorenz 63 dynamics and river flow measurements. A benchmark with respect to state‐of‐the‐art schemes is provided.


Introduction
The reconstruction and forecasting of dynamical systems from available observations are key challenges in earth sciences (see e.g., Welch & Bishop, 1995).These tasks have been classically addressed by data assimilation (DA) approaches, especially variational DA and ensemble Kalman schemes (see e.g., Evensen, 2009).DA methods have greatly improved over years, especially by accounting for model error, which is important when dealing with misrepresented physical processes (Machenhauer & Kirchner, 2000) and unresolved small-scale processes (Hamill & Whitaker, 2005) with respect to the space-time model resolution.Whereas Kalman-based ensemble methods (Evensen, 1994;Gordon et al., 1993) do take into account model error from the beginning, in a variational setting, operational systems based on 4D-Var (Rabier et al., 2000) moved from strong constraints assumptions (Le Dimet & Talagrand, 1986) to weak constraints one (see e.g., Trémolet, 2006) to reach this goal.In both approaches, estimating the model error in the form of a model error covariance matrix becomes crucial.

10.1029/2022MS003446
2 of 17 allow to estimate uncertainties of the predicted state and have to be specifically tuned to this purpose (Isaksen et al., 2010), while Kalman-based ensemble methods provide a Gaussian estimate of the posterior distribution of the state through a covariance matrix updated at each time step (see e.g., Evensen, 2003;Evensen & Van Leeuwen, 2000) which is relevant in Gaussian-linear case and typically fail in cases with strong non-linearity (Evensen et al., 2022).Particle filters (Gordon et al., 1993;Van Leeuwen, 2009) are the main methods for sampling the full posterior probability distribution function (pdf), but they suffer from curse of dimensionality when dealing with high-dimensional states (Snyder et al., 2008).This may prevent their application to real-world cases.As variational Bayes (VB) refers to the field of research dedicated to approximating the full posterior of latent variables of a Bayesian model given observation data (Blei & Jordan, 2006;Jordan et al., 1999), we note that assessing the uncertainties in the predicted state is indeed a VB problem.Inferring the posterior through a VB formulation often requires to maximize an Evidence Lower BOund (ELBO; see e.g., Hoffman et al., 2013).To this end, learning-based approaches appeared to be particularly relevant (Kingma & Welling, 2013).
Recently, a rich literature has emerged to apply machine learning (ML) paradigms to address DA issues.ML schemes are particularly efficient to solve complex and high-dimensional optimization problems and encounter numerous successes including image classification (Krizhevsky et al., 2012;Le, 2013), natural language processing (Otter et al., 2020), language translation (Sutskever et al., 2014) computational physics (Mohan et al., 2023;Raissi et al., 2019).Regarding DA, ML-based algorithms offer new means to learn the governing equations of the dynamics (Fablet et al., 2018;Long et al., 2018) and the associated flow operator (Bocquet et al., 2020;Scher & Messori, 2019), or model correction terms (Farchi et al., 2021), directly from model outputs.Some approaches are even designed to be used in a plug-and-play manner in state-of-the-art DA schemes (Fablet, Chapron, et al., 2021).When considering variational DA, trainable emulators of the adjoint operator of the dynamics (Nonnenmacher & Greenberg, 2021) or directly of the gradient-based DA solver (Fablet, Chapron, et al., 2021) emerged as appealing solutions.Similarly, recent studies have explored learning-based Kalman techniques (de Bézenac et al., 2020).The latter is particularly relevant to address uncertainty quantification.The underlying assumption of the existence of the linear-Gaussian latent space may however restrict their application in real-world case-studies.Generative adversarial networks also naturally arose as appealing ML tools to develop new ensemble DA schemes (Silva et al., 2023).
In this paper, we propose a ML-based approach to consistently approximate by a Gaussian distribution the posterior distribution of the state of a dynamical system given a set of observations.This involves estimating both the mean and the covariance parameters of the Gaussian distribution.Since we are producing probabilistic predictions, the standard mean square error (MSE) is not appropriate as a learning cost.Instead, we choose the logarithmic score as the learning function which is consistent with probabilistic predictions.Our approach relies on a training stage where both true states and observations are available.To circumvent the instability when minimizing the chosen learning function, we constrain our output parameters to be close to an optimum with respect to another cost derived from a VB inference formulation.We prove that the optimum of this cost should be a good first-guess of the minimum of our learning function.Our end-to-end architecture exploits a trainable surrogate representation of the dynamics and a trainable gradient-based solver.It can therefore be considered as an extension of Fablet, Chapron, et al. (2021) to estimate the covariance of the posterior in addition to the mean.To the best of our knowledge, this is the first study which combines a trainable solver for variational DA along with a VB formulation.We claim that our approach could be extended to broader families of posteriors than Gaussian.This paper is structured as follows.Section 2 introduces necessary background on weak-constraint variational DA.Section 3 presents the proposed approach, based on ELBO maximization, and the associated end-to-end neural framework.Numerical experiments on Lorenz 63 dynamics and discharges on Danube river network are reported in Section 4. Finally, concluding remarks are provided in Section 5.  (Sasaki, 1970) states the reconstruction or forecasting of   given   as the minimization of the following cost:

Background on Weak-Constraint Variational Formulation
where, to match notation of Fablet, Chapron, et al. (2021), we have defined   as the following operator Note that in Equation 2, we deliberately omit the background term used to measure the distance to a given background state, which acts as a Tikhonov regularization term on the minimization issue.We made this choice because our approach does not require the explicit use of a background term in a cost function.On the right side of Equation 2, the first term represents the data fidelity term with respect to the observations, whereas the second one penalizes the discrepancy between the state and the underlying dynamics.Tandeo et al., 2018;Trémolet, 2007) to correctly estimate   .Lag-innovation (Belanger, 1974), and Bayesian inference-based methods such as Stroud et al. (2018) and Tandeo et al. (2015) addressed the estimation of these matrices.

Proposed Approach
The minimization of the variational cost of Equation 2 allows to estimate the state x but not to approximate the whole posterior distribution   (𝑥𝑥|𝑦𝑦) .We propose a deep learning scheme which approximates the posterior by a Gaussian distribution.In Section 3.1, we derive a new cost, named stochastic variational cost, to estimate covariances in addition to the mean state.Then, based on the work of Fablet, Chapron, et al. (2021), we introduce a deep learning scheme in Section 3.2 that imposes its outputs to be close to a minimum of the stochastic variational cost.Our deep learning scheme consists of two elements, a neural solver of the stochastic variational cost, and a surrogate model over posterior parameters.Finally, in Section 3.3 we explain how both elements of our approach could be learned jointly from ground-truth data with respect to a logarithmic score.This score allows us to evaluate the quality of the approximation we make to the true posterior.In contrast to Kalman methods (Evensen & Van Leeuwen, 2000), our approach does not rely on the prior computation of the model error covariance matrix.

Deriving Stochastic Variational Cost Through Variational Bayes Formulation
We consider the state-space formulation of Equation 1.In the following,   is a linear operator such that where KL denotes the Kullback-Leibler divergence which measures how two distributions differ from each other, and is given by: ) .
Maximizing the ELBO can then lead to a computationally-tractable maximization of a lower-bound of the likelihood  () (Hoffman et al., 2013).Thus, VB inference consists in maximizing the ELBO with respect to   , so   approximates the posterior distribution.
Let us further assume a Gaussian parametrization for target pdf  (|) and a Gaussian additive noise model for observation likelihood   (𝑦𝑦|𝑥𝑥) .In practice, we set and Following Appendix B, we then derive: up to a function of R.Under the assumption that norm of the posterior covariances is significantly smaller than that of the observation covariance, this term reduces to With regards to the Kullback-Leibler divergence in ELBO expression of Equation 4, an analytic expression is only tractable for some specific priors.By analytic expression, we mean an expression built with well-known operations that lend themselves readily to calculation.For illustration purposes, let assume a Gaussian prior whose pdf satisfies  () = ∏   ∈Ω   (() ;  ) , then we can derive the following analytical expression: In the general case, that is, without assuming any specific form for the prior, we can only state that − KL( ( | )|| ( )) − KL( ( | )|| ( ))is a non-positive function of the approximate posterior parameters  = {() = ((), ()),  ∈ Ω} .Let us call g this non-negative function.To match the generic formulation of the prior term in Equation 2, we consider the following form for  () : where   is an operator on time-series space.
This form is widely used in ML regularizing techniques experimented by Ryu et al. ( 2019); Venkatakrishnan et al. (2013) and referred to as plug-and-play methods for inverse problems.Besides, as detailed in Appendix C, we may note that Equation 6 may be rewritten in this form.Since the prior is left unspecified,   is unknown, and we rely on an estimator  Φ of   to compute   .Overall, from the ELBO formulation, we infer the cost given by As long as  Φ is a valid approximation of   , the minimum of such a cost with respect to   should be a good solution for the posterior approximation.Notice that Equation 8 can be viewed as a variational cost associated with an augmented state space formulation on the posterior parameters, which is why we call it stochastic variational cost.

Proposed Neural Architecture
Within a learning setting, the approximate posterior is parameterized by a set   of weights and biases of a neural network (NN) framework, and is denoted  (|) .Additionally, let us give ourselves an initial state   (0) for the parameters of the posterior approximation, which depends on   .For example, we can choose as initial mean state the linear interpolation between available observations, and as initial covariance matrix the identity matrix.Then, LAFON ET AL.

10.1029/2022MS003446
5 of 17 our approach takes as input the initial state   (0) and the observations   , and outputs the parameters of the target distribution.In our approach,   , as defined in Section 3.1, is a function of   ,   (0) and   .We then write the output of our approach  ( (0) , ) .Note that this implies that each , that we will detail in Section 3.3, which allows to estimate the proximity between the true posterior and its approximation by the target distribution.
The rest of this section is devoted to describing our architecture in Section 3.2.1 and the reasons why we chose it in Section 3.2.2.

Neural Set-Up
Our end-to-end approach is made of two key ingredients: a neural parameterization for the operator  Φ , and a trainable gradient-based solver of the stochastic variational cost defined in Equation 8.  Φ is parameterized as a convolutional NN with specific constraints.The neural solver is a recurrent NN with stacked long short-term memory (LSTM) cells which implements a gradient-based solver for the targeted cost function.As our framework relies on two different components, remark that we can write  = { Φ, } with  Φ the NN parameters of  Φ and   the NN parameters of the solver.From a coding perspective, the proposed neural architecture was implemented using PyTorch framework.Figure 1 shows the working principle of our end-to-end architecture.
Architecture of  Φ . Φ is a convolutional NN with specific constraints, known as Gibbs Energy NN (Fablet, Beauchamp, et al., 2021).More precisely, we have  Φ() = 1•2() .f 2 is a convolutional layer where the central values of all convolution kernels are set to zero such that f 2 (θ)(t j ) does not depend on θ(t j ).f 1 is a convolutional NN which composes a number of convolution layers with rectified linear unit activation, where the kernel size of all convolution layers is 1 along time and space dimensions.In the experiments, f 1 has 3 convolution layers.
Neural solver parametrization.The minimization with respect to θ of the stochastic variational cost (Equation 8) is performed by means of a neural solver.We use a residual NN architecture with LSTM blocks (Schmidhuber & Hochreiter, 1997).Each block is fed on one side with the increment between the estimated parameters at the entry of the block and the input parameters θ (0) , and on the other side by the gradient of the stochastic variational cost with respect to θ applied on the current estimated parameters.This solver optimizes iteratively the estimated parameters.To be more explicit, after k iterations in the LSTM-based solver, the parameters are updated as follow: ) ,  ) , ℎ () ,  () ) , with α a scalar parameter, h (k) , c (k) internal states of the LSTM model and   a linear layer.The number of iterations in the LSTM-based solver has been tuned during experiments and optimal values are comprised between 10 and 20 iterations.

Motivation
Combination of  Φ and the neural solver.Optimizing an inference score S can be very complex, so appropriately constraining the model is a fast and efficient solution to converge quickly to an optimum.We demonstrate in the developments of Section 3.1 that minimizing the cost of Equation 8approximately equates to maximize the ELBO inference cost.The chosen architecture allows to constrain the model by making sure via the learned solver that the output θ ω (θ (0) , y) is close to a minimum of Equation 8. To summarize, we look for the best solution in the sense of inference among the suitable solutions in the sense of stochastic variational cost.The idea of a learned dynamical operator coupled with a learned neural solver was introduced in Fablet, Chapron, et al. (2021).As the formulation of Equation 8 is somehow similar to that considered in Fablet, Chapron, et al. (2021), we adapt their architecture to our case.
Choice of a Gibbs Energy NN for  Φ .From Equation 8, we note that the minimum of the stochastic variational cost with respect to  Φ is reached whenever  Φ is equal to the identity, whatever θ is.Letting  Φ be equal to the identity suppresses the constraint corresponding to the second term on the right-hand side of Equation 8. Thus,  Φ would become a function of μ(t i ) and y.Consequently,  Φ would remain equal to Id, and covariance parameters would remain constant throughout the remainder of the training phase.This has to be prevent since we want to keep optimizing the covariance parameters during training.To this end, the Gibbs energy NN forces the convolutional NN to differ from the identity operator.Additionally, thanks to this constraint parametrization,  Φ can be interpreted as a surrogate model over the mean and covariance parameters of the target distribution.Notice that other choices of NN representation could have been made, such as convolutional auto-encoder.For an intercomparison, we refer to Beauchamp et al. (2020).
Choice of a LSTM for the solver.NNs with LSTM cells belong to the class of recurrent NN.They are particularly suitable for processing sequential data.In our case, our working data is a sequence of time-space series θ (k)  obtained by gradient descent (see Equation 9).LSTM-based updates are the classical parameterization of learned solver schemes (see e.g., Andrychowicz et al., 2016;Hospedales et al., 2021).

Learning Setting
In our experimental setting, we have access during training stage to a data set of true states x = {x (i) , 1 ≤ i ≤ m}, and corresponding observation data set y = {y (i) , 1 ≤ i ≤ m}, with x (i) and y (i) realizations of the discretized setting given in Equation 1.The outputs of our approach θ ω (θ (0) , y (i) ) is composed of means and covariances denoted   ()  () and   ()  () for t j ∈ Ω t , where dependence on y (i) is indicated by upper indices to keep the notation uncluttered.In this context, a commonly used method to evaluate the performed DA approach is the MSE.This criterion measures the distance in the mean square sense between the true state of the system and the average state predicted by the approach.In the case of our approach, this corresponds for a time series x (i) to the following cost: where ‖.‖ 2 is the Euclidean norm.The score of Equation 9 is denoted R-score in the following, which stands for reconstruction score.
However, this metric only allows us to compare the mean of the random vector x|y with the mean of our approximated posterior.This is insufficient if we want to compare our posterior approximation with the whole true posterior distribution.The right framework to assess statistical forecast is through proper scoring rule (Dawid & Musio, 2014;Gneiting & Raftery, 2007;Tsyplakov, 2013).A scoring rule is a function S(q, x) of a pdf q and an outcome x.By extension, we denote  ( ) = ∼()( ) .A scoring rule is, by definition, said to be proper if: It is further strictly proper if the equality holds only for q = p.
Even if the distribution forecast depends on observations as in our approach, using a proper scoring rule is still consistent, as proved by Tsyplakov (2011) and Holzmann and Eulert (2014).In this context the logarithmic score defined by is a strictly proper scoring rule (Dawid & Musio, 2014).That is why we set our training objective L as the minimization of the opposite of the logarithmic score, which leads to: where we have deliberately omitted the constant   2 log 2  .We denote this criterion P-score for probabilistic score in the following.The P-score is also known as negative log-likelihood.Notice that the R-score and the P-score are proportional only when the covariance of the approximate posterior reduces to a constant scalar covariance matrix.The mean R-score and P-score over the whole data set x is given by averaging respectively Equations 9 and 10 over the m couples of true states and observations of the data sets x and y.
The parameters ω of our network are optimized to minimize the P-score by the stochastic gradient descent Adam available in PyTorch.In our experimental learning setting, we set a batch size of 64 and a maximum number of 1,000 epochs.At predefined epochs, the learning rate is decreased.It ranges from 10 −3 to 10 −7 .The parameterization for which the P-score is the lowest on the validation data set is saved.We let the reader refer to the code available online (https://doi.org/10.5281/zenodo.7729564)for additional details on the implementation.

Numerical Experiments
To assess the relevance of the proposed approach, we consider two case-studies: namely, the Lorenz 63 dynamics and an application to a real data set corresponding to the monitoring of Danube river discharges.In the following, our approach will be referred to as 4D-VarnetSto.The baseline approach is the Ensemble Kalman Smoother and will be abbreviated as EnKS.The different approaches will be evaluated against two main criteria: the average P-score (Equation 10) and the average R-score (Equation 9) over the test data set.

Standard L63 Dynamics
The Lorenz dynamics is a system made of the following ordinary differential equations (Lorenz, 1963): (11) We use the following parametrization: σ = 8, ρ = 28, and   = 8 3 .In this setup, the Lorenz 63 system has a chaotic dynamics.A fourth-order Runge-Kutta integration scheme (Butcher, 1996) with 0.01 time step enables us to simulate the time series.Figure 2a is a trajectory of this dynamics for 200 time steps.

Stochastic L63 Dynamics
In order to introduce model noise in L63 dynamics, we use the stochastic framework designed by Chapron et al. (2018).It intends to mimic stochastic behavior in large-scale geophysical flow dynamics.The ordinary differential equation (Equation 11) becomes a stochastic differential equation: dB t is a white noise, formally the difference of a standard Brownian motion.Γ is the new parameter of our model which is fixed to 2 in our experiments.Note that if Γ → ∞, we recover the original model.Figure 2b is a three-dimensional plot for a time series of this stochastic Lorenz 63 version.Adding the model noise strongly deteriorate the smoothness and the convergence to standard Lorenz attractor.

Training Setting and Results
For both dynamics, we consider a time series of 200,000 time steps.From this time series, we create a training set containing 10,000 sub-series of 200 time steps, and validation and test sets each consisting of 2,000 sub-series of 200 time steps.The sub-series overlap within a data set but do not overlap from one data set to another.Observations of the true state are made available solely for the first variable of the system, every 8 time steps, adding a white Gaussian observation noise of variance set to 2.
Including the parameters of the neural solver and those of  Φ , our network has roughly 19,000 parameters to learn.
We train our NN in two stages.First, for each time series, the initial state θ (0) = {μ (0) Σ (0) } is initialized as follows: • μ (0) is the linear interpolation between observations for its first variable and the mean of the observations for the other variables; • Σ (0) is the identity matrix.
We find a first optimum while constraining the estimated covariance matrix to be diagonal.In a second step, we start a new learning session to find a non-diagonal covariance matrix using the previously found optimum as initial state θ (0) .This two-step procedure aims to force the covariance matrix to be definite and positive during the training process.Imposing positive-definiteness directly on the whole output matrix is not an easy task while  12) for 200 time steps of 0.01 length each.
in the diagonal covariance matrix case this is easy to enforce.Indeed, it only requires strictly positive values for the outputs on the diagonal, and zeros elsewhere.So first, we find an optimal diagonal covariance matrix, then we search for a complete covariance matrix by perturbing this optimum.
We compare our method with the EnKS of Evensen and Van Leeuwen (2000).In our experiment, the EnKS has 500 ensemble members and a time lag of 30 time units.No inflation is used.We have chosen a very large ensemble size because we want to be sure to correctly represent the approximation of the posterior made by the EnKS.Indeed, we mainly want to compare the quality of the approximation of the posterior made by the different approaches.For both dynamics, the EnKS is run through 20,000 time steps and evaluated on the last 15,000 time steps to be sure the calibration phase is over.Notice that in the stochastic dynamics case, the model error matrix of the EnKS is a diagonal matrix constant over time which coefficients are obtained by averaging the model error.Thus, in the stochastic case, we expect our approach to approximate the posterior far better than the EnKS as it does not rely on a imperfect model and an approximate model error matrix.Table 1 compiles the results for the appropriate scores.If the first variable is observed for both our approach and EnKS, the 4D-VarnetSto outperforms the EnKS in each score for both dynamics.By adding observed variables in EnKS experiment, the R-score and P-score decrease.For the standard dynamics, the R-score for the EnKS with at least two variables observed become lower than its value in the 4D-VarnetSto experiment, but the P-score stays above.This confirms that our posterior approximation is in any case better that the one proposed by the EnKS.As for the stochastic dynamics, the conclusion are rather similar.The R-score of our approach with one observed variable is better than the one of the EnKS.Again, regardless of the number of variables observed, the P-score is much lower using our approach than using the EnKS, and by even larger amounts than in the deterministic experiment.To conclude with the results of Table 1, we can state that in identical settings, our approach outperforms by far the EnKS in both criteria.Adding observed variables to the EnKS allows to obtain better R-score than our approach but the P-score stays above, which indicates that our approach is better suited for estimating the whole posterior than EnKS.As a side remark, our R-score is similar to the one reported by Fablet, Chapron, et al. (2021) (R-score of 1.34 in Table 1).This is a very good thing, as it indicates that adding complexity to their model does not deteriorate the quality of the state prediction.
Figure 3 compares estimated states (orange curve) and the associated 95% confidence interval (green area) with the real states (blue curve) defined by Equation 11 in the context of standard dynamics.Figure 4 presents the same elements for the stochastic dynamics defined by Equation 12.Both figures represent time series for which the attractor changes its wing.The change of wing is realized when the variables x 1 and x 2 simultaneously go from a maximum to a minimum or vice versa.In Figure 3, the mean state estimated by our approach (top three graphs) and the true state of the system are almost merged.Moreover, the area representing the uncertainty is also very thin but widens for a given variable when an extremum is reached.The uncertainty is slightly larger for the unobserved variables x 2 and x 3 than for the observed variable x 1 .Comparatively, the state reconstructed by EnKS when only x 1 is observed (middle three graphs) coincides less well with the true state.The uncertainty is also larger, especially during the wing change (between t = 50 and 125).When the three variables are observed for the EnKS (bottom three graphs), the real state and the reconstructed state are difficult to distinguish, the area representing the uncertainty is very narrow and widens slightly during the wing change.In Figure 4, we first note that the EnKS with only x 1 observed performs poorly.It does not succeed in correctly representing the dynamics (middle three graphs).When observing the three variables for the EnKS (bottom three graphs), the estimated state becomes accurate.However, the true state curve is almost never contained within the confidence interval.This visually confirms the poor results obtained on the P-score and indicate that the posterior approximation is not accurate.On the contrary, we observe that the confidence interval estimated by our approach seems consistent (top three graphs).The true state curve is globally contained within a fairly narrow confidence interval.11), "Yes" implies stochastic one (see Equation 12).Only the first variable is observed when performing 4D-VarnetSto.In EnKS experiments, from one to all variables are considered as observed.Two benchmark score are evaluated: the MSE of the reconstruction of the true state (R-score, see Equation 9), and the mean of the negative log-likelihood of the predicted parametric distribution applied in true state (P-score, see Equation 10).12).See Figure 3 for details.

Danube River Network for Discharge Measurements
The upper Danube basin is an European river network whose drainage basin covers a large part of Austria, Switzerland and of the south of Germany.Figure 5 shows the topography of the Danube basin as well as the locations of the 31 stations at which daily measurements of river discharge are available.Stations considered as observed or unobserved in our experiment are colored differently.The daily measurements series have lengths from 51 to 110 years.We restrict ourselves to the period 1960-2010 for which all stations have available measurements.This data set have been widely studied in the community of multivariate extremes (see for example Asadi et al., 2015;Mhalla et al., 2020).
This experiment with a real data set aims to meet several objectives.Learning an unknown dynamics and associated uncertainties is challenging.The data-driven models that can be learned lacks important variables (precipitation, snowmelt) to be highly reliable, and consequently encompass high error model.Thus, the ability of our approach to adapt to a high level of model error is studied.Finally, the approximation of the posterior made by our approach is compared to a Gaussian approximation which we call constant covariance approach.In this comparative approach, the mean state is estimated using the approach described in Fablet, Chapron, et al. (2021), and the covariance matrix is a diagonal matrix whose coefficients are constants and set as the variance of the error at each station.
In this experiment, we consider that the observed data correspond to the state of the system.It is equivalent to consider no observation noise, namely   () =  () for each i, where   () is the restriction of x (i) to O T .To avoid divergence of the P-score on the set of observations O t , we modify the P-score slightly by redefining it as follows: where N t is the cardinal of Ω T \O T .Given the spatial dimension of the state, we limit ourselves to output diagonal covariance matrix.Consequently, our NN is trained using only the first step of the process described in Section 4.1.3.The initial state θ (0) is also defined as described in this first step.In order to leave the stochastic variational cost defined, we set R to the identity in Equation 8. Using the criterion of Equation 13, half of the stations are considered to be observed every 4 days (see red locations in Figure 5).We consider time series of 48 consecutive days.For each time series, our goal is to estimate the mean and covariance of the approximate posterior distribution of flow on each day of the time series and at each station, including where observations are missing.In training setting, we assume that some stations are observed (red dots) and the other are unobserved (black squares).We further assume that the observed stations have available observations only once every 4 days.
create 53 time series for validation set and as many for the test set.Note that within a data set, time series are overlapping.Figures 6 and 7 show the estimated mean state (red curve), the confidence interval (green area) and the daily measurements (blue dots) for a summer and winter month, respectively.The stations are identical from one figure to another.
Seasonality plays an important role in discharge analysis, and here, we focus on the summer and winter seasons.
In summer, flows are lower than in winter and subject to important variations in absolute value.This is linked essentially to snow or ice melts at altitude, as well as to episodes of heavy precipitation.For similar reasons, different station elevations, and thus different positions along the river system, were chosen.Stations upstream of the river system have lower flows than those downstream.Flows at upstream stations vary greatly depending on local weather and climate events.
The relative variance estimated by our approach is larger in Figure 6 than in Figure 7.This finding is consistent with the initial considerations about variances in summer and winter.The estimated variance is also more constant in summer than in winter.One can assume that the model error is such that it becomes difficult to detect patterns that would reduce the uncertainty.In winter, on the other hand, the estimated confidence interval varies significantly, and seems to widen at the peaks reached by the flow.We notice that our predictions are sometimes biased for a large number of consecutive time steps.This is particularly true in Figure 7 where a negative bias between the observations and the predicted mean exist.It is visible for downstream and mid-river unobserved stations between 21 January 2001 and 1 February 2001.The presence of available observations drastically reduces the bias.
In order to compare our approach with the constant covariance approach, we average the P-score and R-score restricted to Ω t \O t over the test data set, for both our approach and the comparative constant covariance approach.
As the discharges at different stations have not the same order of magnitude, we rescaled the discharges at each stations to a time series with mean 0 and standard deviation sets to 1 before training both approaches.The scores for the rescaled discharges are given in Table 2.We find that estimating the covariance in addition to the mean state does not degrade the R-score.Indeed the R-score obtained by our approach and by the constant covariance approach are almost identical.Moreover, we significantly improve the P-score over constant covariance approach and we can infer that the variations of variances given by our approach allow a significant improvement of posterior approximation.

Conclusion
Based on previous works which introduced an end-to-end learning framework for variational assimilation problems, we extend this approach to uncertainty quantification.Using a stochastic variational cost derived from an ELBO maximization with respect to a target Gaussian distribution, we have been able to find a Gaussian approximation of the pdf of the posterior.The learning framework comprises a neural-network representation of the dynamics of the parameters and a neural solver for the considered stochastic variational cost.Both solver and dynamics of the parameters are learnt jointly in a context of logarithmic score optimization.This joint learning process offers new perspectives for VB-based cost minimization in DA problems.
Lorenz 63 dynamics and discharges on Danube river networks have been studied.As regards the Lorenz dynamics, our approach captures well the dynamics and the uncertainty.When adding state-dependent model noise, we have been able to retrieve complex type of uncertainty structure.The experiments on the Danube river system provide a setting where the dynamics are unknown, and the data to estimate them incomplete.In this context, our approach allows us to calculate a consistent estimate of the flow, the associated dynamics and the uncertainties.
Our findings also underlines that beyond state-of-the-art results obtained for MSE of reconstruction, our approach is well-suited for logarithmic score.This is a real improvement over reference ensemble methods which suffer from limitations and require careful adaptation to achieve good performance on such score.This indicates that posterior approximation reached with our approach is more consistent than those provide by ensemble methods.
We claim that our approach could be applicable to problems of higher dimension thanks to the versatility of NNs, which could constitute interesting fields of application.Besides, future works will also focus on improving the accuracy of the upper quantile of the predicted distribution.A parameterization of the posterior by heavy tail distribution (see e.g., Resnick, 2007) could be an improvement track.Moreover, as discharges are positive values, a Gaussian parametrization is not ideal to infer uncertainties.More broadly, symmetrical distribution cannot consistently estimate large uncertainty in this problem as it could cover negative flow value.Extending our approach to non-symmetrical distribution would be of interest.
Finally, one limitation of our approach is the need for a data set of true states, which is generally not possible in practice.Thus, there is still significant room for further progress with respect to the application of such approach in operational settings.

Figure 1 .
Figure 1.Proposed end-to-end architecture.Illustration comes from L63 experiment.Given a partial observation piece of data   and an initial pdf state   (0) , the proposed network calculates the optimized parameters   () after   steps in the solver.On the right-hand side, red curve contains the mean state and the blue envelope is a rescaled visualization of the covariance.  () is the difference between the parameters at iteration step  () and at iteration step  ( − 1) .GENN stands for Gibbs Energy NN and ResNet for residual network.

Figure 2 .
Figure 2. Evolution of Lorenz dynamics for (a) standard model (see Equation 11) and (b) stochastic model of Chapron et al. (2018) (Equation12) for 200 time steps of 0.01 length each.

Figure 3 .
Figure 3. Experiments with standard Lorenz dynamics (Equation11).For a set of observations (cyan dots) on given timesteps (light blue dashes on the time axis), the true state (blue curve) and estimated state (orange curve) are plotted for our approach and Ensemble Kalman Smoother (EnKS) with one or all variables observed.The estimated 95% confidence intervals are represented by the green area.

Figure 5 .
Figure 5. Topographic map of the upper Danube basin with the 31 gauging stations.A data set of 50 years of daily measurements is considered (from 1960 to 2010).In training setting, we assume that some stations are observed (red dots) and the other are unobserved (black squares).We further assume that the observed stations have available observations only once every 4 days.

Figure 6 .
Figure6.For a summer month (July 2007), we show the estimated discharge (red curve), the 95% confidence interval (green area) estimated by our method for observed and unobserved stations at different elevations.The daily measurements are also represented according to whether they are available (light blue dots) or unavailable(deeper blue) as inputs.The discharges are expressed in m 3 /s.

Table 1
Scores of 4D-VarnetSto and Ensemble Kalman Smoother (EnKS) for L63 Simulations for Both Dynamics

Table 2
Scores of 4D-VarnetSto and Constant Covariance Approach for Rescaled Danube River Discharges