Sequential Markov Chain Monte Carlo for Lagrangian Data Assimilation with Applications to Unknown Data Locations

We consider a class of high-dimensional spatial filtering problems, where the spatial locations of the observations are unknown and driven by the partially observed hidden signal. This problem is exceptionally challenging as not only is it high-dimensional, but the model for the signal yields longer-range time dependencies through the observation locations. Motivated by this model we revisit a lesser-known and \emph{provably convergent} computational methodology from [4, 11, 28] that uses sequential Markov chain Monte Carlo (MCMC) chains. We extend this methodology to filtering problems with unknown observation locations. We benchmark our algorithms on linear Gaussian state space models against competing ensemble methods and demonstrate a significant improvement in both execution speed and accuracy. Finally, we implement a realistic case study on a high-dimensional rotating shallow water model (of about $10^4$-$10^5$ dimensions) with real and synthetic data. The data is provided by the National Oceanic and Atmospheric Administration (NOAA) and contains observations from ocean drifters in a domain of the Atlantic Ocean restricted to the longitude and latitude intervals $[-51^{\circ}, -41^{\circ}]$, $[17^{\circ}, 27^{\circ}]$, respectively.


Introduction
Consider a state-space model comprising two elements: an unobserved continuous-time stochastic process $(Z_t)_{t\ge 0}$ with $Z_t \in \mathbb{R}^d$, and a sequence of observations $(Y_{t_i})_{i\ge 1}$ taken at a sequence of known time instants $(t_i)_{i\ge 1}$. We will assume each $Y_{t_i}$ is a random variable depending on the evolution path of the unobserved process since the last time instant, $(Z_t)_{t\in[t_{i-1},t_i]}$. $(Z_t)_{t\ge 0}$ and $(Y_{t_i})_{i\ge 1}$ are combined in a joint stochastic model and the objective of filtering is to estimate the unobserved state $Z_t$ given the observations up to that time, $(Y_{t_i})_{t_i\le t}$. Such problems occur routinely in data assimilation applications such as numerical weather prediction, oceanography, finance and engineering; see [10, 19, 22] for example applications and [3, 8] for book-length introductions.
The problem of filtering is to approximate the conditional distribution of each $Z_{t_i}$ given $(Y_{t_p})_{p\le i}$, also known as the filtering distribution or simply the filter. This is a challenging task: in most cases of practical interest, with the exception of linear observation models and discrete-time, linear signal dynamics, the filter is not analytically available, and hence one often has to resort to numerical approximations. There is a plethora of techniques that have been presented in the literature, but perhaps the only two exact approaches (in the sense that the accuracy can be made arbitrarily high, subject to computational cost) are based upon particle filters (PFs) and Markov chain Monte Carlo (MCMC); see e.g. [3, 8, 13]. PFs and MCMC are simulation (or Monte Carlo) based approximations of the filter such that, as the number of samples grows, one can recover the exact filter. PFs are specifically designed for filtering and have a fixed computational complexity per time-step update.
Whilst traditional iterative MCMC can be used for filtering at a fixed time, the cost at each filtering update grows linearly in the time index, so it is not often used for this task. Instead, if one wishes to explore this as an alternative direction, a more practical approach is to use sequential MCMC chains that target the filter at each time and also use the filter update equations. Motivated by the challenges faced in high-dimensional filtering and Lagrangian data assimilation, we revisit a less popular computational methodology initially proposed in [4]. We note that this method is provably convergent [28] and has been applied successfully to challenging filtering problems in the past, such as point process models [11, 28] and group tracking [9, 26].
The problem of high-dimensional filtering is even more challenging. For instance, in numerical weather prediction models $d$ is in the order of millions. Unfortunately, simple or standard designs of PFs require a cost that is exponential in the dimension $d$ to achieve a given precision and hence are impractical for very high-dimensional applications. Several methods [5, 6, 7, 23, 31, 32] based upon a combination of sequential Monte Carlo (SMC) (e.g. [13, 14]) and MCMC can be adopted. For suitably designed classes of models, they have been shown, both mathematically and empirically, to be able to deal with high-dimensional filtering at a cost that is polynomial in the dimension of the problem. We note, however, that these methods are not a universal solution to the high-dimensional filtering problem, due to the substantial computational cost involved. Several approximate (but biased) methods such as the ensemble Kalman filter (EnKF), ensemble transform Kalman filter (ETKF), error-subspace transform Kalman filter (ESTKF) and their localized versions can be found in the literature; despite being biased, they are the most popular and widely used methods for high-dimensional filtering, due to their very low computational cost relative to SMC-MCMC methods.
Motivated by problems in oceanography, we consider an even more complex high-dimensional filtering problem. In this case the observers travel on a spatial domain, such as a two-dimensional grid, and their location is unknown and driven, deterministically, by the signal $(Z_t)_{t\ge 0}$ that is used to model the velocity field (among other variables) on the domain of interest. This is also known in the literature as Lagrangian data assimilation, and this observation mechanism with unknown observer locations induces an extra level of complexity relative to the traditional data assimilation problem. The dependence of the spatial observation locations on the signal yields a long-range (in time) dependence on the signal that is not present in the classical filtering problem; the details are given in Section 2. Treating the positions of the observers as unknowns is a new challenge for filtering that has not been addressed in earlier works, and the ensemble methods mentioned above will not be accurate for this new and interesting problem. EnKF-type methods are known to struggle with nonlinearities induced by Lagrangian observers even when the locations of the drifters are well known [1], and the situation is much worse in the unknown-location case. This was confirmed by extensive preliminary numerical simulations leading to this paper, which in addition showed that the computational cost required to implement SMC methods is very high due to the large number of tempering steps required. This motivated extending a sequential MCMC method developed for the filtering of point processes [11] to this new type of filtering problem. The method of [11] has been shown in [28] to be a theoretically justified method for filtering (in the sense that the estimate converges to the true filter as the number of samples grows to infinity) and seems to be particularly amenable to the filtering problem in high dimensions, with a cost per update step depending on the cost of evaluating the transition density of $Z_t$; e.g. for a homogeneous Gaussian transition density this is between $O(d)$ and $O(d^2)$ depending on the covariance matrix structure. Also, in the case of a regular grid and stationary Gaussian noise, the cost can be $O(d\log(d))$ even for dense covariance matrices, by exploiting the Toeplitz structure of the matrices involved in the calculations and using the fast Fourier transform (FFT); see e.g. [15]. The known convergence guarantees cover the algorithm described later in the setting of known observation locations. In the case of unknown locations, the algorithm includes an extra approximation to address the dependence of the observer locations and the next target distribution on the random evolution of the state dynamics. Thus, for this case, a complete analysis warrants further work that is beyond the scope of the current paper.
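To make the FFT remark concrete, the following minimal sketch (ours, not the implementation used later in the paper) applies a symmetric Toeplitz covariance matrix to a vector in $O(d\log d)$ via circulant embedding; `c` denotes the first column of the covariance matrix.

```python
# Sketch: O(d log d) Toeplitz matrix-vector product via circulant embedding.
import numpy as np

def toeplitz_matvec(c, v):
    """Product T v, where T is the symmetric Toeplitz matrix with first column c."""
    d = len(c)
    circ = np.concatenate([c, c[-2:0:-1]])            # embed T in a (2d-2) circulant
    v_pad = np.concatenate([v, np.zeros(d - 2)])      # zero-pad v to match
    w = np.fft.irfft(np.fft.rfft(circ) * np.fft.rfft(v_pad), n=len(circ))
    return w[:d]                                      # first d entries give T v
```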
The contributions of this article are as follows: 1. We formulate a new filtering problem with spatial Lagrangian observations at unknown locations. We develop, based upon [11], a generic sequential MCMC method that is effective for high-dimensional and nonlinear models. We will often use the abbreviation SMCMC for this method in the sequel.
2. We demonstrate the performance of this method in two ways. First, we use tractable linear Gaussian state space models and compare, in terms of accuracy and execution speed, with ensemble Kalman filter methods (EnKF, ETKF, ESTKF and localized EnKF), showing a significant improvement, especially when higher accuracy is required. Then we present a challenging, realistic, high-dimensional data assimilation case study for which ensemble methods fail due to the nonlinearities present in the model and observation scheme. We use a rotating shallow water model and observations obtained both at known and unknown spatial locations. Our simulation scenarios are realistic and constructed using real data from NOAA.
This article is structured as follows. Section 2 describes the class of models considered in this article. Section 3 presents the algorithms adopted for our problem of interest. Section 4 demonstrates the methodology on several examples. Finally, in Section 5 we conclude the paper with a discussion.

Modelling Preliminaries
We consider an unknown continuous-time stochastic process $(Z_t)_{t\ge 0}$, with $Z_t \in \mathsf{Z} \subseteq \mathbb{R}^d$, for which one has access to partial information via observations arriving at times $\mathsf{T} = \{t_1, t_2, \dots\}$. At any given time $t_k \in \mathsf{T}$ we assume there is a fixed number of $d_y \in \mathbb{N}$ observations obtained from $N_d \in \mathbb{N}$ observers or drifters, with the $j$th drifter's observation denoted by $y^j_{t_k} \in \mathsf{Y}$ and all observations collected in the vector $y_{t_k} \in \mathsf{Y}^{N_d} \subseteq \mathbb{R}^{d_y}$. These observations are taken at spatial locations $x^j_{t_1}, x^j_{t_2}, \dots, x^j_{t_k} \in \mathsf{X} \subseteq \mathbb{R}^s$, where $s$ is typically 2 or 3. The collection of the spatial locations of all observers at an observation time $t_k \in \mathsf{T}$ is written as the vector $x_{t_k} = (x^1_{t_k}, \dots, x^{N_d}_{t_k}) \in \mathsf{X}^{N_d}$.

We adopt a continuous-time modelling approach for $Z_t$, motivated by applications such as the atmosphere and ocean sciences. In these areas, physical quantities such as wind or water velocity are modelled by continuous-time, space-varying physical models comprising systems of partial differential equations (PDEs). To allow for model uncertainty we need to incorporate stochastic dynamics for $Z_t$, for which the noise can be added either continuously (as in stochastic PDEs) or discretely in time (e.g. see Example 2.1). Our framework requires that, at the discrete time instances in $\mathsf{T}$, $(Z_{t_k})_{k\ge 1}$ forms a discrete-time Markov process with a known and well-defined (positive) transition density. For $A \subseteq \mathsf{Z}$ we shall assume that the transition dynamics for $(Z_t)_{t\ge 0}$ can be obtained as
$$\mathbb{P}(Z_{t_k} \in A \mid Z_{t_{k-1}} = z_{t_{k-1}}) = \int_A f_k(z_{t_{k-1}}, z_{t_k})\, dz_{t_k},$$
where $\mathbb{P}$ denotes probability. In the notation for $f_k$, the subscript is included to account for a possible time-inhomogeneous structure of $(Z_t)_{t\ge 0}$, or dependence on the time increments, and $Z_0$ is taken as known. We will assume that, at the very least, $f_k(z_{t_{k-1}}, z_{t_k})$, or a suitable approximation of it, can be evaluated pointwise; examples include stochastic differential equations (SDEs) and their various time-discretization approximations, and similarly stochastic PDEs or PDEs with discrete-time additive noise (see Example 2.1 below).
Example 2.1 (PDE with discrete-time additive spatial noise). We present an example which will be considered often in the paper. We consider $Z_t$ to be a vector containing the hidden state variables at positions defined on a bounded domain with known boundary conditions. Consider $\mathsf{Z} = \mathbb{R}^d$. Let $\Phi(z_s, s, t)$, $\Phi : \mathsf{Z} \times (\mathbb{R}^+)^2 \to \mathsf{Z}$, be the solution of a PDE with initial condition $z_s$, run from time $s$ to $t$, with $0 \le s < t$. Then, an example of our model would be, for $k \in \mathbb{N}$,
$$Z_{t_k} = \Phi(Z_{t_{k-1}}, t_{k-1}, t_k) + W_{t_k},$$
where $(W_{t_k})_{k\ge 1}$ are independent Gaussian random variables of zero mean and covariance matrix $Q$. In such scenarios, the process in continuous time is a PDE that is perturbed by noise at the discrete times at which the observations arrive, and $f_k$ is a Gaussian density of mean $\Phi(z_{t_{k-1}}, t_{k-1}, t_k)$ and covariance matrix $Q$. Note that, in practice, one may have to replace $\Phi$ with a numerical approximation of the solution of the PDE.
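As an illustration, a minimal sketch of simulating and evaluating $f_k$ in this example is given below; here `Phi` is a hypothetical stand-in for the PDE solver and a diagonal $Q$ is assumed purely for simplicity.

```python
# Sketch of Example 2.1: PDE flow plus discrete-time additive Gaussian noise.
import numpy as np
from scipy.stats import multivariate_normal

def Phi(z, s, t):
    # Hypothetical stand-in for the deterministic PDE flow from time s to t.
    return z * np.exp(-(t - s))

def sample_transition(z_prev, s, t, Q_diag, rng):
    """Draw Z_{t_k} = Phi(Z_{t_{k-1}}, t_{k-1}, t_k) + W_{t_k}, W ~ N(0, Q)."""
    return Phi(z_prev, s, t) + rng.normal(0.0, np.sqrt(Q_diag))

def log_f(z_prev, z_new, s, t, Q_diag):
    """Pointwise evaluation of the Gaussian transition density f_k."""
    mean = Phi(z_prev, s, t)
    return multivariate_normal.logpdf(z_new, mean=mean, cov=np.diag(Q_diag))

rng = np.random.default_rng(0)
z0 = np.zeros(4)
Q_diag = np.full(4, 0.05**2)
z1 = sample_transition(z0, 0.0, 1.0, Q_diag, rng)
print(log_f(z0, z1, 0.0, 1.0, Q_diag))
```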

The Standard Lagrangian Observation Model
We will assume that each observation vector $Y_{t_k}$ depends only on $Z_{t_k}$ and that there is a positive conditional likelihood density, i.e. for $k \in \mathbb{N}$,
$$Y_{t_k} \mid \{Z_{t_k} = z_{t_k}\} \sim g_k(z_{t_k}, \cdot). \quad (1)$$
Note that in the standard model we discuss here, the positions of the drifters are assumed known and, for brevity, their effect is captured simply by the subscript $k$ in $g_k$.
Filtering and Smoothing. Inference for the hidden state is performed using conditional distributions given the available data. One can either consider the whole path trajectory or just the marginal at the current time. For $k \in \mathbb{N}$ we define the smoothing density
$$\Pi_k(z_{t_1}, \dots, z_{t_k}) \propto \prod_{p=1}^{k} g_p(z_{t_p}, y_{t_p})\, f_p(z_{t_{p-1}}, z_{t_p}). \quad (2)$$
For $k \in \mathbb{N}$, we are interested in estimating expectations with respect to (w.r.t.) the filtering distribution (or simply the filter),
$$\pi_k(\varphi) := \int_{\mathsf{Z}} \varphi(z_{t_k})\, \pi_k(z_{t_k})\, dz_{t_k},$$
where $\pi_k(z_{t_k})$ is the marginal in the $z_{t_k}$ coordinate of the smoothing distribution and $\varphi : \mathsf{Z} \to \mathbb{R}$ is a $\pi_k$-integrable function. It easily follows that the filter density can be obtained recursively via the prediction-updating formula
$$\pi_k(z_{t_k}) \propto g_k(z_{t_k}, y_{t_k}) \int_{\mathsf{Z}} f_k(z_{t_{k-1}}, z_{t_k})\, \pi_{k-1}(z_{t_{k-1}})\, dz_{t_{k-1}}. \quad (3)$$
We note that the filtering problem is discrete-time in nature, due to the observations arriving at discrete times. One can still obtain $\mathbb{P}(Z_t \mid (X_{t_p})_{p\le k}, (Y_{t_p})_{p\le k})$ for $t \in (t_k, t_{k+1})$ by integrating $\pi_k$ with the corresponding transition density (and applying the Chapman-Kolmogorov equations).

State Space Models with Unknown Observer Locations
We proceed to extend the observation model to allow the spatial locations at which observations are obtained to depend on the state process. For example, in Lagrangian data assimilation in oceanography, the unknown ocean velocities directly affect the drifters' motion. In particular, we now assume that for $j \in \{1, \dots, N_d\}$,
$$x^j_{t_k} = x^j_{t_{k-1}} + \int_{t_{k-1}}^{t_k} h(x^j_s, Z_s)\, ds, \quad (5)$$
for some function $h : \mathsf{X} \times \mathsf{Z} \to \mathsf{X}$. We then modify the observation process, as originally considered in (1), to
$$Y_{t_k} \mid \{(Z_{t_k}, X_{t_k}) = (z_{t_k}, x_{t_k})\} \sim G((z_{t_k}, x_{t_k}), \cdot),$$
where $G((z, x), \cdot)$ is a probability density on $\mathsf{Y}^{N_d}$; we note that $G$ can depend upon the time parameter, but we do not add this to the notation, for simplicity and to distinguish from the known-location case. This observation model requires simulation of $x_{t_k}$ in parallel to $z_{t_k}$. Note that $(x_{t_k})_{k\ge 1}$ is a deterministic function of $(z_t)_{t\le t_k}$ and does not contain any additional information, but it is required for the purpose of computing $G$ and needs to be propagated in the recursions for the dynamics. Compared to the model in (1), one has here that $G((z_{t_k}, x_{t_k}), \cdot)$ depends on the whole path of the signal up to time $t_k$ through $x_{t_k}$.

Filtering and Smoothing. The filtering and smoothing recursions are analogous to the known-observation-location case and (2)-(3) in Section 2.1, with $G$ replacing $g_k$.

Discussion on the Choice of the Computational Methodology
It should be clear from the previous discussion that, through the presence of $x_{t_k}$, the likelihood function $G((z_{t_k}, x_{t_k}), \cdot)$ depends upon the complete path realization of the hidden process $(Z_s)_{0\le s\le t_k}$ up to time $t_k$. This additional long-range dependence is the barrier to using some of the existing particle filtering methods listed in the introduction. To explain this further, we first consider the case where one augments the state of the state-space model with $(x_{t_k})_{k\ge 1}$ and applies a standard PF, which is a sequential simulation method that propagates samples by combining importance sampling for (2) with resampling steps. Even for $d = 1$ the standard PF, which is generally a reasonable method in that context, would be destined to fail, due to the well-known path degeneracy phenomenon [24]. This is a consequence of successive resampling steps that cause a lack of diversity in the samples representing the complete path $(Z_s)_{0\le s\le t_k}$ and approximating the corresponding smoothing density $\Pi_k$. On the other hand, PFs can approximate the filter $\pi_k$ very accurately; this relies on the stability properties of the filter itself, which in turn requires reasonable mixing in the hidden state and a non-degenerate likelihood function $G$ [13]. When augmenting hidden state dynamics with static parameters and implementing PFs, the deterministic components in the state vector cannot mix. Even when MCMC kernels are used as artificial dynamics, they will depend on the complete path of the hidden state, and path degeneracy will result in high Monte Carlo variance [24]. For this model, with the deterministic dynamics $x_{t_k}$, an online PF implementation (with fixed computational cost per time step) will suffer from both issues described: path degeneracy and lack of stochastic mixing. This is due to $(X_t)_{t\ge 0}$ being included in the information of $(Z_t)_{t\ge 0}$, and it justifies a separate treatment and algorithm design for this class of state-space models. A more detailed theoretical study of the filter properties for this model and their consequences for numerical methods is left for future work. However, there are some exceptions where the joint dynamics $(X_t, Z_t)_{t\ge 0}$ may display adequate mixing in continuous time, as in some hypo-elliptic SDEs that can result in an ideal $f_k$ with sufficient mixing. Note that when time discretization is performed in these models, the mixing in $(X_t)_{t\ge 0}$ will deteriorate or vanish unless recent advanced numerical schemes are used, e.g. as in [20, 21, 29].

An alternative course of action is to use methods that aim to approximate $\pi_k$ without using importance sampling and resampling for $\Pi_k$ and (2). The method in [32], which was designed for high-dimensional filtering, was our first attempt to perform filtering for this model. The method transports particles from a variant of (2) that only considers the path between $t_{k-L+1}$ and $t_k$, for a small lag $L$, and thus bypasses the degeneracy issues by introducing a moderately low bias. However, we found in extensive preliminary simulations that the (computational) time required to run such a methodology was significant, and we aimed to lower the computational cost. Our investigation focused on efficiently approximating (3) directly via a sequence of MCMC chains initialized from a previously obtained approximation of $\pi_{k-1}$. At time $t_1$ the algorithm targets $\pi_1(z_{t_1})$ exactly, by running an MCMC kernel with invariant distribution $\pi_1(z_{t_1})$ for $N$ steps; at later times $t_k$, $k \ge 2$, it targets an approximation of the filter distribution obtained by replacing $\pi_{k-1}$ in (3) with the empirical measure
$$\pi_{k-1}^N(z_{t_{k-1}}) = \frac{1}{N}\sum_{i=1}^{N} \delta_{\{Z^{(i)}_{t_{k-1}}\}}(z_{t_{k-1}}),$$
where $\delta_{\{Z_{t_{k-1}}\}}(z)$ is the Dirac delta measure centered at $Z_{t_{k-1}}$, and $Z^{(1)}_{t_{k-1}}, \dots, Z^{(N)}_{t_{k-1}}$ are the MCMC samples obtained in the preceding time step, with the superscripts denoting the MCMC iteration number. The method proved particularly effective and efficient in high-dimensional problems and is trivially parallelizable.
Sequential MCMC for Lagrangian Data Assimilation

Standard State-Space Models
We begin by detailing the SMCMC method in [11] in the case of the standard state-space model (with the positions of the drifters assumed to be known). Our setting differs from the one for which the method was originally introduced. The approach is based upon the well-known prediction-updating structure given in (3).
The idea is to initiate the method with an MCMC algorithm associated to an MCMC kernel of invariant measure
$$\pi_1(z_{t_1}) \propto g_1(z_{t_1}, y_{t_1})\, f_1(z_0, z_{t_1}).$$
Note that we have assumed here, for simplicity, that $z_0$ is fixed and known. There are many such kernels, and in this article we focus exclusively on the random walk Metropolis (RWM) kernel. Such a kernel costs $O(d)$ per step and requires one to be able to compute $f$ (or an approximation thereof). We note that RWM is a standard and popular generic choice of flexible kernel, and in the numerical implementations shown in the sequel, this choice indeed provided an effective overall algorithm. The MCMC kernel is run for $N$ steps, producing $N$ samples and an $N$-sample approximation of $\pi_1(\varphi)$ given by
$$\pi_1^N(\varphi) = \frac{1}{N}\sum_{i=1}^{N} \varphi(Z^{(i)}_{t_1}),$$
where $Z^{(i)}_{t_1}$ is the $i$th sample of the Markov chain. At a subsequent time point $k \ge 2$ we use (3). If one has access to an $N$-sample approximation of $\pi_{k-1}(z_{t_{k-1}})$, then instead of sampling from $\pi_k(z_{t_k})$ directly, which can be impossible, one can consider the approximate target density
$$\pi_k^N(z_{t_k}) \propto g_k(z_{t_k}, y_{t_k})\, \frac{1}{N}\sum_{i=1}^{N} f_k(Z^{(i)}_{t_{k-1}}, z_{t_k}). \quad (7)$$
One can then use any MCMC kernel (e.g. RWM) with invariant measure $\pi_k^N$. Running the kernel for $N$ steps yields an $N$-sample approximation of $\pi_k(\varphi)$ as
$$\pi_k^N(\varphi) = \frac{1}{N}\sum_{i=1}^{N} \varphi(Z^{(i)}_{t_k}).$$
The method is summarized in Algorithm 1 (see Appendix A for a detailed pseudocode). We note that in our implementations, we initialize the MCMC chain (at time $k \ge 2$) by picking one of the samples from $\pi_{k-1}^N$ uniformly at random. The convergence of Algorithm 1 was first discussed in [28]; [28, Proposition 1] gives that $\pi_k^N(\varphi) \to \pi_k(\varphi)$ as $N \to \infty$, hence almost sure convergence of the estimators $\pi_k^N(\varphi)$ holds. It is important that the MCMC kernel possesses effective ergodicity properties: the better the kernel mixes, the better the approximations $\pi_k^N$ will be. The method as presented in Algorithm 1 has a cost of $O(\kappa(d)N^2 + \kappa_y(d_y)N)$ per update step, where $\kappa(d)$ and $\kappa_y(d_y)$ are the costs of computing $f_k$ and $g_k$, respectively. The $O(N^2)$ part of the cost can easily be reduced, even to $O(N)$, using subsampling of the samples used in (7).
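For concreteness, a minimal sketch of one SMCMC update step with an RWM kernel is given below. This is our illustration, not the paper's exact implementation: `log_f` and `log_g` stand for $\log f_k$ and $\log g_k$, the proposal step size is assumed tuned offline, and this is the $O(N^2)$ variant, before the auxiliary-variable speed-up of Appendix A.

```python
# Sketch of one SMCMC update (Algorithm 1): RWM targeting pi_k^N in (7).
import numpy as np
from scipy.special import logsumexp

def smcmc_update(prev_samples, y, log_f, log_g, n_steps, step_size, rng):
    """RWM chain invariant w.r.t. pi_k^N(z) ∝ g_k(z, y) * (1/N) sum_i f_k(Z_i, z)."""
    def log_target(z):
        # The 1/N constant cancels in the Metropolis ratio, so it is omitted.
        return log_g(z, y) + logsumexp([log_f(zp, z) for zp in prev_samples])

    # Initialize from one of the previous samples, chosen uniformly at random.
    z = prev_samples[rng.integers(len(prev_samples))].copy()
    lp, out = log_target(z), []
    for _ in range(n_steps):
        prop = z + step_size * rng.standard_normal(z.shape)  # RWM proposal
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:             # accept/reject
            z, lp = prop, lp_prop
        out.append(z.copy())
    return out  # N-sample approximation of pi_k
```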

State-Space Models with Unknown Observation Locations
The approach detailed in the previous section is not straightforward to extend to the model with unknown observer locations in Section 2.2. The first issue is that the integral in (5) is generally intractable. One can replace the time integral by a simple Euler approximation with step size $\tau_k = (t_k - t_{k-1})/L$, for some $L \in \mathbb{N}$, giving the recursion
$$x^j_{t_{k-1}+(l+1)\tau_k} = x^j_{t_{k-1}+l\tau_k} + \tau_k\, h(x^j_{t_{k-1}+l\tau_k}, z_{t_{k-1}+l\tau_k}), \quad l \in \{0, \dots, L-1\}, \quad (8)$$
which still depends on a discrete path of the unobserved process. Note that more complex time-discretization schemes can be used to improve numerical stability and accuracy relative to the simple Euler approximation.
The next issue has to do with setting an appropriate target distribution for the MCMC sampler, analogous to (7). At time $t_{k-1}$, we have simulated samples $\{Z^{(m)}_{t_{k-1}}\}_{m=1}^{N}$ to plug into such an expression, but the likelihood $G$ requires setting a single value for $x_{t_{k-1}}$ (as we can only have one target distribution for the MCMC chain). So we make one final approximation and use a predicted value $\widehat{x}^j_{t_k}$ for each drifter, namely the expectation taken w.r.t. the law of $X^j_{t_k}$, which can be approximated using plain Monte Carlo by propagating the dynamics of $Z_t$ jointly with (8). This provides an approximation of the spatial locations for the next step without the need for elaborate simulation methods (like MCMC) that consider all of these discretization points. The main requirement is, as before, that the dynamics of $Z_t$ can be sampled, at least up to an approximation. To illustrate this, we present the procedure in Example 3.1 below for the case of Example 2.1; a schematic sketch of this prediction step is also given below.
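The following minimal sketch illustrates the prediction step just described (our illustration: `h` is the kinematic function from (5), and `sample_dynamics` is assumed to propagate the signal over one Euler sub-step).

```python
# Sketch: predicted drifter locations by Monte Carlo over the signal dynamics.
import numpy as np

def predict_locations(z_samples, x_prev, h, sample_dynamics, L, tau, rng):
    """z_samples: list of N signal states at t_{k-1}; x_prev: (N_d, s) positions.
    Returns the Monte Carlo estimate of E[X_{t_k}] used in the MCMC target."""
    acc = np.zeros_like(x_prev)
    for z in z_samples:
        x = x_prev.copy()
        for _ in range(L):
            x = x + tau * h(x, z)              # Euler step for the kinematics (8)
            z = sample_dynamics(z, tau, rng)   # propagate the signal jointly
        acc += x
    return acc / len(z_samples)               # average over the N signal copies
```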
We now proceed to present the analog of (7), appealing to the prediction-updating structure in (3). The target distribution for the $k$th MCMC procedure will be
$$\bar{\pi}_k^N(z_{t_k}) \propto G((z_{t_k}, \widehat{x}_{t_k}), y_{t_k})\, \frac{1}{N}\sum_{i=1}^{N} f_k(Z^{(i)}_{t_{k-1}}, z_{t_k}). \quad (9)$$
Running the MCMC kernel for $N$ steps yields an $N$-sample approximation of $\bar{\pi}_k(\varphi)$. The procedure is summarized in Algorithm 2.
We remark that Algorithm 2 has an intrinsic bias, as (asymptotically in $N$) it approximates $\bar{\pi}_k$ and not the true filter. When the cost of the MCMC kernel is $O(\kappa(d))$ per iteration, the cost of Algorithm 2 is $O(\kappa(d)N^2 + LdN)$ per time step, and the quadratic cost in $N$ can be removed as discussed previously, to obtain a cost of $O((\kappa(d) + Ld)N)$.

Example 3.1 (Example 2.1 continued). To compute $\widehat{x}^j_{t_1}$, this simply comprises running the dynamics, for $(l, r) \in \{0, \dots, L-2\} \times \{1, \dots, N\}$,
$$Z^{(r)}_{t_0+(l+1)\tau_1} = \Phi\big(Z^{(r)}_{t_0+l\tau_1},\, t_0+l\tau_1,\, t_0+(l+1)\tau_1\big), \quad (10)$$
and then (in parallel if possible) computing $x^{j,(r)}_{t_1}$ using the recursion, for $(l, j, r) \in \{0, \dots, L-1\} \times \{1, \dots, N_d\} \times \{1, \dots, N\}$,
$$x^{j,(r)}_{t_0+(l+1)\tau_1} = x^{j,(r)}_{t_0+l\tau_1} + \tau_1\, h\big(x^{j,(r)}_{t_0+l\tau_1},\, Z^{(r)}_{t_0+l\tau_1}\big), \quad (11)$$
with initial condition $x^{j,(r)}_{t_0} = x^j_{t_0} = x^j_0$, and setting
$$\widehat{x}^j_{t_1} = \frac{1}{N}\sum_{r=1}^{N} x^{j,(r)}_{t_1}.$$
Then one runs an MCMC kernel with invariant measure
$$\bar{\pi}_1(z_{t_1}) \propto G((z_{t_1}, \widehat{x}_{t_1}), y_{t_1})\, f_1(z_0, z_{t_1}).$$
The MCMC kernel is run for $N$ steps and an $N$-sample proxy of $\bar{\pi}_1(\varphi)$ is
$$\bar{\pi}_1^N(\varphi) = \frac{1}{N}\sum_{i=1}^{N} \varphi(Z^{(i)}_{t_1}),$$
where $Z^{(i)}_{t_1}$ is the $i$th sample of the MCMC chain. At a subsequent time point $k \ge 2$, we now need to approximate the recursion (9). This can be achieved by running, for $(l, r) \in \{0, \dots, L-2\} \times \{1, \dots, N\}$,
$$Z^{(r)}_{t_{k-1}+(l+1)\tau_k} = \Phi\big(Z^{(r)}_{t_{k-1}+l\tau_k},\, t_{k-1}+l\tau_k,\, t_{k-1}+(l+1)\tau_k\big), \quad (13)$$
started from the MCMC samples $Z^{(r)}_{t_{k-1}}$, and then, for $(l, j, r) \in \{0, \dots, L-1\} \times \{1, \dots, N_d\} \times \{1, \dots, N\}$,
$$x^{j,(r)}_{t_{k-1}+(l+1)\tau_k} = x^{j,(r)}_{t_{k-1}+l\tau_k} + \tau_k\, h\big(x^{j,(r)}_{t_{k-1}+l\tau_k},\, Z^{(r)}_{t_{k-1}+l\tau_k}\big), \quad (14)$$
with initial condition $x^{j,(r)}_{t_{k-1}} = \widehat{x}^j_{t_{k-1}}$, before finally obtaining
$$\widehat{x}^j_{t_k} = \frac{1}{N}\sum_{r=1}^{N} x^{j,(r)}_{t_k}. \quad (15)$$

Numerical Simulations
We now illustrate the performance of Algorithms 1 and 2 in four different cases:

1. A tractable linear Gaussian model that is fully observed. The implementation of Algorithm 1 has not previously been investigated for high-dimensional filtering problems, and we will show both efficiency and accuracy compared to competing ensemble methodologies. In contrast to how practitioners use the latter, we use moderately high ensemble sizes (500-$10^4$), noting that for this model these methods are consistent and increasing the ensemble size improves accuracy.

2. A tractable linear Gaussian model that is partially observed on a two-dimensional grid. In this model the dimension of the data is smaller than that of the unobserved hidden state. Here, we compare the SMCMC method against a local version of the ensemble Kalman filter, where the physical domain is partitioned into subdomains and local updates are performed independently (in parallel) on each subdomain.
3. A rotating shallow-water model observed at known locations. We use NOAA data to set the initial conditions and boundaries, and then simulate $Z_t$, $x_t$ to provide observations. The point is to assess Algorithm 1 using synthetic drifter locations and observations, but we note that the simulation scenario is set using real data from NOAA, to make the case study as realistic as possible.
4. A rotating shallow-water model observed at unknown locations. We use real data for the observer positions and velocities, and show that Algorithm 2 is effective at estimating both the unknown velocity fields and the observer locations.
It is worth noting that in Cases 1 and 2 the true filter is known and is obtained through the Kalman filter (KF). In Cases 3 and 4 the true filter is unknown; therefore, in these two cases we compare our results to a reference, chosen to be the hidden signal used to generate the observations in Case 3, and the prior distribution in Case 4, estimated using 50 different simulations of the shallow-water dynamics with noise, as described below. The convergence of the MCMC iterations is checked using standard diagnostics (trace and autocorrelation function plots, Gelman-Rubin diagnostic values), which, for the sake of brevity, are not reported here.

Linear-Gaussian Model: SMCMC vs Ensemble Methods
We begin with Algorithm 1 for a linear Gaussian model in discrete time (recall that this is related to the standard model described in Section 2). Consider the model, for $n \in \{1, \dots, T\}$ and $m \in \mathbb{N}$ with $mL \le T$,
$$Z_n = A Z_{n-1} + W_n, \qquad Y_{mL} = C Z_{mL} + V_{mL}, \quad (16)$$
where $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_z^2 I_d)$, $V_{mL} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_y^2 I_{d_y})$, $T \in \mathbb{N}$, and $L \in \mathbb{N}$ is the time frequency at which the system is observed. $A \in \mathbb{R}^{d \times d}$ is a square matrix whose maximum eigenvalue is less than or equal to one, and $C \in \mathbb{R}^{d_y \times d}$ is defined as in (17), where $r$ is the spatial frequency at which the coordinates of $Z_{mL}$ are observed (e.g. if $r = 3$, we only observe the 3rd, 6th, ... coordinates of $Z_{mL}$). We will compare Algorithm 1 with the means of the KF, EnKF, ETKF and ESTKF; see [33] for the pseudo-codes. For the EnKF we use the matrix inversion lemma (Sherman-Morrison-Woodbury formula) when $d_y$ is larger than the ensemble size (see [27, 30] for more details). In the implementation of the KF, the computation of the Kalman gain involves inverting the covariance matrix of the pre-fit residual. It is important to note that, for all the ensemble methods mentioned in this article, we do not directly compute the inverse of a matrix if it is being multiplied from the left or right by another matrix or vector. Rather, we reformulate the problem as finding the solution(s) of (a) linear system(s): for example, if we have $X = A^{-1}B$ (or $X = BA^{-1}$), we solve for $X$ in $AX = B$ (or in $A^T X^T = B^T$), as sketched below.
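The following minimal sketch shows the linear-system reformulation just described (the variable names and sizes are our illustration):

```python
# Sketch: replace explicit matrix inversion with linear solves.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200)) + 200.0 * np.eye(200)  # well-conditioned A
B = rng.standard_normal((200, 50))
B2 = rng.standard_normal((50, 200))

X1 = np.linalg.solve(A, B)          # X = A^{-1} B   via   A X = B
X2 = np.linalg.solve(A.T, B2.T).T   # X = B2 A^{-1}  via   A^T X^T = B2^T
```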

Simulation Settings
For the simulations we set $T = 500$, $L = 1$, $r = 1$ (the system is fully observed) and $\sigma_z = \sigma_y = 0.05$, and then compare the algorithms for different values of the state dimension $d$. The absolute errors are defined as the absolute difference between the mean of the KF and the means of the other filters, at every time step $n$ and every state variable. We record the machine time once 70% of the errors are below $\sigma_y/2$. We note that there are other possible metrics that could be used to evaluate the performance of the methods; the metric chosen here is to log the machine time when 70% of the absolute errors fall below a threshold value, which we take to be $\sigma_y/2$ (other threshold values are possible).
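The stopping rule underlying this metric can be sketched as follows (our illustration; the array shapes are assumptions):

```python
# Sketch of the evaluation metric: log the machine time once 70% of the
# absolute errors |filter mean - KF mean| fall below sigma_y / 2.
import numpy as np

def fraction_below(filter_means, kf_means, sigma_y):
    """filter_means, kf_means: arrays of shape (T + 1, d)."""
    errs = np.abs(filter_means - kf_means)
    return np.mean(errs < sigma_y / 2.0)

# e.g. stop the timer when fraction_below(...) >= 0.70
```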

Results
Table 1 displays the numerical results of our implementation. The table shows the state dimensions, the percentage of the absolute errors below $\sigma_y/2$, the size of the ensemble/number of particles (plus $N_{\text{burn}}$ for SMCMC (Algorithm 1)), the number of independent simulations that were run, the machine time, and the ratio of the machine time of the ensemble methods w.r.t. SMCMC.

Remark 4.1. Simulations were run on a Precision 7920 Tower workstation with 52 cores and 512GB of memory. We note that in our method matrix multiplication is very limited and there is no matrix inversion. Each run of any of the ensemble methods is allowed to use all 52 cores for matrix multiplication through multi-threading, while each run of our method only uses one core; however, we run 26 independent simulations of SMCMC (for the model in Section 4.1) in parallel and then average the results.
In Figure 1, we plot the machine times for each method. In Figure 2, we fix the ensemble/particle size to $N = 1000$. We choose $N$ this way so that at least 50% of the absolute errors of each method are less than $\sigma_y/2$. The plot shows that, even when the ensemble size is low and fixed, the SMCMC method still dominates, especially as the state dimension grows to high values. The results indicate that Algorithm 1 is superior to several established data assimilation methods, particularly in high-dimensional scenarios. For example, when $d = 16000$, achieving a success rate of 72% in reducing errors below $\sigma_y/2$ using the EnKF is almost five times as expensive as using the SMCMC method. Additionally, the cost of achieving the same level of performance is almost eight times greater for the ETKF and ESTKF methods. These findings provide clear evidence for the superior performance of Algorithm 1 (when compared to non-localized ensemble methods) in high-dimensional data assimilation problems.

Linear-Gaussian Model: SMCMC vs Local EnKF Method
Consider the same model as in (16); here we compare the SMCMC method against a local version of the EnKF method. Typically, localization is implemented in two ways: explicitly, by considering only observations within a specific region surrounding the analysis location, or implicitly, by modifying the analysis state so that observations beyond a certain distance do not impact it. The application of localization is an ongoing area of research, and numerous variations of localization schemes have emerged in the past decade (see [33] and the references therein). The localization technique used here is an explicit technique referred to as R-localization ($R$ refers to the observation error covariance matrix). The basic idea of R-localization is not to use all the grid points when updating the ensemble members, but to perform a domain localization where the domain $D$ is partitioned into $\Gamma$ subdomains $\{D_i\}_{i=1}^{\Gamma}$, and then to only use the observations within a specified distance (the localization radius) around the local domain $D_i$. This defines a linear transformation for each $D_i$, which maps the observation error covariance matrix, the global observation vector $Y_m$ and the global observation matrix $C$ to their local parts. Then, an observation localization is performed by modifying the observation error covariance matrix so that the inverse observation variance decreases to zero with the distance of an observation from an analysis grid point. The modification is done by dividing the diagonal elements of $R$ by weights that are a function (e.g. a Gaspari-Cohn function or an exponential function) of the distance of the observations from the analysis grid point (for more details see [33]); a sketch is given below. The updates on the local domains can be done independently and therefore in parallel. Note that the model does not require enforcing spatial smoothness/continuity on the hidden state (in contrast, e.g., to the rotating shallow-water model used in the sequel); thus the localization approach permits jumps between neighbouring subdomains reconstructed by ensemble members.
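A minimal sketch of the observation-variance modification is as follows (our illustration; an exponential taper with a hard cut-off is shown, while a Gaspari-Cohn taper is a common alternative):

```python
# Sketch of R-localization: inflate observation error variances with distance
# so that remote observations lose influence on the local analysis.
import numpy as np

def localized_R_diag(r_diag, obs_coords, grid_point, radius):
    """Divide each diagonal element of R by a weight in (0, 1] that decays
    with the distance of the observation from the analysis grid point."""
    dist = np.linalg.norm(obs_coords - grid_point, axis=1)
    w = np.exp(-0.5 * (dist / radius) ** 2)       # taper weight
    w = np.where(dist <= 2.0 * radius, w, 0.0)    # cut off beyond the radius
    # Dividing by w -> variance tends to infinity as w -> 0 (obs ignored).
    with np.errstate(divide="ignore"):
        return np.where(w > 0, r_diag / w, np.inf)
```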

Simulation Settings
Let $T = 100$, $L = 1$, $r = 4$ (that is, only one-fourth of the grid points are observed, i.e. the observed coordinates are 4, 8, 12, ...). The matrix $C$ is the same as in (17), $A = -0.95 I_d$, $\sigma_y = \sigma_z = 0.1$, and we set $Z^j_0 \sim -0.15 \times \mathcal{U}_{[0,1]}$ for all $j \in \{1, \dots, \lfloor d/3 \rfloor\}$ and $Z^j_0 = 0$ for the remaining $j$. First, we test all algorithms (SMCMC, EnKF, ETKF, ESTKF and local EnKF (LEnKF)) with $d = 1600$ and $d_y = 400$, and plot the errors in histograms. Then we vary the state dimension $d$ (and hence $d_y$) and compare the SMCMC algorithm against the local EnKF, with R-localization, in terms of the cost for fixed accuracy, by recording the machine time needed by both algorithms once the percentage of absolute errors less than $\sigma_y/2$ reaches 70%. We also compute the cost when the percentage of absolute errors less than $\sigma_y/2$ is 77%. We note that Algorithm 1 is run $M$ times in parallel on a 52-core machine. As for the local EnKF, the parallelization is done over the $\Gamma$ subdomains, since the updates on the local domains are independent. We note that the number of subdomains $\Gamma$ is chosen between 15 and 55, such that it divides $d$.

Results
In Figures 3-6, we plot the means of the filters when $d = 1600$ and $d_y = 400$, for observed and unobserved state coordinates. Figure 3 shows the means of all filters for the unobserved state variable $Z^{(2)}_n$, where we can see that the non-localized ensemble filters are very inaccurate compared to the truth. In Figure 4, we omit the non-localized ensemble filters and keep only LEnKF and SMCMC, to show that both methods do better than the rest when the state variable is unobserved. Figure 5 shows that LEnKF gives a more accurate estimate than the SMCMC filter for the observed state coordinate $Z^{(4)}_n$; however, Figure 6 shows the opposite when the observed coordinate is $Z^{(20)}_n$. This is also illustrated in the histograms in Figure 7. As we can see from the histograms, the error values are much higher for the non-localized ensemble methods, because they fail to accurately estimate the filter for the unobserved state variables. Next we compare SMCMC and LEnKF in terms of accuracy and cost against the state dimension $d$. In Figure 8, we plot the cost, measured in seconds, of both methods in two cases: first, when 70% of the absolute errors are less than $\sigma_y/2$; second, when the requirement for accuracy is higher, that is, when 77% of the absolute errors are less than $\sigma_y/2$. To achieve the higher accuracy we increase the ensemble size for LEnKF and the number of independent repeats $M$ for SMCMC (while fixing $N$ to 500 and $N_{\text{burn}}$ to 100). In the first case, LEnKF runs slightly faster when $d = 1600, 2500, 4900, 8464, 12100$. However, when $d$ becomes larger, i.e. $d = 16384$ and $20164$, SMCMC runs faster. As for the second case, we had to use a large ensemble size for LEnKF to reach the higher accuracy (for example, we used $N = 6000$ ensemble members when $d = 20164$), which made LEnKF much more expensive to run than SMCMC (we ran $M = 350$ independent simulations, each with $N = 500$, when $d = 20164$).

Rotating Shallow-Water Model Observed at Known Locations
We consider a signal of the type in Example 2.1, where the PDE is associated to the shallow water equations (SWEs); the precise system solved is given in Appendix B. In our notation, $g$ is the gravitational acceleration and $f_1$ is the Coriolis parameter, which is assumed to vary linearly with $y$, so that $f_1 = f_0 + \beta(y - y_0)$, where $f_0 = 2\Omega \sin \psi_0$, $\Omega = 7.29 \times 10^{-5}\,\mathrm{s}^{-1}$ is the rate of the earth's rotation, $\psi_0$ is the central latitude of the region under study, $y_0$ is the $y$-value at $\psi_0$, and $\beta$ is the meridional gradient of the Coriolis force at $\psi_0$. Here $\eta$ represents the depth of the water (sea free surface to sea bottom), $H$ is the bathymetry, defined as the distance from the geoid to the sea bottom (positive downwards), and $\zeta$ is the elevation of the free surface measured from the geoid (positive upwards); therefore $\eta = \zeta + H$. $u$ and $v$ are the horizontal velocities in the $x$ and $y$ directions, respectively. The boundary conditions are provided by actual oceanographic data (based on a separate data assimilation procedure) and are time varying. We use the finite-volume (FV) solution of the SWEs [25, 34], which comprises a 2-stage Runge-Kutta method combined with a local Lax-Friedrichs FV scheme, with time step $\tau_k = (t_k - t_{k-1})/L$, hence $t_k = kL$, with $t_0 = 0$; its output consists of $(Z_t)_{0\le t\le T}$, for some time $T > 0$. The details are presented in Appendix B. Let $N_x, N_y \in \mathbb{N}$ be the numbers of cells in the grid in the $x$ and $y$ directions, respectively, with $\Delta_x, \Delta_y$ the corresponding step sizes. The hidden signal at time $t$ is $Z_t = (\eta_t, u_t, v_t)$, where $Z_0$ is known and $(\eta^i_t)_{1\le i\le N_xN_y}$, $(u^i_t)_{1\le i\le N_xN_y}$, $(v^i_t)_{1\le i\le N_xN_y}$ are row vectors obtained from the approximate solver detailed in Appendix B; $\mathsf{Z} := \mathbb{R}^{3N_xN_y}$.
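As a small worked example of the Coriolis specification above, the following sketch computes $f_1$ on the beta plane (only $\Omega$ and the formula are taken from the text; the remaining inputs are user-supplied):

```python
# Beta-plane Coriolis parameter: f1 = f0 + beta * (y - y0), f0 = 2 Omega sin(psi0).
import numpy as np

OMEGA = 7.29e-5  # earth's rotation rate, 1/s

def coriolis(y, y0, psi0_rad, beta):
    f0 = 2.0 * OMEGA * np.sin(psi0_rad)
    return f0 + beta * (y - y0)
```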
At prescribed times $(t_k)_{k\ge 1}$ we add Gaussian noise, $(W_{t_k})_{k\ge 1}$, to the output of the numerical solution of the PDE. To preserve the boundary conditions, the noise is constructed such that it is zero at the boundary. In particular, for $\eta_{t_k}$ and $k \in \mathbb{N}$ we construct a noise field $\Xi^{\eta}_{t_k}$, and similarly for $u_{t_k}$, $v_{t_k}$ we design $\Xi^u_{t_k}$, $\Xi^v_{t_k}$, and then vectorize to obtain $W_{t_k}$. Here $J \in \mathbb{N}$ is a user-chosen number of Fourier modes, $\epsilon^{(i,j)} \sim \mathcal{N}(0, \sigma^2/(i \vee j + 1))$ for $i, j \in \{0, \dots, J-1\}$, where $i \vee j$ denotes the maximum of $\{i, j\}$, and $\sigma > 0$; see Appendix C for the specific implementation approach. In this example we have $d = 3N_xN_y$, and the matrix-vector calculations have been carried out at a cost of $O(d^2)$. We note that the overall algorithm could have been implemented using the FFT at a cost of $O(NLd\log d)$ per observation time, with $L$ being the total number of time-discretization steps used in each $k$, as per Example 3.1.
The observations in this model are obtained from a set of $N_d$ drifters in the region of study, which are assumed to move according to the velocity components of the solution of the SWEs above. The observations (and their locations) are generated from the signal before running the filtering algorithm. The observation model is taken as
$$Y_{t_k} = O_{t_k}(Z_{t_k}) + V_{t_k}, \qquad V_{t_k} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_y^2 I_{d_y}),$$
where $O_{t_k} : \mathsf{Z} \to \mathsf{Y}^{N_d}$ is an $\mathbb{R}^{d_y}$-vector-valued function containing measurements of $(u^i_{t_k})$ and $(v^i_{t_k})$ from the signal $Z_{t_k}$ at time $t_k$ (we do not measure $\eta_{t_k}$). These are collected from drifters whose positions move according to a kinematic model, where in (8) we use the values of the velocity fields at each location and therefore set $h(x^j_t, z_t)$ to be the velocity components of $z_t$ at the position $x^j_t$. Due to the space discretization involved in the PDE, in practice the observation location is chosen as the closest point on the grid at each time. In Figure 9, we present the basic idea of locating the closest point on the grid via an illustration. First we locate the four grid points surrounding each drifter, based on its location (according to (8), in red). Then we pick the closest grid point to each drifter. The set of picked grid nodes corresponds to the approximate spatial locations of the observations (see Figure 9), and similarly the observations are the values of $(u^i_{t_k})$ and $(v^i_{t_k})$ at those nodes. Compared to the ideal physical quantities, this observation model contains a discretization error that approaches zero as $\Delta_x, \Delta_y \to 0$.
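A minimal sketch of this nearest-grid-node observation operator is given below (we assume a regular grid with origin at zero and row-major flattening of the fields; these details are our assumptions):

```python
# Sketch of the observation selection in Figure 9: snap each drifter position
# to the nearest grid node and read the velocity variables stored there.
import numpy as np

def nearest_grid_nodes(drifter_xy, dx, dy, Nx, Ny):
    """drifter_xy: (N_d, 2) positions; returns (N_d,) flat indices i * Nx + j."""
    j = np.clip(np.rint(drifter_xy[:, 0] / dx).astype(int), 0, Nx - 1)
    i = np.clip(np.rint(drifter_xy[:, 1] / dy).astype(int), 0, Ny - 1)
    return i * Nx + j

def observe(z_u, z_v, nodes):
    """Collect (u, v) at the selected nodes from the flattened velocity fields."""
    return np.concatenate([z_u[nodes], z_v[nodes]])
```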

Simulation Settings
The region of simulation is a domain of the Atlantic Ocean restricted to the longitude and latitude intervals $[-51^{\circ}, -41^{\circ}]$, $[17^{\circ}, 27^{\circ}]$, respectively. We use Copernicus Marine Services [12] to obtain the bathymetry $H$, the sea surface height above the geoid $\eta$, and the horizontal velocities $u$ and $v$, with a 1/12 degree horizontal resolution, at an hourly rate from 2020-03-01 to 2020-03-05. The values of $\eta$, $u$ and $v$ at time 00:00 2020-03-01 are used as the initial state $z_0$. As for the boundary conditions, we also use the values of $\eta$, $u$ and $v$ (at the boundaries) interpolated in time. For the observations, we assume that there are $N_d = 12$ drifters in the region during the period of simulation that observe $u$ and $v$. We obtain the drifters' initial locations from the set of data available in [16, 17]. Drifters stay at the surface of the water and move with the surface currents; their positions are tracked every 10 minutes to 1 hour (depending on the programme), and they can move at up to 2 m/s, so up to 100 km a day on e.g. the Gulf Stream, but typically much less (< 10 km), especially away from the boundary currents. They provide Lagrangian horizontal velocity data near the surface.

Results with synthetic drifter locations
In Figure 10 we show a histogram of the absolute errors, defined as the absolute difference between the values of the filter and the hidden signal (from which the data is generated), over all state variables and all times. Furthermore, in Figure 13 we present snapshots of the hidden signal, the filter and their difference after 32 hrs. Also shown in the snapshots are the tracks of the drifters, which were computed before implementing the filtering algorithm. The ratio of the number of observations to the number of state variables is $24/43923 = 0.055\%$. Even with such a small number of observations, the results show that the filtering technique presented here is quite effective. Finally, we note that these simulations took about 67 hours to run 26 independent repeats on 52 cores.

Rotating Shallow-Water Model Observed at Unknown Locations
Here we consider the same model as in the previous section, except that the spatial locations of the observational data are now assumed unknown, and we use real drifter data from NOAA. For $t_k \in \mathsf{T}$, the set of observations consists of the measurements of $u$ and $v$ obtained by the drifters in the region of simulation at time $t_k$. To evaluate the function $G((z_{t_k}, x_{t_k}), y_{t_k})$, which is a Gaussian density with mean $O_{t_k}(z_{t_k})$ and covariance matrix $\sigma_y^2 I_{d_y}$, at the given observations at time $t_k$, we need to determine the mapping $O_{t_k} : \mathsf{Z} \to \mathsf{Y}^{N_d}$. This is done in the same way as in the previous example, except that now we use the estimate of the mean spatial locations of the drifters, as in Example 3.1.

Simulation Settings
The simulation region is the same as in the preceding example. The data used in this analysis were provided by [16, 17]; they show the presence of 12 drifters in that region during the simulation period, along with hourly measurements of $u$ and $v$, which we also extrapolated over time. The mean error of the measurements during the simulation period, computed over all times and all drifters, is denoted by $\sigma_y = 0.0145$, whereas the minimum error value is 0.0012 and the maximum error value is 0.0375. Here we should point out two ways in which the data in this example differ from those in the preceding example: i) the data in this case are considered to have been obtained at unknown locations, whereas the data in the previous example were taken at known locations; ii) second and most importantly, the data in this example are real measurements, whereas those in the other example are synthetic. Then, using the same parameters as before, we run 26 independent simulations of Algorithm 2 in parallel.

Results with real data
In Figure 14 we show a histogram of the absolute errors, defined as the absolute difference between the filter mean and the reference signal. The histogram shows that 93.6% of the filter values are within $\sigma_y/2$ of the reference signal values. All comparisons are with a reference signal that approximates the mean of the prior distribution; in this example it is taken as the mean of 50 independent runs of the SW dynamics with noise, using the same initial value $Z_0$ and the same boundary conditions. Furthermore, in Figure 17 we present snapshots of the hidden signal, the filter and their difference after 32 hrs. Also shown in the snapshots are the tracks of the drifters, in red and blue. Red tracks refer to the mean spatial locations of the drifters computed according to the reference signal, whereas the blue tracks refer to the spatial locations obtained from the data set in [17]. The latter are not used by the algorithm, which manages to estimate them on the fly.
A direct comparison of the absolute differences (third row) in Figure 17 with those in the known observer trajectory case of Figure 13 cannot be made, due to the differences between the reference and the hidden signal. In Figure 17, it is notable that, despite the lack of observer position trajectories, the differences between the posterior and prior means suggest informative likelihoods for $u$, $v$, and that the posterior does manage to gain significant information for these variables. In contrast, for $\eta$ these differences are smaller, and learning occurs via the dependence of $\eta$ on $u$, $v$ in the dynamics. The results for this example again show that the filtering technique presented here is quite effective, even when the spatial locations of the observations are assumed unknown. Finally, we note that these simulations took around 68.5 hours to run 26 independent repeats on 52 cores.

Discussion
In this article we have considered the adaptation of the SMCMC method to a class of high-dimensional filtering problems. We showed in several examples that this approach holds great promise relative to some of the existing state-of-the-art numerical methods. In addition, the method is provably convergent in several contexts, even for nonlinear problems. It should also be noted that the methodology presented is not restricted to the class of problems considered in this article and can, in general, be applied to many filtering problems, as evidenced in e.g. [11]. It is simple to extend the models to include static (in-time) and unknown parameters, which are to be estimated using Bayesian methods; this estimation can be performed using the methods considered in this article.
As SMCMC naturally depends upon an MCMC kernel, the ability of this kernel to explore the state space at a cost that is competitive with existing methods is rather important. As several articles have considered the behaviour of MCMC in high dimensions (e.g. [18]), and sometimes its success, one might expect that this aspect is not a bottleneck to its application. Nonetheless, many high-dimensional MCMC kernels need care in their design, so the application of SMCMC is far from fool-proof. Intrinsic to the application of the method for filtering is an approximation of the filtering distribution: the mathematical performance of the algorithm, taking this approximation into account and considering high dimensions, is yet to be studied and is an important line of future work.
A Detailed pseudocode for Algorithm 1
Recall from Section 4.3 that $O_{t_k} : \mathsf{Z} \to \mathsf{Y}^{N_d}$ is an $\mathbb{R}^{d_y}$-vector-valued function. We give below, in Algorithms 3-4, the pseudocode to implement Algorithm 1 for this example at a cost of $O(dN)$. Using an auxiliary variable method (e.g. see [2, Section 5]), instead of sampling from $\pi_k^N(z_{t_k})$ one samples from the joint distribution
$$\pi_k^N(z_{t_k}, j) \propto g_k(z_{t_k}, y_{t_k})\, f_k(Z^{(j)}_{t_{k-1}}, z_{t_k})\, p(j),$$
where $p(j)$ is the uniform distribution over $\{1, \dots, N\}$, so that $\pi_k^N(z_{t_k}, j)$ admits $\pi_k^N(z_{t_k})$ as its marginal. In practice, one would run $M$ independent copies of Algorithms 3-4 in parallel and then use averages for performing inference.
Algorithm 4 discards the first $N_{\text{burn}}$ steps, required for convergence to stationarity. In addition, note that the specification of $Q'$ in Algorithm 4 (the covariance of the proposal distribution) requires careful design. When considering the SW model, our approach was to construct a noise similar to the one described in Section 4.3. In our simulations we set $W'$ as $W_{t_k}$ in Section 4.3, except that the random Gaussians $\epsilon^{\cdot,(i,j)}_{t_k}$ are sampled from $\mathcal{N}(0, \sigma'^2/(i \vee j + 1))$, where $\sigma' = 5 \times 10^{-2}$. This choice leads to an acceptance ratio $\alpha$ in the range 0.2-0.3. For the auxiliary variable we used a simple random walk with $q = 0.33$. This is a simple option that was effective here, but more elaborate schemes are possible by defining appropriate discrete transition matrices on $\{1, \dots, N\}$; a sketch of the joint move is given below.
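The following sketch (ours) implements one joint update of $(z, j)$; we read the $q = 0.33$ setting as proposing $j \pm 1$ with probability $q$ each (with wrap-around), which is an assumption of this sketch.

```python
# Sketch of the auxiliary-variable move that removes the O(N^2) cost:
# target pi_k^N(z, j) ∝ g_k(z, y) f_k(Z_j, z) / N, updating (z, j) jointly,
# so only one transition density evaluation is needed per step.
import numpy as np

def joint_step(z, j, prev_samples, y, log_f, log_g, step_size, q, rng):
    N = len(prev_samples)
    # Symmetric proposal on (z, j); its density cancels in the Metropolis ratio.
    u = rng.uniform()
    j_prop = (j + 1) % N if u < q else ((j - 1) % N if u < 2 * q else j)
    z_prop = z + step_size * rng.standard_normal(z.shape)
    log_ratio = (log_g(z_prop, y) + log_f(prev_samples[j_prop], z_prop)
                 - log_g(z, y) - log_f(prev_samples[j], z))
    if np.log(rng.uniform()) < log_ratio:
        return z_prop, j_prop
    return z, j
```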
We also note that the implementation presented in Algorithms 3-4 is not the most efficient, and significant computational savings are possible. The current presentation was chosen for the sake of clarity, but it should be noted that step 2(a) in Algorithm 3 does not need to be executed for every $i \in \{1, \dots, N\}$, but only for the sampled indices $\{j_i\}_{i=1}^{N_{\text{burn}}+N}$. This means that one can move step 2(a) of Algorithm 3 into Algorithm 4 and compute $Z^{(i)}_{t_k}$ only once (and then save it in memory) for each $i \in \{j_i\}_{i=1}^{N_{\text{burn}}+N}$. Given that this involves an expensive PDE evolution, and the number of unique elements in $\{j_i\}_{i=1}^{N_{\text{burn}}+N}$ can be considerably less than $N$ (due to the rejections and the random walk exploration), this approach can lead to a substantial computational saving.

B Numerical solution of the SWE
To write the SW equations in a compact form, we introduce the vector of conserved variables $U$ and the corresponding flux functions; in the local Lax-Friedrichs flux, $\lambda^x_{i^*,j^*,\max}$ denotes the maximum eigenvalue of the Jacobian matrix $\partial A(U)/\partial U$ evaluated at the relevant cell interface value of $U$.

C Implementation of the noise
We show how the noise is computed for each of the $\eta$, $u$, $v$ variables; as the procedure is the same for each, we describe it once. Let $S_1 \in \mathbb{R}^{N_y \times J}$ and $S_2 \in \mathbb{R}^{N_x \times J}$, and let $\epsilon_{t_k}$ be a random $J \times J$ matrix with independent entries $\epsilon^{ij}_{t_k} \sim \mathcal{N}(0, \sigma^2/(i \vee j + 1))$, for $i, j \in \{0, \dots, J-1\}$, with $\sigma > 0$. Define $\Xi_{t_k} := S_1 \epsilon_{t_k} S_2^T \in \mathbb{R}^{N_y \times N_x}$ and let $X_{t_k} := \mathrm{Vec}(\Xi_{t_k}) \in \mathbb{R}^{N_xN_y}$. Then $X_{t_k}$ is a random sample of the noise, and one has $\mathbb{E}[X_{t_k}] = 0$. Dropping the subscript $t_k$ from $\Xi_{t_k}$, the noise covariance matrix is given by the element-wise expectations $\mathbb{E}[\Xi_{ij}\Xi_{i'j'}]$, where $\Xi_{ij}$ is the element of $\Xi$ in the $i$th row and the $j$th column. In Figure 19, we show a random sample of the noise used in the rotating SW model.
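A minimal sketch of this construction is given below; the choice of sine functions for the columns of $S_1$ and $S_2$ (which enforces zero noise at the boundary, consistent with Section 4.3) is our assumption.

```python
# Sketch of the boundary-preserving noise: a J x J matrix of independent
# Gaussian mode amplitudes is mapped to the grid through sine bases, so the
# resulting field vanishes on the boundary.
import numpy as np

def noise_sample(Nx, Ny, J, sigma, rng):
    ii, jj = np.meshgrid(np.arange(J), np.arange(J), indexing="ij")
    std = sigma / np.sqrt(np.maximum(ii, jj) + 1.0)        # Var = sigma^2/(i∨j+1)
    eps = rng.standard_normal((J, J)) * std
    x = np.linspace(0.0, 1.0, Nx)
    y = np.linspace(0.0, 1.0, Ny)
    S1 = np.sin(np.pi * np.outer(y, np.arange(1, J + 1)))  # (Ny, J), zero at ends
    S2 = np.sin(np.pi * np.outer(x, np.arange(1, J + 1)))  # (Nx, J), zero at ends
    Xi = S1 @ eps @ S2.T                                   # (Ny, Nx) noise field
    return Xi.ravel()                                      # Vec(Xi), length Nx*Ny
```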

Algorithm 2 SMCMC Method for Filtering with Unknown Locations, for $n$ time steps.
1. Initialize: Compute $\widehat{x}_{t_1}$ using the recursion formulae (10)-(11) and, for any given initial distribution on $\mathsf{Z}$, run an MCMC kernel of invariant measure $\bar{\pi}_1$ for $N$ steps. Keep $\bar{\pi}_1^N(\varphi)$ in memory. Set $k = 2$.
2. Update: Compute $\widehat{x}_{t_k}$ using (13)-(15) and, for any given initial distribution on $\mathsf{Z}$, run an MCMC kernel of invariant measure $\bar{\pi}_k^N$ for $N$ steps. Keep $\bar{\pi}_k^N(\varphi)$ in memory. Set $k = k + 1$. If $k = n + 1$ go to step 3; otherwise return to the start of step 2.
3. Return the approximations $\{\bar{\pi}_k^N(\varphi)\}_{k\in\{1,\dots,n\}}$.

Figure 1: (Model in Section 4.1.) Comparison of machine times of SMCMC versus ensemble methods for different state dimensions $d$ at fixed accuracy. This is the total time spent on all simulations when the fraction of absolute errors smaller than $\sigma_y/2$ is approximately 70%.

Figure 4: (Model in Section 4.2.) This illustrates the means of the KF (truth), LEnKF and SMCMC for the unobserved state variable $Z^{(2)}_n$.

Figure 7: (Model in Section 4.2.) Histogram of absolute errors $|\text{Filter}_{\mu} - \text{KF}_{\mu}|$ at all state variables and all times. Here we take $d = 1600$ and $d_y = 400$. The percentage of occurrence is defined as the number of elements in the bin divided by the total number of elements, $d \times (T + 1)$.

Figure 9: The illustration depicts the process of selecting which state variables are observed, based on the drifter's location. The red circles represent a drifter at various points in time, and the red curve indicates its track. The blue circles correspond to the nearest surrounding grid points at the times of observation.

Figure 10: (Known locations example) Histogram of absolute differences $|\text{Filter Mean} - \text{Signal}|$ at all state variables and all times.

Figure 11: Snapshot of the simulation at time = 14 hrs for the SW model with observations at known spatial locations.

Figure 12: Snapshot of the simulation at time = 26 hrs for the SW model with observations at known spatial locations.

Figure 13: Snapshot of the simulation at time = 32 hrs for the SW model with observations at known spatial locations. We present the hidden signal, the filter mean and their difference for the rotating shallow-water HMM with observations at known locations. Red curves illustrate the tracks of the drifters in the region, which move according to the signal.

Figure 14: (Unknown locations example) Histogram of learning differences $|\text{Filter Mean} - \text{Prior}|$ at all state variables and all times. The percentage of occurrence is defined as the number of elements in the bin divided by the total number of elements, $d \times (T + 1)$.

Figure 15: Snapshot of the simulation at time = 14 hrs for the SW model with observations at unknown spatial locations.

Figure 16: Snapshot of the simulation at time = 26 hrs for the SW model with observations at unknown spatial locations.

Figure 17: Snapshot of the simulation at time = 32 hrs for the SW model with observations at unknown spatial locations. We show the prior mean (reference signal), the filter mean and their difference for the rotating shallow-water HMM with observations at unknown spatial locations. The blue curves illustrate the tracks of the drifters according to the data from [17], while the red curves illustrate the mean of 50 tracks of drifters moving according to the prior. The red and blue tracks are presented here only for the purpose of illustration; Algorithm 2 does not use the blue tracks, but aims to estimate them. See also Figure 18, in which the region $[4 \times 10^5, 7 \times 10^5] \times [4 \times 10^5, 7 \times 10^5]$ is zoomed in to show the difference between the blue and red drifters' tracks.

Figure 19: A random sample of the noise as described in this section.
Algorithm 1 Sequential MCMC (SMCMC) Method for Filtering, for $n$ time steps.
1. Initialize: For any given initial distribution on $\mathsf{Z}$, run an MCMC kernel of invariant measure $\pi_1$ for $N$ steps. Keep $\pi_1^N(\varphi)$ in memory. Set $k = 2$.
2. Update: For any given initial distribution on $\mathsf{Z}$, run an MCMC kernel of invariant measure $\pi_k^N$ for $N$ steps. Keep $\pi_k^N(\varphi)$ in memory. Set $k = k + 1$. If $k = n + 1$ go to step 3; otherwise return to the start of step 2.
3. Return the approximations $\{\pi_k^N(\varphi)\}_{k\in\{1,\dots,n\}}$.

The overall cost can then be cut to $O((\kappa(d) + \kappa_y(d_y))N)$, using a simple auxiliary variable method on the sample indicator, which is what we implement; see Appendix A for details. The cost due to dimensionality varies depending on the modelling; e.g. for a Gaussian density as in Example 2.1, the cost of calculating $f_k$ depends on the covariance matrix: $\kappa(d) = d$ if it is diagonal, and $\kappa(d) = d(d+1)/2$ for a general dense covariance matrix, assuming a Cholesky decomposition can be performed offline and $Q$ is the same for every $k$. Also, in the case of a regular grid and stationary Gaussian noise, one can exploit the Toeplitz structure of the matrices involved in the calculations and use the FFT, leading to a cost of $\kappa(d) = d\log(d)$.

Table 1: (Model in Section 4.1.) Comparison of SMCMC and ensemble methods for different state dimensions $d$.