Efficient computation of matrix-vector products with full observation weighting matrices in data assimilation

Recent studies have demonstrated improved skill in numerical weather prediction via the use of spatially correlated observation error covariance information in data assimilation systems. In this case, the observation weighting matrices (inverse error covariance matrices) used in the assimilation may be full matrices rather than diagonal. Thus, the computation of matrix-vector products in the variational minimization problem may be very time-consuming, particularly if the parallel computation of the matrix-vector product requires a high degree of communication between processing elements. Hence, we introduce a well-known numerical approximation method, called the fast multipole method (FMM), to speed up the matrix-vector multiplications in data assimilation. We explore a particular type of FMM that uses a singular value decomposition (SVD-FMM) and adjust it to suit our new application in data assimilation. By approximating a large part of the computation of the matrix-vector product, the SVD-FMM technique greatly reduces the computational complexity compared with the standard approach. We develop a novel possible parallelization scheme of the SVD-FMM for our application, which can reduce the communication costs. We investigate the accuracy of the SVD-FMM technique in several numerical experiments: we first assess the accuracy using covariance matrices that are created using different correlation functions and lengthscales; then investigate the impact of reconditioning the covariance matrices on the accuracy; and finally examine the feasibility of the technique in the presence of missing observations. We also provide theoretical explanations for some numerical results. Our results show that the SVD-FMM technique has potential as an efficient technique for assimilation of a large volume of observational data within a short time interval.


Introduction
In variational data assimilation (e.g., Lorenc et al., 2000;Rawlins et al., 2007), a nonlinear least squares problem is solved, where observations and model forecasts are blended, taking account of their uncertainties. The minimization procedure involves time-consuming computations of large matrix-vector products. In this paper we focus on the matrix-vector products involving the observations. These take the form R −1 d, where R −1 ∈ R m×m is the inverse of the observation error covariance matrix and d ∈ R m is the observationminus-model departure vector (see section 2).
In some practical numerical weather prediction applications, observation errors are assumed to be uncorrelated, resulting in the matrix R being diagonal. This reduces the number of operations required to compute matrix-vector products in the minimization, and is a pragmatic strategy when the characteristics of the observation uncertainty are not well understood (e.g., Liu and Rabier, 2003). However, a number of idealized and operational studies have shown that there are significant benefits to treating full observation error covariance matrices in terms of analysis information content, analysis accuracy and forecast skill, even when knowledge of the observation error correlations is only approximate (e.g., Healy and White, 2005;Stewart et al., 2008Stewart et al., , 2013Weston et al., 2014;Bédard and Buehner, 2020). In particular, implementation of spatial observation error correlations modifies the length-scales of the observation increments computed by the assimilation (e.g., Fowler et al., 2018;Rainwater et al., 2015;, which may be especially important for multiscale systems such as convectionpermitting numerical weather prediction and reanalyses (e.g., Hu and Franzke, 2020) or coupled ocean-atmosphere systems (e.g., Hu and Franzke, 2017). Furthermore, practical methods have been developed to estimate the observation uncertainty characteristics for a range of observation types (see the reviews by Janjić et al. (2018) and Tandeo et al. (2020)). For example, Doppler radar radial winds, geostationary satellite data and atmospheric motion vectors (AMVs) have been shown to exhibit strong spatial error correlations (Waller et al., 2016a,b;Cordoba et al., 2017;Honda et al., 2018). We note that in practical applications, the matrix R is typically treated as block diagonal with one block per observation type (for a given time). The observation errors between different types of observations are assumed to be uncorrelated.
To give an idea of the expected size of the spatially correlated observation error covariance matrices, we consider geostationary satellite observations from the SEVIRI (Spinning Enhanced Visible and Infrared Imager) instrument (Waller et al., 2016b;Michel, 2018;Schmetz et al., 2002). SEVIRI measures top of the atmosphere radiances in 12 channels, with a 15-minute repeat cycle and at approximately 3 km spatial resolution (excluding the high resolution visible (HRV) channel; Waller et al., 2016b;Schmetz et al., 2002). Each image consists of around 10 7 pixels for the thermal infra-red and solar channels (Schmetz et al., 2002). Hence, there are a vast number of SEVIRI observations, even for a regional domain. For instance, the number of SEVIRI observations within the domain covered by the operational AROME model from Météo-France is around 4 × 10 5 (Seity et al., 2011). In operational assimilation applications, to avoid treating spatially correlated observation errors, spatial thinning of SEVIRI observations is typically applied and this reduces the number of observations by several orders of magnitude, but also reduces their benefits in improving forecast skill (Waller et al., 2016b;Michel, 2018). The next generation of meteorological satellites will produce observations with even higher spatial resolutions (WMO, 2017, Table 1). For example, the Flexible Combined Imager (FCI) on the Meteosat Third Generation (MTG) satellite and the Advanced Baseline Imager (ABI) on the Geostationary Operational Environmental Satellite-R (Schmit et al., 2017) both take measurements at approximately 0.5 -2 km spatial resolution, and the Advanced Geosynchronous Radiation Imager (AGRI) on the Fengyu-4 geostationary meteorological satellite produces observations at a resolution from 1-4 km (Yang et al., 2017).
Accounting for spatially correlated error statistics in the data assimilation algorithm has potential to increase the computational cost for several reasons: (1) the computation of large matrix-vector products using parallel computing techniques, (2) the need to compute products using the inverse covariance matrix which may be different in each assimilation cycle, (3) changes to the convergence behaviour of the minimization procedure. We now review these issues in more detail.
The first issue is that the parallel computation of a large matrix-vector product with observation data distributed across multiple processing elements (PEs) may become expensive due to excessive communications between PEs and load imbalance overheads (Deng, 2012). Different partitions of the matrix will lead to different parallelization schemes, which will be described in section 2. In general, in order to reduce the time spent on communication operations, it is necessary to reduce the number and/or size of messages transferred. In the context of data assimilation, much of this communication can be avoided if all the observations with mutually correlated errors are assigned to one PE, as in . However, this relies on the data volume of observations with mutually correlated errors being small enough for the product to be calculated on one node. It is an open question how to implement computations for larger data volumes with distributed data. While our paper is focused on parallel implementations for variational assimilation, studies such as Anderson (2003) and Nino-Ruiz et al. (2019) address parallel implementations for ensemble methods.
The second issue is that there is a need to use the inverse observation error covariance matrix (R −1 ) in the computations. The most commonly used observation uncertainty diagnosis techniques provide an estimate of the observation error covariance matrix itself rather than its inverse (Desroziers et al., 2005). Since the observation distribution changes each assimilation cycle (due to quality control etc.) there is a different inverse matrix each cycle.  deal with this by using a Cholesky decomposition method (Golub and Van Loan, 1996) which avoids the need to compute the inverse covariance matrix directly and is applicable to any form of error covariance matrix. Guillet et al. (2019) model the inverse of a spatially correlated observation error covariance matrix directly using a diffusion operator and fast unstructured meshing techniques. However, this method only deals with spatial error correlations and it is unclear if this approach is also suitable when spatial error correlations and inter-channel error correlations are combined.
The third issue is that the use of correlated observation error statistics changes the convergence behaviour of the minimization procedure. Indeed, the pre-operational experiments of Weston et al. (2014) showed problems with the minimization that were improved by reconditioning the observation error covariance matrix. The convergence of the minimization procedure has been further studied by Tabeart et al. (2018Tabeart et al. ( , 2020aTabeart et al. ( , 2021 who found that the minimum eigenvalue of the observation error covariance matrix is a key parameter governing the speed of convergence. Hence, the use of reconditioning methods is common in operational applications with correlated observation error statistics (e.g., Bormann et al., 2016;Campbell et al., 2017;Tabeart et al., 2020a,b).
The aim of this paper is to carry out an investigation of a method to accelerate parallel matrix-vector products with distributed observational data, using a novel application of a well-known numerical approximation algorithm, the fast multipole method (FMM; Rokhlin, 1985;Greengard and Rokhlin, 1997). We explore a particular type of FMM that uses a singular value decomposition (SVD) (Gimbutas and Rokhlin, 2003). We call this method the SVD-FMM. The key idea of the SVD-FMM is to split up the computation of the matrix-vector product into separate near-field and far-field calculations. The near-field calculations are done by a standard matrix-vector multiplication of a local sub-matrix and sub-vector. The far-field calculations are carried out using the singular values and singular vectors of the sub-matrices of R −1 . The sub-matrices and sub-vectors are determined by selecting specific rows and columns of R −1 corresponding to a partition of observations in the domain. The number of singular vectors employed in the calculation is chosen by the user. We will show that usually only a small number of singular vectors is needed to obtain a good accuracy (section 5). The SVD-FMM technique reduces the number of floating point operations required compared with the standard approach for matrix-vector multiplication. Additionally, we develop a novel possible parallel algorithm for the SVD-FMM for our application in data assimilation, which can greatly reduce the costs of communication between PEs. The SVD-FMM allows us to assimilate a large volume of observational data in a short time interval, and has the potential to be used in practical applications. In this initial investigation, our numerical results focus on evaluating the accuracy of the SVD-FMM. Thus our experiments are carried out for an idealized problem, using serial, rather than parallel computing. Nevertheless, we do compare the efficiency of the proposed parallelization scheme with different parallel formulations of standard matrix-vector multiplications in terms of communication costs. In our current implementation, the method needs to be applied after we obtain a representation of the matrix R −1 . In order to have a clear focus solely on computing matrix-vector products we do not address the computation of the inverse observation error covariance matrix. Instead, we assume that the inverse observation error covariance is already known.
In our experiments we show that the SVD-FMM can work well with the inverses of a variety of covariance matrices. In particular, we apply the SVD-FMM to the inverses of covariance matrices created using different correlation functions and lengthscales. We also use the reconditioning methods of Tabeart et al. (2020b) together with the SVD-FMM to gain insight into how the accuracy of the SVD-FMM should change with different levels of reconditioning. In practice, the observation distribution varies each assimilation cycle due to factors such as quality control and the removal of cloudy satellite radiance observations. Therefore, we further carry out some experiments to demonstrate that the SVD-FMM is feasible even if there are some missing observations. The rest of this paper is organised as follows: We provide the parallel formulations of standard matrixvector multiplication in section 2. In section 3 we present our novel algorithm for the SVD-FMM and compare the complexity and the communication costs between the SVD-FMM and standard, parallel matrix-vector multiplication. In section 4 we explain our experimental design for our idealized experiments. In particular, we describe the generation of observations and observation error covariance matrices as well as the reconditioning methods. In section 5 we show the results of our numerical experiments with varying correlation functions, length scales, reconditioning methods and condition numbers. We also show how the results change with missing observations. Finally we give a summary in section 6 and conclude that our proposed algorithm has potential for use in operational data assimilation for fast computation of large matrix-vector products.

Parallelization of standard matrix-vector multiplication
In this section we describe three distinct standard parallel formulations for computing large matrix-vector products (see Grama et al., 2003, Section 8.1;Deng, 2012, Section 6.2.1). We will also discuss how to exploit the symmetric structure of R −1 for parallelization (Geus and Röllin, 2001). These matrix-vector products arise in the solution of the variational minimization problem in data assimilation in the form where A ∈ R m×m denotes the inverse of the observation error covariance matrix, d ∈ R m denotes the observation-minus-model departure vector, and q ∈ R m denotes the result of the multiplication. For the purposes of this description, we assume that the matrix A and vector d are already known. To see how matrix-vector products in the form of Eq. (1) arise in data assimilation, we consider the observation penalty term of the variational data assimilation cost function. This is given by multiplying the inverse observation error covariance matrix by the observation-minus-model differences where y ∈ R m denotes the observation vector, x ∈ R n denotes model state and H denotes the observation operator that maps model state to the observation (H : R n → R m ; e.g., Lorenc et al., 2000;Rawlins et al., 2007;Nichols, 2010). To solve the variational minimization problem, the gradient of Eq.
(2) is needed (e.g., Thus matrix-vector products of the form of Eq. (1) arise in both the computation of the cost function and its gradient, with A = R −1 and d = y − H(x). The parallelization schemes for computing Eq. (1) start with the distribution of observations over PEs. We consider a simple domain decomposition, in which the observations are distributed over a number of PEs according to their geophysical locations. This will result in a split of the components of matrix A and vector d across PEs. We will introduce four different partitions of the matrix (see Fig. 1). Each of them will lead to a unique parallelization scheme. The communication costs of each scheme depend on various parameters, including i) the time to prepare a message for transmission, ii) the time it takes for a message to travel (latency), iii) how many words can traverse per second (bandwidth), iv) how many PEs to communicate with and v) the message size (Grama et al., 2003). Since the first three parameters are determined by the configuration of the parallel machine, we discuss the communication costs for different parallelization schemes using the last two parameters.
(a) Row-wise partitioning Figure 1: Different ways of partitioning matrices (represented by squares) and vectors (represented by bars) for parallel computation of the matrix-vector products. Different colours demonstrate the portions of matrices and vectors that are distributed over four PEs (P 0 , P 1 , P 2 and P 3 ).

Row-wise partitioning
We first consider a row-wise partitioning, in which case each PE stores one row or several rows of A and one element or a portion of d. Fig. 1(a) illustrates the partitioning using four PEs. In order for each PE to perform its computation, we need to distribute the full vector among all the PEs. This requires an allto-all broadcast. Based on table 4.1 in Grama et al. (2003), the communication cost of this communication operation is ts log B + twm, where B is the number of PEs, ts is the startup time and tw is the perword transfer time (Grama et al., 2003). The two time parameters depend on computer architecture and performance. After the communication, each PE multiplies its m/B rows with the vector, which requires on the order of m 2 /B floating point operations.

Column-wise partitioning
Instead of storing row(s), each PE can store the column(s) of A. Fig. 1 the local computation, we need to perform an all-to-one reduction to sum up the partial results given by each PE, which takes time (ts + twm) log B (Grama et al., 2003, Table 4.1). After the all-to-one reduction only one PE contains the full vector q and thus, we may need to redistribute q. This requires a scatter operation which takes time ts log B + twm (Grama et al., 2003, Table 4.1). Although the column-wise partitioning circumvents the use of an expensive all-to-all broadcast operation, it increases the message size for data-transfer operations. The message size for the all-to-one reduction is m, whereas the message size for the all-to-all broadcast used for the row-wise partitioning is m/B. It should be noticed that in data assimilation applications, the matrix A is known to be symmetric, which means that the parallel algorithm using column-wise partitioning can also be used with row-wise partitioning.

Block 2-D partitioning
The row-wise partitioning and column-wise partitioning are 1-D partitioning. We now consider a 2-D partitioning, which distributes the blocks of A among PEs. An example of the block 2-D partitioning is shown in Fig. 1(c). Note that the matrix A is equally separated by 4 PEs, while the vector d is only distributed among 2 PEs that own the diagonal blocks of A, i.e., P0 and P3. The first step is for each PE that stores a portion of d to broadcast that portion to the other PEs in the same column, which requires a column-wise one-to-all broadcast. The communication time of this operation is (ts+twm/ √ B) log √ B. The next step is to perform a row-wise all-to-one reduction, which requires another (ts+twm/ √ B) log √ B times. Grama et al. (2003) showed that computing the matrix-vector product using the block 2-D partitioning of the matrix can be faster than using 1-D partitioning for the same number of PEs. However, a potential problem is that when using the block 2-D partitioning, the vector elements are not distributed among all the PEs and this can result in load imbalance.

Partitioning for symmetric matrices
A possible partitioning of A and d taking into account the symmetric structure of A is shown in Fig.  1(d). The first step is to exchange the portions of d among PEs. This requires an all-to-all broadcast with an average message size of m/B. It should be noticed that not all PEs need to send their portion of d to all other PEs. In the case of four PEs, P0 sends data to P1, P2 and P3, P1 to P2 and P3, and P2 to P3. The second step is to multiply the vector elements with the local part of the matrix. The last step is to exchange the results of local computation using an all-to-one reduction. The average message size for this operation is m/B. In the case of four PEs, P0 collects the results from all other PEs, P1 collects the results from P2 and P3, and P2 collects the result from P3. The last PE does not need to collect the results from other PEs. Slightly different parallelization schemes may be used for this kind of partitioning, depending on the computer architecture (Geus and Röllin, 2001). This partitioning saves the storage space of each PE (only half of the matrix is stored) and keeps the load balanced. Furthermore, this partitioning has a very large advantage for sparse matrices, in which case each PE only needs to communicate with neighbouring PEs (Geus and Röllin, 2001).
In the next section, we describe a different approach for calculating large matrix-vector products, that is applicable to any matrix (most useful for full or dense matrices). This approach reduces the computational cost for serial calculations of Eq. (1) and the message size for communication operations.

Application of the fast multipole method (FMM) to data assimilation
The fast multipole method (FMM) was originally presented for the rapid evaluation of all pairwise interactions between a large number of charged particles (Rokhlin, 1985;Greengard and Rokhlin, 1997). The mathematical form of this fast summation is equivalent to Eq. (1). While the direct calculation of Eq. (1) requires O(m 2 ) work, the FMM needs only O(m) operations. In addition, the FMM relies on a hierarchical division of the computational domain, which is well suited for parallel computing. In the classical FMM, the matrix A is given by some functions describing pairwise interactions between particles, such as those describing gravitational and electrostatic potentials. In our application, however, the matrix A is obtained by inverting the matrix R. Therefore, we need a generalized FMM that does not require an underlying analytic function (Gimbutas and Rokhlin, 2003).

Separation of the matrix-vector product into near-and far-field terms
Since the SVD-FMM is not commonly used in meteorology, we first explain the key idea behind the technique and give a simple example to aid the reader's understanding, before going into the mathematical (1) in two parts: where the column indices of A and the indices of d are divided into two mutually independent sets denoted by I1 and I2. The indices are selected according to a partition of observations (described in more detail in section 3.2). The first part of Eq. (4), referred to as the near-field calculation, is computed using a standard matrix-vector multiplication and the second part, referred to as the far-field calculation, is estimated using an approximation. In the SVD-FMM, the far-field calculation is computed using the singular value decomposition (SVD) of A(·, I2). We provide a simple example to aid the reader's understanding.
Example 3.1 Suppose we have a matrix A ∈ R 2×3 and a vector d ∈ R 3 given by A standard matrix-vector multiplication gives We can also compute q by partitioning the problem as in Eq. (4) by letting I1 = {2} (the near field) and I2 = {1, 3} (the far field), such that The singular value decomposition (SVD) of the sub-matrix in the second term is given by where u1, u2 ∈ R 2 are the orthonormal left singular vectors, σ1, σ2 are the scalar singular values, and v1, v2 ∈ R 2 are the orthonormal right singular vectors. If σ1 σ2 then we can truncate the SVD to give an approximation for the far field term of Eq. (5) as More generally, for a larger problem, we can use only the first few singular vectors and singular values to estimate the far-field calculation. If the matrix A remains fixed, we can use the same singular vectors and singular values to compute q for any vector d. This will reduce the cost for calculating Eq. (1) (unless the matrix A is sparse). Furthermore, storing the singular vectors and singular values requires less memory  than storing the full far-field sub-matrix. Note that if the matrix A changes too often, then we would need to perform the SVD again and again, which would be computationally expensive.
The example we have discussed in this section only shows the basic idea of the SVD-FMM. The actual SVD-FMM algorithm is multi-level and more complicated. Before we can address this, we explain the partitioning of the observations that is needed for the multi-level algorithm.

Partition of observations into a hierarchical structure of nested boxes
In this section, we describe the partition of observations and explain the notation that we use. Note that we use the nomenclature found in the FMM literature (e.g., Gimbutas and Rokhlin, 2003). Suppose that we have m observations. These could be located on a latitude-longitude grid, or irregularly distributed. We first find a minimal square (or rectangle) that covers the locations of all of the observations, and then hierarchically subdivide the observation domain into smaller and smaller boxes to generate a box-treea hierarchical structure of nested boxes. An example is given in Fig. 3. In the tree structure, level 0 refers to the biggest box that covers the entire observation domain, and level l + 1 refers to the boxes that are obtained from level l by subdividing each box into four smaller boxes of equal size. In our particular example, we choose level 3 as the highest level, which is the minimal number of levels required for the present SVD-FMM approach. The boxes at the highest level are called leaf boxes.
In practical applications, the number of levels is determined such that the averaged number of observations in the leaf boxes is smaller than a prescribed value. For unevenly distributed observations, an adaptive tree structure could be created (Gimbutas and Rokhlin, 2003), in which smaller boxes are generated only where data is dense.
We number the boxes in all levels except level 0 by a Z-order curve as shown in Fig. 3 Figure 4: Illustration of the multipole-to-multipole translation operator: 1982). This facilitates easy formulae for box indexing as will be described later in this subsection. We call boxes neighbours if they are on the same level and connect to each other. In each level the near-field of a box b is made of itself and all its neighbours. The far-field of the box b is made of everything else. We use N b and F b to denote the lists of boxes that are in the near-field and far-field of box b, respectively. For example, the near-field of box 16 in Fig. 3 consists of the 9 shaded boxes, namely, N16 = {7, 10, 11, 13, 15, 16, 17, 18, 19}, and the far-field of box 16 is the other 7 boxes, i.e., F16 = {4, 5, 6, 8, 9, 12, 14}. The smallest number of boxes in a box's near-field is 4, which occurs when the box is on a corner of the observation domain.
In a box-tree the children of box b in level l, denoted by C b , refer to the four boxes in level l + 1 that are subdivided from itself, such as C4 = {20, 21, 22, 23}. In the Z-curve ordering, the indices of the children of box b are given by 4b + 4, 4b + 5, 4b + 6, and 4b + 7. The parent of box b, denoted by P b , refers to the box in a coarser level that contains box b. For instance, The interaction list of box b, denoted by L b , is the set of boxes which are children of the neighbours of box b's parents and which are in the far-field of box b. For example, L68 = {32, 33, 34, 44, 45, 48, 49, 50, 51, 56, 58, 64, 65, 66, 67, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83} and L16 = {4, 5, 6, 8, 9, 12, 14}. Note that the interaction list of a box in level 2 is exactly the same as its far-field.

The SVD-FMM algorithm
In this section we describe the multi-level SVD-FMM algorithm (Gimbutas and Rokhlin, 2003). For concreteness we choose a particular configuration of boxes (as shown in Fig. 3) to describe the algorithm, but it can be generalized to other configurations. For each box b we can write Eq. (4) as where I b , IN b and IF b denote the sets of observation indices in box b, N b and F b , respectively, and A that are comprised of specific rows and columns of A. We call q( The first matrix-vector product in Eq. (6) is calculated from the near-field components. It is computed directly without using any approximation. The second matrix-vector product in Eq. (6), containing the far-field components, is estimated by matrix decomposition, using a multi-level approach exploiting the hierarchical box-tree structure established in section 3.2. Before giving the mathematical details, we first give an overview of the algorithm.
The key idea of the multi-level approach is to perform the SVD of the sub-matrices of A given by A(I b , IF b ) and use the singular vectors and singular values just obtained to create a multipole expansion, a local expansion and three translation operators, and then use them to estimate the result. The multipole expansion can be interpreted as a short representation of the sub-vector of d given by d(I b ), which is obtained by projecting the components of d(I b ) onto the basis given by p singular vectors and hence, is a short vector with p components. The local expansion can be considered as the short representation of the sub-vector of d given by d(IF b ), which is also a vector with p components. It is computed from the multipole expansions of a group of boxes that are in b's interaction list using the translation operators. The translation operators allow exploitation of the hierarchical structure established in the box-tree, by transforming the projection from one basis of singular vectors into another basis of singular vectors. There are three kinds of translation operators: the multipole-to-multipole (M2M) translation operator transforms the multipole expansion of a box to the multipole expansion of its parent (see Fig. 4); the multipole-to-local (M2L) translation operator translates the multipole expansion of a box to the local expansion of another box in the same level (see Fig. 5); and the local-to-local (L2L) translation operator converts the local expansion of a non-leaf box to the local expansion of its children (see Fig. 6).
Level 3 Level l + 1 T L2L Figure 6: Illustration of the local-to-local translation operator: T L2L transforms the local ex- Due to the use of these expansions and translation operators, the SVD-FMM computes matrix-vector products more efficiently than the standard approach, which is reflected in both serial efficiency (algorithmic complexity) and parallel efficiency (see sections 3.5 and 3.6 for more details).
We now describe the mathematical steps of the SVD-FMM algorithm in more detail. A schematic illustrating the algorithm is given in Fig. 7.
Initialization: Compute the SVD of sub-matrices of A and the translation operators. We compute truncated SVDs of sub-matrices of A for each box b in level 2 and 3 The singular vectors and singular values are then used to generate multipole expansions and local expansions and a set of translation operators. The M2M translation operator (see Fig. 4) converts the multipole expansion of a child box into the multipole expansion of its parent and is defined as  and v src,b k that correspond to the i-th observation in box b . The L2L translation operator (see Fig. 6) transfers the local expansion of a parent box to its children and is defined as Step 1: Compute the multipole expansion for the leaf boxes. The multipole expansion for each leaf box is computed by where b = 20, · · · , 83 and k = 1, · · · , p.
Step 2: Compute the multipole expansion for the non-leaf boxes. The multipole expansion for each box in level 2 is computed by translating the multipole expansions from its children in level 3 using the T M 2M translation operator, where b = 4, · · · , 19, k = 1, · · · , p and b denotes the child of b.
Step 3: Compute the first part of the local expansion. For each box in level 2 and 3, we translate the multipole expansions from boxes in its interaction list into local expansions by Step 1: Compute multipole expansions (•) for leaf boxes (Eq. 12) Level 3 Step 2: Compute multipole expansions for boxes in level 2. Arrows show which multipole expansions in level 3 are used to compute which multipole expansion in level 2 (Eq. 13).
Level 3 Step 3: Compute local expansions (+) of each box in level 2 and 3 from multipole expansions (Eq. 14). The two figures show the computation of the local expansion for one box in each of level 2 and 3.
Level 3 Step 4: Transform local expansions from each box in level 2 to local expansions of their children (Eq. 15). Arrows show which local expansion in level 2 is transformed into which local expansion in level 3. Step 5: Complete the local expansion of leaf boxes by adding the local expansions from leaf boxes obtained in Step 3 and Step 4 together (Eq. 16). where b = 4, · · · , 83, k = 1, · · · , p and b is in the interaction list of b.
Step 4: Compute the second part of the local expansion. We transfer the local expansions for each box in level 2 to their children in level 3 using the T L2L translation operator: where b = 20, · · · , 83, k = 1, · · · , p and b is b's parent.
Step 5: Complete the local expansion for the leaf boxes. For each leaf box, the final local expansion is given by adding two parts together: where b = 20, · · · , 83 and k = 1, · · · , p. Final Step: Adding the far-field calculation to the near-field calculation. The final result for each leaf box (b = 20, · · · , 83) is obtained by adding the far-field calculation to the near-field calculation: where i ∈ I b . The near-field calculation is given by The far-field calculation is given by We note that the stepwise algorithm presented does not give a necessary order for the calculations. For example, the computation of local expansions for leaf boxes in Step 3 can be carried out once Step 1 is done.

Accuracy
In this section we discuss which factors affect the accuracy of the SVD-FMM. The first aspect to consider is the relative magnitude of the near-field and far-field terms in Eq. (6). The near-field term is computed directly, while the far-field term is approximated. Numerical roundoff error for the direct calculation of the near field term is expected to be small. Therefore, the main source of error for the SVD-FMM is from the far-field calculation. Thus, if the magnitude of the near-field term is much larger than the far-field term, the error in the final result due to the SVD-FMM approximation will tend to be small.
The accuracy of the far-field calculation is determined by the accuracy of the truncated SVD of the sub-matrices A(IF b , I b ) or A(I b , IF b ) in Eq. (7) and the accuracy of the translation operators given by Eqs. (8 -10). The truncation error in Eq. (7) is dependent on the number of singular vectors used (p) and the (p + 1)th singular value of the sub-matrices A(IF b , I b ) or A(I b , IF b ) (e.g., Bernstein, 2009, Fact 9.14.28). Additionally, Gimbutas and Rokhlin (2003) show that the error bounds on the translation operators rely on the (p + 1)th singular value of the sub-matrices. Thus, the optimal number of the singular vectors used in the SVD-FMM may depend on the application. It should be determined by considering the trade-off between accuracy and efficiency, as the more the singular vectors are used, the more accurate the results but the slower the computation.
The SVD-FMM algorithm uses truncated SVDs (Eq. (7)) and thus, may cause rank deficiency of the matrix A. Nevertheless, the Hessian of the full variational problem will still be full rank (e.g., Tabeart et al., 2021). In addition, the matrix A used in the solution of the variational problem is not exactly the inverse of the matrix R. However, only the matrix A (and not R) is used in the solution of the variational problem. Furthermore, studies have shown that even approximate forms of spatial observation error correlations provide significant benefits to analysis accuracy compared with diagonal approximations (e.g., Stewart et al., 2008Stewart et al., , 2013Healy and White, 2005).

Algorithmic Complexity
In this section we compare the number of floating point operations (flops) required for computing matrixvector products using the standard approach and the SVD-FMM. Let s = m b be the average number of observations in a leaf box (for b in the highest level), and B = m/s be the number of leaf boxes, where m is the total number of observations. Computing Eq. (11) for a fixed b and a fixed k requires 2m b operations (m b multiplication and m b add operations). Hence, calculating Eq. (11) for each value of k requires 2m b p operations. Finally, for each value of b, Eq. (11) requires 83 b=20 2m b p = 2mp operations. Similarly, Eq. (12) requires 2Bp 2 operations. Eq. (13) requires less than 2 × 27Bp 2 operations for leaf boxes and 2 × 12 × B/4 × p 2 operations for boxes in level 2, because there are at most 27 entries in the interaction list of each leaf box and 12 of each box at level 2. Eq. (14) requires 2Bp 2 operations. Eq. (17) requires at most 2 × 9ms operations, since the maximum number of boxes in the near field of each box is 9. Finally, Eq. (18) requires 2mp operations. Thus, the operation count of the entire algorithm (excluding the initialization step) approximately sums to 18ms + 4mp + 64Bp 2 or 18ms + 4mp + 64(m/s)p 2 .
The SVDs and translation operators only need to be computed once if the distribution of observations does not change. The singular vectors, singular values and translation operators can be used for any vector d. However, if the distribution of the observations varies too often, then the computational costs in the initialization step and the computational cost of inverting the observation error covariance matrix could make the algorithm very expensive.
For comparison, the direct matrix-vector multiplication of Eq. (1) requires 2m 2 operations. In data assimilation, q is often computed using a forward and backward substitution (Golub and Van Loan, 1996;Weston et al., 2014; which also requires O(m 2 ) operations. Fig. 2 compares the operation counts for direct computation and the SVD-FMM with our configuration of boxes and p = 10. We observe that once the number of observations exceeds 500, the SVD-FMM requires fewer floating point operations than direct matrix-vector multiplication, and the difference increases with the number of observations.

Parallelization
In this section we describe a novel parallel algorithm for the SVD-FMM. The FMM has several opportunities for parallelism (Greengard and Gropp, 1990) and our approach is not the only possibility. We have presented this section in order to provide a preliminary exploration of the potential of the SVD-FMM as a practical technique for operational data assimilation. However, we note that for our experimental results (section 5) we have not used this parallel algorithm. The number of observations considered in our idealized experiments is much smaller than in operational applications, so that the serial calculations can be done within an acceptable time.
We should first distribute the matrix and vector elements across PEs according to the partitioning of observations in the box-tree. For each leaf box b, we should assign the sub-vector of d given by d(I b ) and the sub-matrix of A given by A(I b , I b ∪ IF b ) to one PE. For each box b in level 2, we should allocate the sub-matrix of A given by A(I b , IF b ) to one PE. For our particular configuration of boxes, we have 64 leaf boxes and 16 non-leaf boxes in level 2, hence we could use 80 PEs. However, to make our parallelization scheme easily comparable with the parallel formulations of matrix-vector multiplication given in sections 2.1-2.4, we actually discuss the case where we choose 16 PEs out of 64 and let them store the data for level 2 boxes. We describe a possible parallelization of each mathematical step of the SVD-FMM as follows: • Parallelization of the Initialization Step.
-Each PE can calculate Eq. (7) independently. Once the singular vectors and singular values are obtained, the sub-matrices stored on each PE can be discarded.
-To compute Eq. (8), each PE that is assigned to a parent box should send v src k to three PEs that are assigned to its children. The calculation of the M2M translation operators would require 16 one-to-all broadcast operations to be performed simultaneously. The message size for each operation should be mp/16 and 4 PEs should be involved.
-To compute Eq. (9), each PE should send a portion of v tgt k to at most 27 PEs, because the maximum number of boxes in an interaction list is 27. The calculation of the M2L translation operators should require an all-to-all broadcast with the message size of mp/B.
-To compute Eq. (10), each PE should send a portion of v tgt k to the PE that is assigned to its parent. The computation of the L2L translation operators would require 16 all-to-one reduction operations to be carried out in the same time. The message size for each operation should be mF b p for b being a level 2 box. The communications required for calculating three translation operators can be done together using an all-to-all broadcast. After the initialization step, each PE that is assigned only to a leaf box will store v src k , s k , T M 2M and T M 2L , and those assigned to both a leaf box and a level 2 box will store T M 2L and T L2L .
• Parallelization of Step 1. This step is perfectly parallel.
• Parallelization of Step 2. Each PE should compute T M 2M · Φ and then send the result to the PE that is assigned to its parent. Computing the multipole expansions for level 2 boxes would require an all-to-one reduction. The message size for this communication operation should be p.
• Parallelization of Step 3. Each PE should compute T M 2L · Φ and then send the result to another PE. Each PE should collect the partial result of Eq. (13) from at most 27 PEs. This step would need an all-to-all broadcast with a message size of p or 2p.
• Parallelization of Step 4. Each PE that is assigned to a box in level 2 should compute T L2L · Φ and then send the result to the PEs that are assigned to its children. This step would use an one-to-all broadcast with a message size of p.
• Parallelization of Step 5. This step is perfectly parallel.
• Parallelization of final step. The far-field calculation for each leaf-box is perfectly parallel. For the computation of the near-field calculation, each PE should obtain the elements of d from at most 8 PEs, which is the maximum number of one box's neighbours. This would require an all-to-all broadcast. The average message size for this operation is m/B. We can use one all-to-all broadcast to complete the communication tasks for this step and step 3. Table 1 summarises the communication costs for the SVD-FMM and four parallel formulations of the matrix-vector multiplications described in section 2. The major advantage of using the SVD-FMM is that the message size for each communication operation is dramatically reduced.
In the proposed parallelization scheme, we suggested using B PEs and letting one of every four PEs conduct the computations for boxes in level 2. This could be particularly useful if the supercomputer configuration has non-uniform memory access (NUMA) nodes with locally shared memory (Grama et al., 2003). In this case, we could avoid the remote communications required in Step 2, Step 4 and similar steps in the Initialization. Alternatively, we could assign the tasks for level 2 and level 3 to different PEs, i.e., using 80 PEs for our configuration of boxes. This could allow each PE to carry out a similar amount of work. More generally, we note that the practical implementation of the given parallelization scheme should be adjusted to suit the configuration of the supercomputer.

Parallelization Scheme Communication Operation # PEs Message Size Communication Time
Row-wise all-to-all broadcast B m/B ts log B + twm Column-wise all-to-one reduction all-to-all broadcast B ≈ m/B < ts log B + twm all-to-all reduction B ≈ m/B < ts log B + twm SVD-FMM all-to-one reduction 4 p (ts + twp) log(4) all-to-all broadcast B p, 2p or m/B < ts log B + twm one-to-all broadcast 4 p (ts + twp) log(4)

Experimental Design
Our numerical experiments are designed to demonstrate the potential of applying the SVD-FMM to compute the matrix-vector products involved in operational data assimilation. We investigate the accuracy of using the SVD-FMM under different scenarios that may occur in practical applications. We conduct serial calculations rather than using parallel computing because in this initial study we have chosen to study idealized problems of moderate size as this is sufficient for our goals. For this initial study, we have not included matrix inversion as part of the SVD-FMM algorithm. However, to obtain our experimental results we require an inverse observation error covariance matrix. We use the INV function in MATLAB (MATLAB R2020b, a) to compute the inverses of R and reconditioned R. The generation of the R matrix is described in section 4.2 and reconditioning methods are given in section 4.3. The INV function uses an LDL decomposition (Golub and Van Loan, 1996). The required SVDs of sub-matrices of R −1 (denoted A) are computed using the SVDS function in MATLAB (MATLAB R2020b, b), which uses a Lanczos method and is efficient for finding a few singular values and vectors of a large matrix (Larsen, 1998;Baglama and Reichel, 2005).
In the results section (section 5) we use the log of root-mean-squared error (RMSE) to assess the accuracy of SVD-FMM, which is defined as where superscript f mm denotes the matrix-vector product computed using the SVD-FMM and ref denotes the matrix-vector product obtained by the standard approach. Note that the log(RMSE) shown in section 5 is averaged over 100 realizations of q f mm i , where each one uses a different d.

Observation distribution and observation-minus-model departures
To simulate observation data, we assume our observations are regularly distributed over a region from 54 • to 60 • N and 6 • W to 6 • E with a grid-length of 12 km. This is similar to moderately thinned geostationary satellite data used in operational forecasting over the UK (Waller et al., 2016b). This results in 3456 observations and an error covariance matrix of size 3456 × 3456. For convenience, we directly sample the observation-minus-model departure vector d from a Gaussian distribution with mean zero and (innovation) covariance given by R plus background error covariance. The background error covariance is modelled using the SOAR correlation function with a lengthscale of 20 km and a standard deviation of 0.6 (section 4). The correlation lengthscale and standard deviation are selected according to Ballard et al. (2016) to be appropriate for km-scale numerical weather prediction. The results presented in section 5 have been averaged over 100 realizations of d.
In practice, the observation distribution may be different at each assimilation cycle because of factors such as quality control. Therefore, in some of our experiments we have also applied the SVD-FMM to compute the matrix-vector product with randomly chosen missing observations.

Modelling of the observation error covariance matrix
Modelling observation error covariance matrices using correlation functions is an useful approach to deal with observation error correlations (e.g., Stewart et al., 2013;Tabeart et al., 2018). Here we use several correlation functions to model the observation error covariance matrices. The first is the Gaussian correlation function (e.g., Haben, 2011), where l > 0 is the correlation length scale and ri,j denotes the great-circle distance between two observations. The second is the first-order auto-regressive (FOAR) correlation function, also called the Markov correlation function (e.g., Stewart et al., 2013), We also use the SOAR correlation function (e.g., Daley, 1994;Tabeart et al., 2018), and the Matérn 5/2 correlation function, (e.g., Rasmussen and Williams, 2006), The observation error covariance matrix can be generated using the correlation functions by where D is a diagonal matrix whose diagonal elements are a prescribed standard deviation. For our experiments, we choose the standard deviation as one and the correlation lengthscales as l = 80, 160, 240 km. These correlation lengthscales are selected based on the horizontal error correlations estimated for geostationary satellite data used in operational km-scale data assimilation (Waller et al., 2016b) and for AMVs (Cordoba et al., 2017). It should be noted that the Markov and Matérn 5/2 correlation functions may lead to a sparse A, in which case the SVD-FMM may not be optimal to use. We use these correlation functions in our experiments because we wish to show that the SVD-FMM can be applied to a variety of matrices.

Reconditioning of the observation error covariance matrix
In practical applications, observation error covariance matrices are often ill-conditioned (Tabeart et al., 2020b;Haben et al., 2011a). Moreover, if the matrix R has a large condition number, this can lead to poor convergence of the minimisation problem in variational data assimilation (Weston et al., 2014;Tabeart et al., 2018Tabeart et al., , 2020a. In our applications, the condition number of the matrices that are created using correlation functions is dependent on the chosen function and correlation lengthscale (Haben, 2011;Haben et al., 2011b). We use two common matrix reconditioning techniques to reduce the matrix condition number in our experiments: the ridge regression method and the minimum eigenvalue method (Tabeart et al., 2020b;Weston et al., 2014;Campbell et al., 2017;Bormann et al., 2016). In the ridge regression (RR) method the reconditioned matrix is given by where I is the identity matrix and where λmax is the maximum eigenvalue of R, λmin is the minimum eigenvalue of R and κreq is the required new condition number. The minimum eigenvalue (ME) method changes the eigenvalue spectrum of the matrix R by setting all eigenvalues smaller than a threshold value to the threshold value. The threshold value (T ) is given in terms of the required condition number as The reconditioned matrix is then constructed via where E is a square matrix whose columns are the eigenvectors of R and ΛME is a diagonal matrix with the diagonal elements being the new eigenvalues. We carried out an experiment to assess the accuracy of the SVD-FMM, as the number of singular vectors (p) changes. As discussed in section 3.4, we expect the accuracy to depend on both p and the (p + 1)th singular values for each sub-matrix. Fig. 8 provides the log(RMSE) for the SVD-FMM with the FOAR and SOAR covariance matrices as a function of p. As anticipated, the log(RMSE) decreases, and hence the accuracy increases, as more singular vectors are used. Moreover, Fig. 9 demonstrates that the log(RMSE) for the SVD-FMM using p singular vectors has an approximately linear relationship with the log of the mean (p + 1)th singular value of the sub-matrices, which is given by

The number of singular vectors and the size of singular values
for p = 1, · · · , 9 and b = 4, · · · , 83. Additionally, we note that the SVD-FMM with the FOAR covariance matrix is more accurate than with the SOAR covariance matrix for all the values of p considered. This is because the mean (p + 1)th singular value of the sub-matrices of the FOAR matrix is consistently smaller than that of the SOAR matrix (Fig. 9). We have also carried out several other experiments using different correlation lengthscales, different correlation functions and reconditioned covariance matrices and found similar qualitative results: the accuracy of the SVD-FMM using p singular vectors relies on the (p + 1)th singular values of the sub-matrices.  Figure 9: The log(RMSE) for the SVD-FMM using p singular vectors against the log of mean (p + 1)th singular value of the sub-matrices of the inverse FOAR and SOAR covariance matrices used in Fig. 8.

SOAR Covariance Matrix
These results using the mean singular values of the sub-matrices provide some guidance as to how to set the value of p in a given application. However, as they require the computation of the SVDs of all of the relevant sub-matrices of A, they are time-consuming to compute. Different observations can exhibit correlated errors over different lengthscales. To examine how correlation length affects the accuracy of the SVD-FMM, we use three correlation lengthscales for the SOAR correlation function: l = 80, 160 and 240 km. Fig. 10 reveals an increase of the log(RMSE) of the SVD-FMM with correlation lengthscale. Nevertheless, we still only need a few singular vectors to obtain a small log(RMSE). However, if we desire to obtain a given accuracy, we may need to use a larger value of p for longer lengthscales. In addition, we note that the magnitude of the elements of A increases as correlation lengthscale increases. For a smaller lengthscale, the far-field elements of A are close to zero and could be neglected to give a sparse matrix. Moreover, compared to the SOAR correlation function, the Matérn and Markov functions are more likely to give a sparse A if the correlation lengthscale is small.

Varying correlation lengthscale
The variation of the accuracy of the SVD-FMM with correlation lengthscale can also be explained by the singular values of the sub-matrices. We find in our numerical experiments (not shown) that the singular values of the sub-matrices are smaller when the correlation lengthscale is smaller. It is known that the maximum singular value of the full matrix (inverse SOAR covariance matrix) also decreases as the correlation lengthscale reduces (Haben, 2011, section 5.3.2). Therefore, we investigated whether it would be possible to use the singular values of the full matrix to estimate the accuracy of the SVD-FMM. Thompson (1972) showed that the singular values of the sub-matrices are bounded by the singular values of the full matrix, i.e., σi(B) ≤ σi(A), i = 1, 2, 3, . . . , min{α, β} where B ∈ R α×β denotes a sub-matrix of A, σi(B) denotes the i-th singular value of B and σi(A) denotes the i-th singular value of A. Equality can be obtained for some choices of B. However, since the dimensions of the sub-matrices used in SVD-FMM are much smaller than the dimensions of the full matrix, our numerical experiments showed that σi(B) is typically much smaller than σi(A) in practice. We compared the maximum singular values of different full matrices numerically and found that they could provide a rough guide to relative accuracies: for a given value of p, the best accuracy was obtained for the full matrix with the smallest maximum singular value.  Figure 11: As Fig. 8, but matrices A are given by inverting reconditioned SOAR covariance matrices with correlation lengthscale l = 80 km. The ridge regression (RR) and minimum eigenvalue (ME) methods are used to reduce the condition number of SOAR covariance matrix to target values (κ req ).

Reconditioned covariance matrices
The observation error covariance matrices used in data assimilation procedures are often ill-conditioned and thus, in order to invert them we need to reduce their condition numbers using proper techniques (Tabeart et al., 2020b). Reconditioning the covariance matrices may also improve the convergence behaviour of the minimisation procedure (Weston et al., 2014;Tabeart et al., 2020aTabeart et al., , 2021. We evaluate how reconditioning affects the accuracy of the SVD-FMM; we reduce the condition number of the SOAR covariance matrix with l = 80 km to three required new condition numbers (κreq = 1000, 2000 and 3000) using the RR and ME methods. Fig. 11 shows that reconditioning the SOAR correlation matrix can improve the accuracy of SVD-FMM; the smaller the condition number of the SOAR covariance matrices after reconditioning, the better the accuracy.
We also find that the RR method gives smaller log(RMSE) than the ME method for the same required condition number. This difference is caused by the different ways the two reconditioning methods reduce the condition number, which leads to different size singular values of the full inverse covariance matrices satisfying: σmax(ARR) < σmax(AME), where σmax(ARR) and σmax(AME) denote the maximum singular value of the reconditioned matrices obtained using the RR and ME methods, respectively. Combing Eq. (30) and Eq. (31), the leading singular values of the sub-matrices of ARR are expected to typically be smaller than the leading singular values of the sub-matrices of AME. Hence, the SVD-FMM with reconditioned matrices using RR method will usually have smaller error. A derivation of Eq. (31) is presented in Appendix.  Figure 12: As Fig. 8, but matrices A are given by inverting reconditioned Gaussian, FOAR, SOAR and Matern covariance matrices with a correlation lengthscale of 80 km. All covariance matrices are reconditioned to have a new condition number of 1000 using the RR method prior to inversion.

Varying correlation functions
To provide more numerical evidence for the wide applicability of the SVD-FMM, we use other correlation functions than SOAR and FOAR to create covariance matrices. To allow for comparison with Fig.  11 we use the RR method to reduce the condition number of all the matrices to 1000 before inverting them. Fig. 12 shows that the SVD-FMM can work well with the inverses of the matrices given by different correlation functions. We note that although the condition number of each matrix is identical after reconditioning, the accuracy of the SVD-FMM still relates to the original condition numbers; the larger the condition number prior to reconditioning, the greater the log(RMSE). In our experiments the FOAR correlation matrix has a condition number on the order of thousands, while the Gaussian correlation matrix is extremely badly conditioned and has a condition number on the order of 10 15 prior to reconditioning. By comparing Fig. 12 with Fig. 11, we also observe that reducing the condition number of the Gaussian correlation matrix to 1000 using the RR method gives larger log(RMSE) than reducing the condition number of the SOAR correlation matrix to 3000 using the same method. Therefore, to acquire a comparable accuracy using the SVD-FMM with the same value of p, we may need to reduce the condition number of two matrices to different values. The matrix with a larger condition number prior to reconditioning may need a smaller condition number after reconditioning.

Removing a portion of observations
In operational data assimilation, the number of observations varies each assimilation cycle due to quality control, among other reasons. This will lead to a decrease of the dimension of covariance matrix and disrupt the structure of the matrix. The resultant covariance matrix lacks some of the rows and columns of the original one. The missing observations may cause a problem if a leaf box becomes empty or contains fewer observations than p. This could be solved by using a different partition of observations. In our experiment, we randomly select the locations of the missing observations, and with up to 25% missing observations, every leaf box always contains enough observations. Fig. 13 shows that the SVD-FMM can still work well with missing observations without losing accuracy. In our experiments using an inverse SOAR covariance matrix, we find that the missing observations actually lead to a slight decrease in the log(RMSE) compared to the full set of observations. The difference in the log(RMSE) between removing 10% of the observations and removing 25% of the observations is not statistically significant.

Conclusion and discussion
Some observations have been shown to exhibit strong spatial error correlations, e.g., Doppler radar radial winds, geostationary satellite data, and AMVs. Accounting for this information in data assimilation systems can improve analysis accuracy and forecast skill. However, it can make the computation of the products of observation weighting matrices and observation-minus-model departure vectors very expensive, in terms of not only the computational complexity, but also the communication costs in parallel computing.
We have therefore investigated the use of the SVD-FMM for the rapid computation of these matrix-vector products. This numerical approximation method is best suited for full or dense precision matrices. If the observation weighting matrix is sparse it might be appropriate to only compute the near-field calculation. Moreover, the SVD-FMM is suitable for use for large problems, when the overhead of computing the SVD-FMM expansions is outweighed by the reductions in floating-point operations and communication costs compared with alternative approaches, and where each PE has enough memory to store its part of the error covariance matrix of these observations. The exact range of the number of observations that makes the SVD-FMM useful depends on the computer architecture and the constraints of the operational schedule.
We proposed a novel possible parallelization scheme for the SVD-FMM in our application. In comparison to the parallel formulations of direct matrix-vector multiplication (section 2), the parallelization scheme for the SVD-FMM largely reduces the size of the messages transferred. This reduces communication costs, making the use of full observation error covariance matrices associated with large spatial extents potentially feasible in operational data assimilation.
In our idealised experiments we have examined the accuracy of the SVD-FMM, in terms of applying it to the inverses of various observation error covariance matrices. These matrices are created using commonly used correlation functions, such as Gaussian and SOAR correlation functions, and different correlation lengthscales. In some of our experiments, we have applied reconditioning methods to the covariance matrices before inverting them. We have also investigated the feasibility of the SVD-FMM with randomly chosen missing observations. We find consistent results as discussed in section 3.4: i) the accuracy of the SVD-FMM increases as more singular vectors are used and ii) the variation of the accuracy is approximately linear with the mean spectrum of singular values of the sub-matrices used in the approximation. Furthermore, our experiments indicate that a comparison of the maximum singular values of the full inverse matrices themselves can be used as a rough guide to determine which matrices will give more accurate results with the SVD-FMM.
The most computationally expensive parts of our current implementation of the SVD-FMM are the inversions of covariance matrices and the SVDs of the sub-matrices of inverse matrices. They need to be reperformed every time the observation error covariance matrix is changed. This happens when the number of observations or the distribution of observations are changed. Nevertheless, our work has shown that the SVD-FMM has potential for use in operational data assimilation for fast computation of the products of the inverse observation error covariance matrices and observation-minus-model departure vectors. This will allow a large volume of observational data to be assimilated within a short time interval, which is particularly important for applications such as convection-permitting hazardous weather forecasting.
The reciprocal of the minimum eigenvalue of a matrix is the maximum eigenvalue of its inverse and hence we have λmax(ARR) = 1 λmin(RRR) < 1 λmin(RME) = λmax(AME).