Scalability analysis of a two-level domain decomposition approach in space and time for solving data assimilation models

We are concerned with the mapping onto high performance hybrid architectures of parallel software implementing a two-level overlapping domain decomposition, that is, along the space and time directions, of the four-dimensional variational data assimilation model. The reference architecture belongs to the SCoPE (Sistema Cooperativo Per Elaborazioni scientifiche multidisciplinari) data center, located at the University of Naples Federico II. We consider the initial boundary value problem for the shallow water equations and analyse both strong and weak scaling. Keeping the efficiency always greater than 60%, and about 90% in most cases, we experimentally find that the isoefficiency function grows a little more than linearly with respect to the number of processes. The results, obtained by using the Parallel Computing Toolbox of MATLAB R2013a, are in agreement with the prediction of the algorithm's performance based on the scale-up factor, confirming the appropriate mapping of the algorithm onto the hybrid architecture.

When the forward model takes the form of partial differential equations (PDEs) or some other expensive model, the result is a PDE-based variational problem that may be extremely large scale in the state variables, even when the number of inversion parameters is small. More generally, the uncertain parameters can be taken from a continuum (such as initial or boundary conditions, heterogeneous material parameters, or heterogeneous sources) that, when discretized, yields an inverse problem that is very large scale in the inversion parameters as well. Estimating the parameters via the regularization approach to inverse problems described above yields an estimate of the "best" parameter values, those that minimize the combined misfit and penalty function.
In this article, we are concerned with the parallel implementation on high performance hybrid architectures of PDE-based variational problems modeling four-dimensional variational data assimilation (DD-4DVAR).1,2 The majority of parallelization techniques first decompose the domain into nonoverlapping subdomains and then extend each subdomain with halo regions to enable stencil operations in a parallel context. For instance, the parallelization techniques applied to the Regional Ocean Modeling System (ROMS)3,4 and to the Nucleus for European Modelling of the Ocean (NEMO)5,6 needed halos of width two (or three for the advection scheme) and one grid node, respectively. As a consequence, the solution may suffer from local instabilities. Instead, one of the main features of DD-4DVAR is the introduction of a sort of regularization constraint into the decomposition approach: in order to enforce the matching of the local solutions on the overlapping regions, the local DA functionals are slightly modified by adding a correction term. Such a correction term balances localization errors and computational effort, acting as a regularization constraint on the local solutions.9,10 As is known, the appropriate mapping of any parallel algorithm depends upon both the specification of the algorithm and the underlying architecture.
Many scientific groups are working on workload distribution, data distribution, communication performance models, and the optimization of communication operations of parallel or distributed algorithms.11,12 One common issue of these approaches is that, very often, the model used for the representation and analysis of the algorithm cannot be explicitly employed to derive expressions for the performance metrics of the software. Furthermore, performance analysis is often carried out on the combination of the algorithm and the parallel architecture on which it is implemented (the so-called parallel system), exploiting automated mappings, automatic translations, and auto-tuning tools. The main contribution of the present work is to employ a different approach to performance evaluation. In particular, in Reference 2 the authors used the scale-up factor to analyze the capability of the two-level domain decomposition approach, that is, along the space and time directions, to be implemented concurrently. As a hybrid computing architecture was used, in order to facilitate shared/distributed communication of the cores/blades processing units, we assign nearby domains that communicate often to shared memory. This means that the DD in space is implemented at the finest level of the computer architecture (the shared memory cores), while the DD in time is implemented at the first level of the decomposition graph, requiring inter-node communications. The predictive performance study carried out in Reference 2 proved that the scale-up factor of the parallel algorithm should grow to about 1.5. The software performance analysis employed in this work, instead, mainly focuses on the mapping onto the reference hybrid architecture. Besides supporting the predictive performance study, this analysis highlights the software communication overhead due to the two-level domain decomposition approach. This overhead is essentially introduced by the communication (across time) among the distributed memory processing units occurring at the highest level of the domain decomposition (see Figure 5). Hence, the software communication overhead is obtained by measuring the gap between the scale-up factor and the isoefficiency function.13
The article is organized as follows. In Section 2, we start from the high-level definition of the DA problem in terms of a set of mathematical equations, subject to boundary and initial conditions, over a space-time domain. In Section 3, we briefly review the two-level overlapping domain decomposition approach and its numerical discretization; we also explain why the domain decomposition is named two-level: it is performed first along the time direction and then along the space direction. The resulting solution algorithm is analyzed in Section 3.1 to determine whether the combination of the mesh and its discretization yields the desired accuracy. Finally, in Section 4, the software scalability on the shallow water equations is discussed in detail.
Conclusions are presented in Section 5.

THE DA PROBLEM
Let there be given a symbolic description of the model of interest, where the state function of the model has p v ∈ N physical variables and f is a known function defined on the boundary of Ω; let there also be given the observations function and the nonlinear observations mapping. To simplify future treatments we assume p v ≡ 1.
The observations y are, in general, not direct measurements of the state variable u; the observation operator allows one to compare the observations vector with the state vector. We define the DA inverse problem corresponding to the discretization of (1).2,14
Definition 1. The DA problem concerns the computation of the state subject to the given constraint, where N p is the number of nodes x j in Ω ⊂ R n ; n obs is the number of observations in Ω, with n obs << N p ; N is the number of instants of time in Δ; the matrix in R n obs ×N represents the observations in Ω × Δ; and H l ∈ R n obs ×N p , l = 0, … , N − 1, is a linear approximation of the observation mapping, defined for each l as follows.
Definition 2. The 4DVAR DA problem concerns the computation of the minimizer, where the given vector, representing the numerical solution of the model, is the background; the scalar is the regularization parameter; R ∈ R N⋅n obs ×N⋅n obs and B = VV T ∈ R N p ×N p are, respectively, the covariance matrices of the errors on the observations and on the background. Finally, it is
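To make Definition 2 concrete, the following sketch evaluates a 4DVAR-type cost functional, combining a background (regularization) term and observation misfit terms accumulated along the model trajectory. This is a minimal Python illustration under simplifying assumptions (dense matrices, a generic one-step model `M_seq`); the paper's experiments use MATLAB, and the interface here is hypothetical, not the authors' code:

```python
import numpy as np

def fourdvar_cost(u0, ub, y, H, M_seq, B_inv, R_inv):
    """Sketch of a 4DVAR cost functional (hypothetical interface).

    u0    : candidate initial state, shape (N_p,)
    ub    : background state, shape (N_p,)
    y     : list of N observation vectors, y[l] of shape (n_obs,)
    H     : list of N linearized observation operators, H[l] of shape (n_obs, N_p)
    M_seq : one-step model propagator: u_{l+1} = M_seq(u_l, l)
    B_inv : inverse background-error covariance, shape (N_p, N_p)
    R_inv : inverse observation-error covariance, shape (n_obs, n_obs)
    """
    d = u0 - ub
    cost = 0.5 * d @ B_inv @ d           # background (regularization) term
    u = u0
    for l in range(len(y)):
        r = H[l] @ u - y[l]              # innovation at time instant l
        cost += 0.5 * r @ R_inv @ r      # observation misfit term
        u = M_seq(u, l)                  # propagate state to the next instant
    return cost
```

With identity covariances and an identity model, the functional reduces to half the squared distances from background and observations, which makes the two terms of Definition 2 easy to check by hand.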

TWO LEVEL OVERLAPPING DOMAIN DECOMPOSITION APPROACH
The main approaches for delivering scalable solutions of simulations based on DA methods integrated with a PDE-based model essentially take full advantage of existing parallel PDE solvers, in particular those based on domain decomposition (DD) methods in space,15 where the DD solver is suitably modified to also handle the adjoint system. While this scheme is efficient, it has limited scalability, due to the strong synchronization between the PDE integration and the DA solver. Parallel-in-Time (PinT) approaches (the literature is huge; we cite only References 16-18) provide a new avenue to achieve scaling on new-generation computing environments. The core of our PinT-based algorithm is that the PDE and DA models are tightly coupled so that, once the whole space-time domain is decomposed (by using PinT-based approaches for decomposing the PDE model and variational calculus for decomposing the DA functional), the DA model acts as a coarse solver/predictor of the local PDE model, providing the background values as initial conditions of the local PDE model. Moreover, in contrast to other PinT-based approaches, in our approach the local solvers run concurrently, so that the resulting algorithm only requires the exchange of boundary conditions between adjacent subdomains. In this way it leads to an "adaptive composition of local solvers" in which, following a tree configuration, the customization varies from sub-problem to sub-problem. Finally, it is worth noting that the proposed approach is non-intrusive, allowing the incremental transition of existing software (for instance, the Regional Ocean Modelling System, ROMS).19 As depicted in Figure 1, the proposed approach consists in decomposing the computational domain Ω × Δ into subdomains in space and time and solving reduced forecast models and local 4DVAR DA problems, as described in Reference 2.
In particular, the modules of the DD method proposed in Reference 2 are: domain decomposition of Ω × Δ, model reduction, the additive Schwarz method (ASM),20 the DD-4DVAR local solution, and the DD-4DVAR global solution. The flowchart of the DD-4DVAR algorithm is reported in Figure 2.

The algorithm
In the following, we describe each module of the DD-4DVAR algorithm schematically depicted in Figure 2.
DD. The DD of Ω × Δ consists in the decomposition of Ω ⊂ R n into a sequence of subdomains Ω i such that: and in the definition of the set of indices of the subdomains which are adjacent to Ω i , together with its cardinality. We have where ad i is the number of subdomains adjacent to Ω i . For i = 1, … , N sub , we define the overlap regions Ω ij as and the interfaces of Ω i as

F I G U R E 1 Graphical description of the domain decomposition approach both in space (along the vertical axis) and in time (along the horizontal axis). For clarity, we describe the space domain as a 1D interval.

The decomposition of Δ ⊂ R consists of a sequence of intervals Δ k such that: Consequently, we refer to these as local domains. Consistently with the space decomposition, we define the DD of Ω i : it consists in the identification of the inner nodes of Ω i . Finally, for i = 1, … , N sub we introduce the set of indices of the nodes belonging to the overlap region, together with its cardinality. As a consequence, for i = 1, … , N sub the cardinality of I i is the number of inner nodes of Ω i . In Figures 3 and 4, we depict two examples of the DD of Ω, in one dimension and in two dimensions, respectively.
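As a minimal illustration of the space decomposition just described, the following Python sketch splits a 1D index set into overlapping subdomains. It assumes a uniform splitting and a symmetric overlap width; this is not the authors' implementation, only a sketch of the index bookkeeping:

```python
import numpy as np

def decompose_1d(n_points, n_sub, overlap):
    """Split indices {0, ..., n_points-1} into n_sub overlapping subdomains.

    Each base block is extended by `overlap` nodes on every interior side,
    mimicking the overlap regions Omega_ij. Returns half-open (start, stop)
    index ranges, one per subdomain.
    """
    base = np.array_split(np.arange(n_points), n_sub)
    ranges = []
    for i, blk in enumerate(base):
        lo = blk[0] - (overlap if i > 0 else 0)          # extend left if interior
        hi = blk[-1] + 1 + (overlap if i < n_sub - 1 else 0)  # extend right
        ranges.append((max(lo, 0), min(hi, n_points)))
    return ranges
```

For example, `decompose_1d(10, 2, 2)` yields the ranges `(0, 7)` and `(3, 10)`: the indices 3-6 form the overlap region shared by the two subdomains, analogous to Ω 12 in Figure 3.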
F I G U R E 3 Decomposition of the domain Ω in one dimension. The overlap region is Ω 12 and the interfaces are Γ 12 and Γ 21 . On the left  = 0, that is, no inner nodes are in Ω 12 , while on the right  = 2, that is, there are two inner nodes in the overlap region Ω 12 .
F I G U R E 4 Decomposition of the domain Ω ⊂ R 2 in four subdomains {Ω i } i=1,2,3,4 , identifying for i = 1, 2, 3, 4 the overlap regions Ω ij and the interfaces.

The DD of Δ k , for k = 1, … , N t , consists of the identification of the time values in {Δ k } k=1, … ,N t . We introduce the corresponding index set. If K is the set containing the indices of the inner points of Δ, then we also introduce the cardinality of K k . Finally, for i = 1, … , N sub and k = 1, … , N t the local discrete domains are defined. In order to compute the local solutions in the local domains, we introduce the following notation.
that is, this is a vector defined in Ω i × Δ k .Now, we are able to define the restriction and extension operator (EO).
Definition 3 (Restriction operator). For i = 1, … , N sub we define the restriction matrices RO i and RO ij in Ω i and Ω ij , respectively. Given x ∈ R N p ×N and z ∈ R N p ×1 , we define the restriction of x to Ω i : and the restriction of z to Ω ij : where I i and I ij are, respectively, the sets of indices of the inner nodes in Ω i and Ω ij , ∀j ∈ J i .
where R T i is the transpose of R i in (18) and EO(x) ≡ x EO .
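The restriction and extension operators of Definition 3 can be sketched with 0/1 matrices: the restriction matrix picks out the entries indexed by a subdomain's inner nodes, and its transpose extends a local vector back to the global domain by zero-padding. A Python illustration (in practice one would use index arrays rather than explicit matrices):

```python
import numpy as np

def restriction_matrix(idx, n):
    """Build the 0/1 restriction matrix R_i selecting the entries `idx`
    from a length-n global vector. R_i @ x restricts x to the subdomain;
    R_i.T @ x_loc extends a local vector to the global domain by zeros."""
    R = np.zeros((len(idx), n))
    R[np.arange(len(idx)), idx] = 1.0
    return R
```

This mirrors the identity R i x = x restricted to I i , and EO(x_loc) = R iᵀ x_loc for the extension operator.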

Model reduction.
We use a reduced model to define the coarse solver. Let the parameter n denote the outer-loop counter, as described in Figure 2. We consider the local problems (P n i,k ) i=1, … ,N sub ,k=1, … ,N t : where u n i,k and b n i,k are, respectively, the background in Ω i × Δ k and the vector accounting for the boundary conditions. Such data depend on the discretization scheme used. Finally, given the first indices of Δ K k and Δ K 1 , we let the restriction of M k to Ω i be defined. Let the local 4DVAR DA model be given, with the overlapping operator in Ω ij and the restriction of J to Ω i × Δ k . The parameters  i,k and  j in (26) denote the regularization parameters. In the following, we let  i,k =  j = 1, for j ∈ J i .
ASM: as in any PinT approach, it is important that the coarse solver be accurate enough so that the iterations converge rapidly. To this aim, we employ the ASM20 to compute the solution of (P n i,k ) i=1, … ,N sub ,k=1, … ,N t . The gradient of J i,k is14: where I i is the identity matrix and ad i is the number of subdomains adjacent to Ω i defined in (9). The solution of (P n i,k ) i=1, … ,N sub ,k=1, … ,N t is obtained by requiring ∇J i,k (w n i,k ) = 0. This requirement gives rise to the linear system in (29). For each n, the r.h.s. of (29) depends on the unknown values w n j,k defined in the subdomains Ω ij adjacent to Ω i , where j ∈ J i . Hence, for solving the system in (29), we introduce the inner loop related to the minimization of the local 4DVAR model defined in (24). Let the parameter r denote the inner-loop counter, as described in Figure 2. For r = 0, 1, … we solve the linear system in (31), where, at each step r + 1, Ω i receives w r,n j,k from Ω ij , with j ∈ J i , to compute the r.h.s. of (31); then it computes w r+1,n i,k by using the conjugate gradient (CG) method; and finally it sends w r+1,n i,k to Ω ij , where j ∈ J i , to update the r.h.s. of (31) needed for the next iteration. w 0,n i j ,k is an arbitrary initial value. Finally, we pose the DD-4DVAR local solution in Ω i × Δ k . Solution update, using (19) and (28): DD-4DVAR global solution in Ω × Δ: the numerical solution of DD-4DVAR in Ω × Δ at step n is assembled from the extensions (u n i,k ) EO to Ω × Δ of the DD-4DVAR numerical solutions computed in the local domains.
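The exchange-then-local-solve structure of this inner loop can be illustrated on a generic symmetric positive definite linear system with a damped additive Schwarz iteration. This Python sketch is only an analogue of the loop around (29)-(31): the damping factor, the exact local solves, and the test matrix are our simplifications, not the DD-4DVAR system itself:

```python
import numpy as np

def additive_schwarz(A, b, ranges, n_iter=100):
    """Damped additive Schwarz iteration for A x = b.

    `ranges` are overlapping half-open (start, stop) index windows. At each
    sweep, every subdomain solves its local restriction of the residual
    equation exactly, and the local corrections are summed and damped,
    mimicking the exchange of overlap values followed by local solves.
    """
    x = np.zeros_like(b)
    for _ in range(n_iter):
        r = b - A @ x                     # global residual (the "exchange")
        dx = np.zeros_like(b)
        for lo, hi in ranges:
            idx = np.arange(lo, hi)
            A_loc = A[np.ix_(idx, idx)]   # restriction of A to the subdomain
            dx[idx] += np.linalg.solve(A_loc, r[idx])  # local solve
        x = x + 0.5 * dx                  # damping ensures convergence on overlaps
    return x
```

On a small 1D Laplacian with two overlapping windows, the iterates converge to the direct solution; in the actual algorithm the local solves are performed by CG on the local 4DVAR systems and run concurrently.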

Note that
to Ω i and to Γ ij , for i = 1, 2, … , N sub , and j ∈ J i . It is worth noting that each local functional J i,k is obtained starting from a restriction of the global functional J, adding a local term defined on the overlapping regions. Regarding the decomposition in the time direction, we use DA as a predictor operator for the local PDE-based model, providing the values needed for solving the initial value problem in each subinterval.

SOFTWARE MAPPING
We consider the initial boundary value problem for the SWEs introduced in Reference 2. The discrete model comes from the Lax-Wendroff scheme.21 We consider the following experimental setup.
10. proc: number of processing units; 11. N sub : number of subdomains of Ω; 12. N t = 2: number of subintervals of Δ.
We study the performance by using the Parallel Computing Toolbox of MATLAB R2013a. As described in Figure 5, we implement a two-level DD, in time and in space.

F I G U R E 5 Two-level DD in time and in space. As a hybrid computing architecture was used, in order to facilitate shared/distributed communication of the cores/blades processing units, we assign nearby domains that communicate often to shared memory. This means that the DD in space is implemented at the finest level of the computer architecture (the shared memory cores), while the DD in time is implemented at the first level of the decomposition graph, requiring inter-node communications.
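Regarding the Lax-Wendroff discretization mentioned at the beginning of this section, a one-step update for the simpler 1D linear advection equation u_t + c u_x = 0 reads as follows. This Python sketch with periodic boundaries is for illustration only; the paper applies the scheme to the full SWEs, which couple the height and velocity fields:

```python
import numpy as np

def lax_wendroff_step(u, c, dx, dt):
    """One Lax-Wendroff step for 1D linear advection on a periodic grid.

    Second order in space and time; stable for Courant number |c*dt/dx| <= 1.
    """
    nu = c * dt / dx                       # Courant number
    up = np.roll(u, -1)                    # u_{j+1}
    um = np.roll(u, 1)                     # u_{j-1}
    return u - 0.5 * nu * (up - um) + 0.5 * nu**2 * (up - 2 * u + um)
```

Two sanity checks follow from the formula: a constant field is preserved exactly, and at Courant number 1 the update shifts the profile by exactly one grid cell.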
We use the scaling efficiency which is classified as strong and weak scaling.
1. Strong scaling. Strong scaling measures how, for a fixed total problem size, the execution time decreases as the number of processors increases; in the ideal case the speed-up grows linearly with the number of processors.
In Tables 1-4 and Figures 6-8 we fix  = 2 and analyze the speed-up and the efficiency of the algorithm for different values of N p , that is, in each table the spatial mesh size is fixed. As expected, once the problem size is fixed, the efficiency strongly decreases as the DD size (and hence proc = N sub × N t ) increases. The decay is even stronger going from N sub = 32 to N sub = 64; the reason is that in the latter case there are 128 local subdomains mapped onto 64 cores, thus increasing the inter-node communication.
2. Weak scaling. Weak scaling measures whether the efficiency remains constant as the problem size grows with the number of processors. In Table 5, we report the efficiency values obtained by increasing the problem size, that is, increasing N p . We note that, as N sub doubles, in order to achieve a constant efficiency, N p has to increase at a rate slightly greater than 1. The precise rate is given by analyzing the isoefficiency function.
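The quantities reported in Tables 1-5 follow directly from the measured wall-clock times via the definitions of speed-up and efficiency given later in this section. A short Python sketch of that computation (the timing values in the usage below are hypothetical, not the measured data):

```python
def strong_scaling_metrics(t1, timings):
    """Compute speed-up S_p = T_1 / T_p and efficiency E_p = S_p / p.

    t1      : measured sequential time T_1
    timings : dict mapping processor count p -> measured parallel time T_p
    Returns a dict p -> {"speedup": S_p, "efficiency": E_p}.
    """
    out = {}
    for p, tp in sorted(timings.items()):
        s = t1 / tp
        out[p] = {"speedup": s, "efficiency": s / p}
    return out
```

For instance, with T_1 = 100 s and T_2 = 50 s, the speed-up is 2 and the efficiency is 1.0 (ideal); efficiency below 1 quantifies the parallel overhead discussed above.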

Isoefficiency.
In Figure 9, we plot the curve, found by using the MATLAB fitting toolkit, giving the best fit of the points (N sub , N p ) reported in Table 5. In particular, the equation of this curve is
N p = a ⋅ N sub ² + b ⋅ N sub + c, (37)
with a = −9.68 × 10 −2 , b = 1.24 × 10 1 , c = 6.25 × 10 2 . The function in (37) describes how N p should grow in order to keep the efficiency of the DD-4DVAR algorithm constant, that is, it is the so-called isoefficiency function. From (37), observing that a = O(10 −2 ), we can infer that, to get a constant efficiency, N p needs to increase a little more than linearly with N sub .
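The fitting step can be reproduced with a least-squares polynomial fit. This Python sketch stands in for the MATLAB fitting tool; it assumes a quadratic model, consistent with the three coefficients a, b, c, and the sample points below are synthetic, not the Table 5 data:

```python
import numpy as np

def fit_isoefficiency(n_sub, n_p):
    """Least-squares quadratic fit N_p ~ a*N_sub^2 + b*N_sub + c
    of iso-efficiency data points (N_sub, N_p). Returns (a, b, c)."""
    a, b, c = np.polyfit(np.asarray(n_sub, float), np.asarray(n_p, float), 2)
    return a, b, c
```

On points generated exactly from a quadratic, the fit recovers the coefficients; applied to measured (N sub , N p ) pairs, it yields the isoefficiency curve plotted in Figure 9.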
It is interesting to underline that this result strengthens the predictive performance study carried out in Reference 2 on the parallel algorithm. In particular, we recall that in Reference 2 it was proved that the scale-up factor of the parallel algorithm is about 1.5. This result can be seen, in some sense, as a qualitative measure of the suitability of the domain decomposition approach for parallel computing. Since the isoefficiency function measures the effectiveness of the software mapping, and since we found that the scale-up factor and the isoefficiency function are in agreement, this result confirms the appropriate mapping of the parallel algorithm on the hybrid computing architecture. In addition, the small gap between the scale-up factor and the isoefficiency function shows how small the software communication overhead introduced by the mapping on the specific architecture is. Finally, we note that, by restricting the overlapping region to  = 2, for example, the

F I G U R E 2 Schematic description of the DD-4DVAR algorithm. The modules of DD-4DVAR (DD, model reduction, ASM, DD-4DVAR local solution, and DD-4DVAR global solution) are identified, and the Arabic numbers in parentheses refer to the corresponding modules as described in Section 1. In particular, in Section 1 we use the parameter n to denote the outer-loop counter, while r is the inner-loop counter. The outer loop solves the nonlinear aspects of the assimilation problem which, for DD-4DVAR and in general for any 4DVAR technique, includes the integration of the nonlinear model. For each module we report the corresponding solution.

20. S proc (N loc , N) := T 1 (N p , N) / T proc (N loc , N): speed-up of the parallel algorithm; 21. E proc (N loc , N) := S proc (N loc , N) / proc: efficiency of the parallel algorithm.
Validation of the parallel software is performed on the high performance hybrid computing architecture of the SCoPE (Sistema Cooperativo Per Elaborazioni scientifiche multidisciplinari) data center, located at the University of Naples Federico II. The architecture is composed of 8 nodes consisting of distributed memory DELL M600 blades. The blades are connected by 10 Gigabit Ethernet technology, and each of them is composed of 2 Intel Xeon@2.33GHz quad-core processors sharing the same local 16 GB RAM, for 8 cores per blade and 64 cores in total. The best-fit coefficients of the isoefficiency curve in Figure 9 are a = −9.68 × 10 −2 , b = 1.24 × 10 1 , c = 6.25 × 10 2 .

TA B L E 4 Performance results at N p = 1024 and  = 2. T 1 (N p , N) = 1.59 × 10 2 . Columns: N sub , N loc , T proc (N loc , N), S proc (N loc , N), E proc (N loc , N).

F I G U R E 6 Performance results at N p = 640 for N sub = 2, 4, 8. The values of speed-up (left) and efficiency (right) are reported in Table 1. For N sub = 16, speed-up and efficiency strongly decrease.

F I G U R E 7 Performance results at N p = 832 for N sub = 16, 32, 64. The values of speed-up (left) and efficiency (right) are reported in Table 2.

F I G U R E 8 Performance results at N p = 896 for N sub = 32, 64. The values of speed-up (left) and efficiency (right) are reported in Table 3. For N sub = 64, speed-up and efficiency strongly decrease.

TA B L E 5 Isoefficiency. Columns: N sub , N p , E proc (N loc , N).

F I G U R E 9 Isoefficiency curve: the best fitting of the data (N sub , N p ) reported in Table 5.

Remaining items of the experimental setup: for i = 1, … , N sub and k = 1, … , N t , we pose u 0 i,k := {u M (x̃ i , t̃ k )} ĩ∈I i , k̃∈K k , that is, we use the background as local initial values; 13. ad 1 = ad N sub and ad 2 = … = ad N sub −1 : number of subdomains adjacent to Ω 1 , Ω N sub and to Ω 2 , Ω 3 , … , Ω N sub −1 , respectively; 14. : number of inner nodes belonging to the overlap regions; 15. N loc : number of inner nodes of the domain decomposition; 16. Δx and Δt: spatial and temporal step sizes; 17. C := {c i,j } i,j=1, … ,N p ∈ R N p ×N p : the Gaussian correlation structure of the model error, where c i,j =  |i−j| 2 ; T proc (N loc , N): execution time (in seconds) of the parallel algorithm.

TA B L E 1 Performance results at N p = 640 and  = 2. T 1 (N p , N) = 6.04 × 10 0 . TA B L E 2 Performance results at N p = 832 and  = 2. T 1 (N p , N) = 2.97 × 10 1 . TA B L E 3 Performance results at N p = 896 and  = 2. T 1 (N p , N) = 5.90 × 10 1 . Columns: N sub , N loc , T proc (N loc , N), S proc (N loc , N), E proc (N loc , N).