MILAMIN: MATLAB-based finite element method solver for large problems



The finite element method (FEM) combined with unstructured meshes forms an elegant and versatile approach capable of dealing with the complexities of problems in Earth science. Practical applications often require high-resolution models that necessitate advanced computational strategies. We therefore developed “Million a Minute” (MILAMIN), an efficient MATLAB implementation of FEM that is capable of setting up, solving, and postprocessing two-dimensional problems with one million unknowns in one minute on a modern desktop computer. MILAMIN allows the user to achieve numerical resolutions that are necessary to resolve the heterogeneous nature of geological materials. In this paper we provide the technical knowledge required to develop such models without the need to buy a commercial FEM package, to program in a compiler language, or to hire a computer specialist. It has been our special aim that all the components of MILAMIN perform efficiently, individually and as a package. While some of the components rely on readily available routines, we develop others from scratch and make sure that all of them work together efficiently. One of the main technical focuses of this paper is the optimization of the global matrix computations. The performance bottlenecks of the standard FEM algorithm are analyzed, and an alternative approach is developed that sustains high performance for any system size. The applied optimizations eliminate Basic Linear Algebra Subprograms (BLAS) drawbacks when multiplying small matrices, reduce the operation count and memory requirements when dealing with symmetric matrices, and increase data transfer efficiency by maximizing cache reuse. Applying loop interchange allows us to use BLAS on large matrices. In order to avoid unnecessary data transfers between RAM and CPU cache we introduce loop blocking. The optimization techniques are useful in many areas, as demonstrated with our MILAMIN applications for thermal and incompressible flow (Stokes) problems. We use these to provide performance comparisons to other open source as well as commercial packages and find that MILAMIN is among the best performing solutions, in terms of both speed and memory usage. The corresponding MATLAB source code for the entire MILAMIN, including input generation, FEM solver, and postprocessing, is available from the authors and can be downloaded as auxiliary material.

1. Introduction

Geological systems are often formed by multiphysics processes interacting on many temporal and spatial scales. Moreover, they are heterogeneous and exhibit large material property contrasts. In order to understand and decipher these systems, numerical models are frequently employed. Appropriate resolution of the behavior of these heterogeneous systems, without the (over)simplifications of a priori applied homogenization techniques, requires numerical models capable of efficiently and accurately dealing with high-resolution, geometry-adapted meshes. These criteria are usually used to justify the need for special purpose software (commercial finite element method (FEM) packages) or special code development in high-performance compiler languages such as C or FORTRAN. General purpose packages like MATLAB are usually considered not efficient enough for this task. This is reflected in the current literature. MATLAB is treated as an educational tool that allows for fast learning when trying to master numerical methods, e.g., the books by Kwon and Bang [2000], Elman et al. [2005], and Pozrikidis [2005]. MATLAB also facilitates very short implementations of numerical methods that give overview and insight, which is impossible to obtain when dealing with closed black-box routines, e.g., finite elements on 50 lines [Alberty et al., 1999], topology optimization on 99 lines [Sigmund, 2001], and mesh generation on one page [Persson and Strang, 2004]. However, while advantageous from an educational standpoint, these implementations are usually rather slow and run at a speed that is a fraction of the peak performance of modern computers. Therefore the usual approach is to use MATLAB for prototyping, development, and testing only. This is followed by an additional step where the code is manually translated to a compiler language to achieve the memory and CPU efficiency required for high-resolution models.

This paper presents the outcome of a project called “MILAMIN - MILlion A MINute” aimed at developing a MATLAB-based FEM package capable of preprocessing, processing, and postprocessing an unstructured mesh problem with one million degrees of freedom in two dimensions within one minute on a commodity personal computer. Choosing a native MATLAB implementation allows simultaneously for educational insight, easy access to computational libraries and visualization tools, rapid prototyping and development, as well as actual two-dimensional production runs. Our standard implementation serves to provide educational insight into subjects such as implementation of the numerical method, efficient use of the computer architecture and computational libraries, code structuring, proper data layout, and solution techniques. We also provide an optimized FEM version that increases the performance of production runs even further, but at the cost of code clarity.

The MATLAB code implementing the different approaches discussed here is available from the authors and can be downloaded as auxiliary material (see Software S1).

2. Code Overview

A typical finite element code consists of three basic components: preprocessor, processor, postprocessor. The main component is the processor, which is the actual numerical model that implements a discretized version of the governing conservation equations. The preprocessor provides all the input data for the processor and in the present case the main work is to generate an unstructured mesh for a given geometry. The task of the postprocessor is to analyze and visualize the results obtained by the processor. These three components of MILAMIN are documented in the following sections.

3. Preprocessor

Geometrically complex problems promote the use of interface adapted meshes, which accurately resolve the input geometry and are typically created by a mesh generator that automatically produces a quality mesh. The drawback of this approach is that one cannot exploit the advantages of solution strategies for structured meshes, such as operator splitting methods (e.g., ADI [Fletcher, 1997]) or geometric multigrid [Wesseling, 1992] for efficient computation.

A number of mesh generators are freely available. Yet, none of these are written in native MATLAB and fulfill the requirement of automated quality mesh generation for multiple domains. DistMesh by Persson and Strang [2004] is an interesting option as it is simple, elegant, and written entirely in MATLAB. However, lack of speed and proper multidomain support renders it unsuitable for a production code with the outlined goals. The mesh generator chosen is Triangle, developed by J. R. Shewchuk (version 1.6, available at http://www.cs.cmu.edu/~quake/triangle.html; Shewchuk, 2007). Triangle is extremely versatile and stable, and consists of a single file that can be compiled into an executable on all platforms with a standard C compiler. We choose the executable-based file I/O approach, which has the advantage that we can always reuse a saved mesh. The disadvantage is that the ASCII file I/O provided by Triangle is rather slow, which can be overcome by adding binary file I/O as described in the instructions provided in the MILAMIN code repository.

4. Processor

4.1. FEM Outline

In this paper we show two different physical applications of MILAMIN: steady state thermal problems and incompressible Stokes flow (referred to as mechanical problem). This section provides an outline of the governing equations and their corresponding FEM formulation. The numerical implementation and performance discussions follow in subsequent sections.

4.1.1. Thermal Problem

The strong form of the steady state thermal diffusion in the two-dimensional domain Ω is

\frac{\partial}{\partial x}\left(k\,\frac{\partial T}{\partial x}\right) + \frac{\partial}{\partial y}\left(k\,\frac{\partial T}{\partial y}\right) = 0 \quad \text{in } \Omega \tag{1}

where T is temperature, k is the conductivity, and x and y are Cartesian coordinates. The boundary Γ of Ω is divided into two nonintersecting parts: Γ = ΓN ∪ ΓD. Zero heat flux is specified on ΓN (Neumann boundary condition) and the temperature \bar{T} is prescribed on ΓD (Dirichlet boundary condition).

The FEM is based on the weak (variational) formulation of partial differential equations, taking an integral form. For the purpose of this paper we only introduce the basic concepts of this method that are important from an implementation viewpoint. A detailed derivation of the finite element method and a description of the weak formulation of PDEs can be found in textbooks [e.g., Bathe, 1996; Hughes, 2000; Zienkiewicz and Taylor, 2000].

In FEM, the domain Ω is partitioned into nonoverlapping element subdomains Ωe, i.e., \Omega = \bigcup_{e=1}^{nel} \Omega^e, where nel denotes the number of elements. The basic two-dimensional element is a triangle. In the thermal problem discrete temperature values are defined for the nodal points, which can be associated with element vertices, located on its edges, or even reside inside the elements. Introducing shape functions Ni that interpolate temperatures from the nodes Ti to the domains of neighboring elements, an approximation \tilde{T} to the temperature field in Ω is defined as

\tilde{T}(x, y) = \sum_{i=1}^{nnod} N_i(x, y)\, T_i \tag{2}

where nnod is the number of nodes in the discretized domain.

On the basis of the weak formulation that takes the form of an integral over Ω, the problem can now be stated in terms of a system of linear equations. From a computational point of view it is beneficial to evaluate this integral as a sum of integrals over each element Ωe. A single element contribution, the so-called “element stiffness matrix,” to the global system matrix in the Galerkin approach for the thermal problem is given by

K^e_{ij} = \int_{\Omega^e} k^e \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right) d\Omega \tag{3}

where ke is the element specific conductivity. Note that the shape function index in equation (3) corresponds to local numbering of element nodes and must be converted to global node numbers before element matrix Ke is assembled into the global matrix K.

4.1.2. Mechanical Problem

The strong form of the plane strain Stokes flow in Ω is

\begin{aligned}
\frac{\partial}{\partial x}\left(2\mu\,\frac{\partial u_x}{\partial x}\right) + \frac{\partial}{\partial y}\left(\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right) - \frac{\partial p}{\partial x} &= f_x \\
\frac{\partial}{\partial x}\left(\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right) + \frac{\partial}{\partial y}\left(2\mu\,\frac{\partial u_y}{\partial y}\right) - \frac{\partial p}{\partial y} &= f_y \\
\frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} + \frac{p}{\kappa} &= 0
\end{aligned} \tag{4}

where ux and uy are the components of velocity, fx and fy are the components of the body force vector field, p is pressure, and μ denotes viscosity. In our numerical code the incompressibility constraint is enforced by penalizing the bulk deformation with a large bulk modulus κ. The boundary conditions are given as constrained velocity or vanishing traction components. In equation (4) we use the divergence rather than the Laplace form (in the latter, different velocity components are only coupled through the incompressibility constraint) as we expect to deal with strongly varying viscosity. It is also worth noting that even for homogeneous models the computationally advantageous Laplace form may lead to serious defects if the boundary terms are not treated adequately [Limache et al., 2007]. Additionally, our formulation, equation (4), and its numerical implementation are also applicable to compressible and incompressible elastic problems due to the correspondence principle.

In analogy to the thermal problem we introduce the discrete spaces to approximate the velocity components and pressure:

u_x \approx \sum_{i=1}^{nnod} N_i\, u_{x,i}, \qquad u_y \approx \sum_{i=1}^{nnod} N_i\, u_{y,i}, \qquad p \approx \sum_{i=1}^{np} \Pi_i\, p_i \tag{5}

where np denotes the number of pressure degrees of freedom and Πi are the pressure shape functions, which may not coincide with the velocity ones. To ensure the solvability of the resulting system of equations (inf-sup condition [see Elman et al., 2005]), special care must be taken when constructing the approximation spaces. A wrong choice of the pressure and velocity discretization results in spurious pressure modes that may seriously pollute the numerical solution. Our particular element choice is the seven-node Crouzeix-Raviart triangle with quadratic velocity shape functions enhanced by a cubic bubble function and discontinuous linear interpolation for the pressure field [e.g., Cuvelier et al., 1986]. This element is stable and no additional stabilization techniques are required [Elman et al., 2005]. The fact that in our case the velocity and pressure approximations are autonomous leads to the so-called mixed formulation of the finite element method [Brezzi and Fortin, 1991].

With the convention that velocity degrees of freedom are followed by pressure ones in the local element numbering, the stiffness matrix for the Stokes problem is given by [e.g., Bathe, 1996]

K^e \begin{pmatrix} u^e \\ p^e \end{pmatrix} = \begin{pmatrix} f^e \\ 0 \end{pmatrix}, \qquad K^e = \begin{pmatrix} A & Q^T \\ Q & -\dfrac{1}{\kappa} M \end{pmatrix}, \qquad A = \int_{\Omega^e} B^T D\,\mu\, B\, d\Omega, \quad Q = -\int_{\Omega^e} \Pi^T B_{vol}\, d\Omega, \quad M = \int_{\Omega^e} \Pi^T \Pi\, d\Omega \tag{6}

where B is the so-called kinematic matrix transforming velocity into strain rate \dot{\varepsilon} (we use here the engineering convention for the shear strain rate)

\dot{\varepsilon} = \begin{pmatrix} \dot{\varepsilon}_{xx} \\ \dot{\varepsilon}_{yy} \\ \dot{\gamma}_{xy} \end{pmatrix} = \begin{pmatrix} \partial u_x/\partial x \\ \partial u_y/\partial y \\ \partial u_x/\partial y + \partial u_y/\partial x \end{pmatrix} = B\, u^e \tag{7}

The matrix D extracts the deviatoric part of the strain rate, converts from the engineering convention to the standard shear strain rate, and includes the conventional factor 2. The bulk strain rate is computed according to \dot{\varepsilon}_{vol} = B_{vol}\, u^e, and the pressure is the projection of this field onto the pressure approximation space

M p^e = -\kappa \int_{\Omega^e} \Pi^T \dot{\varepsilon}_{vol}\, d\Omega \quad\Longrightarrow\quad p^e = \kappa\, M^{-1} Q\, u^e \tag{8}

With the chosen approximation spaces, the linear pressure shape functions Π are spanned by the corner nodal values that are defined independently for neighboring elements. Thus it is possible to invert M on element level (the so-called static condensation) and consequently avoid the pressure unknowns in the global system. Since the pressure part of the right-hand-side vector is set to zero, this results in the following velocity Schur complement:

\left( A + \kappa\, Q^T M^{-1} Q \right) u^e = f^e \tag{9}

Once the solution to the global counterpart of (9) is obtained, the pressure can be restored afterward according to (8). The resulting global system of equations is not only symmetric, but also positive-definite as opposed to the original system (6). Unfortunately, the global matrix becomes ill-conditioned for penalty parameter values corresponding to a satisfactorily low level of the flow divergence. It is possible to circumvent this by introducing Powell and Hestenes iterations [Cuvelier et al., 1986] and keeping the penalty parameter κ moderate compared to the viscosity μ:

\begin{aligned}
p^0 &= 0 \\
u^{i+1} &= \left( A + \kappa\, Q^T M^{-1} Q \right)^{-1} \left( f - Q^T p^i \right) \\
p^{i+1} &= p^i + \kappa\, M^{-1} Q\, u^{i+1}
\end{aligned} \tag{10}

In the above iteration scheme the matrices A, Q, M represent global assembled versions rather than single element contributions.

4.1.3. Isoparametric Elements

To exploit the full flexibility of FEM, we employ isoparametric elements. Each element in physical space is mapped onto the reference element with fixed shape, size, and orientation. This geometrical mapping between local (ξ, η) and global (x, y) coordinates of an element is realized using the same shape functions Ni that interpolate physical fields:

x(\xi, \eta) = \sum_{i=1}^{nnodel} N_i(\xi, \eta)\, x_i \tag{11}
y(\xi, \eta) = \sum_{i=1}^{nnodel} N_i(\xi, \eta)\, y_i \tag{12}

where nnodel is the number of nodes in the element. The local linear approximation to this mapping is given by the Jacobian matrix J:

J = \begin{pmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial x}{\partial \eta} \\ \dfrac{\partial y}{\partial \xi} & \dfrac{\partial y}{\partial \eta} \end{pmatrix} \tag{13}

The shape function derivatives with respect to global coordinates (x, y) are calculated using the inverse of the Jacobian and the shape function derivatives with respect to local coordinates (ξ, η):

\left( \frac{\partial N_i}{\partial x},\ \frac{\partial N_i}{\partial y} \right) = \left( \frac{\partial N_i}{\partial \xi},\ \frac{\partial N_i}{\partial \eta} \right) J^{-1} \tag{14}

Thus the element matrix from equation (3) is now given by

K^e_{ij} = \int_{\Omega_{ref}} k^e \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right) |J|\, d\Omega \tag{15}

where |J| is the determinant of the Jacobian, taking care of the area change introduced by the mapping, and Ωref is the domain of the reference element. To avoid symbolic integration, equation (15) can be integrated numerically:

K^e_{ij} \approx \sum_{k=1}^{nip} W_k\, k^e \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right)\bigg|_{(\xi_k, \eta_k)} \left| J(\xi_k, \eta_k) \right| \tag{16}

Here the integral is transformed into a sum over nip integration points located at (ξk, ηk), where the individual summands are evaluated and weighted by point specific Wk. For numerical integration rules for triangular elements, see, e.g., Dunavant [1985]. The numerical integration of the element matrix arising in the mechanical case is analogous.

In the following we first show the straightforward implementation of the global matrix computation and investigate its efficiency. It proves to be unsuited for high-performance computing in the MATLAB environment. We then introduce a different approach, which solves the identified problems. Finally, we discuss how to build sparse matrix data structures, apply boundary conditions, solve the system of linear equations and perform the Powell and Hestenes iterations.

4.2. Matrix Computation: Standard Algorithm

4.2.1. Algorithm Description

The algorithm outlined in Code Fragment 1 (see Figure 1) represents the straightforward implementation of section 4.1. We tried to use intuitive variable and index names; they are explained in Table A1. The details of the algorithm are described in the following (Roman numerals correspond to the comments in Code Fragment 1).

Figure 1.

Code Fragment 1 shows the standard matrix computation.

i.) The outermost loop of the standard algorithm is the element loop. Before the actual matrix computation, general element-type specific data such as integration points IP_X and weights IP_w are assigned. The derivatives of the shape functions dNdu with respect to the local (ξ, η) coordinates are evaluated in the integration points IP_X. All arrays used during the matrix computation procedure are allocated in advance, e.g., K_all.

ii.) Inside the loop over all elements the code begins with reading element-specific information, such as indices of the nodes belonging to the current element, coordinates of the nodes, and element conductivity, viscosity and density.

iii.) For each element the following loop over integration points performs numerical integration of the underlying equations, which results in the element stiffness matrix K_elem[nnodel,nnodel]. In the case of the mechanical code the additional matrices A_elem[nedof,nedof], Q_elem[nedof,np], and M_elem[np,np] are required. All of the above arrays must be cleared before the integration point loop, together with the right-hand-side vector Rhs_elem.

iv.) a) Inside the integration point loop the precomputed shape function derivatives dNdui are extracted for the current integration point. b) In the chosen element type the pressure is interpolated linearly in the global coordinates. The pressure shape functions Pi at an integration point are obtained as the solution of the system P*Pi = Pb, where the first equation enforces that the shape functions Pi sum to unity.

v.) The Jacobian J[ndim,ndim] is calculated for each integration point by multiplying the element's nodal coordinates matrix ECOORD_X[ndim,nnodel] by dNdui[nnodel,ndim]. Furthermore its determinant, detJ, and inverse, invJ[ndim,ndim], are obtained with the corresponding MATLAB functions.

vi.) The derivatives versus global coordinates, dNdx[nnodel, ndim], are obtained by dNdx = dNdui*invJ according to equation (14).

vii.) a) The thermal element stiffness matrix contribution is obtained according to equation (16) and implemented as K_elem = K_elem + weight*ED*(dNdX*dNdX'). b) In the mechanical code the kinematic matrix B is formed according to equation (7), and A_elem, Q_elem, and M_elem are computed according to equation (6).

viii.) The pressure degrees of freedom are eliminated at this stage. It is possible to invert M_elem locally because the pressure degrees of freedom are not coupled across elements, thus there is no need to assemble them into the global system of equations. For large viscosity variations it is beneficial to relate the penalty factor PF to the element's viscosity to improve the condition number of the global matrix.

ix.) a) The lower (including the diagonal) part of the element stiffness matrix is written into the global storage, relying on the symmetry of the system. b) In the mechanical code the Q_elem and invM_elem matrices are stored for each element in order to avoid recomputing them during the Powell and Hestenes iterations.
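To make steps i–ix above concrete, the following is a minimal, self-contained MATLAB sketch of the standard element-by-element matrix computation for the thermal problem. It is not the authors' Code Fragment 1: for brevity a 3-node linear triangle with a one-point integration rule is hard-coded (MILAMIN uses 6-node triangles with nip = 6), and the mesh arrays GCOORD and ELEM2NODE, the conductivity array Cond, and nel are assumed to exist.

% Sketch of the standard algorithm (thermal problem); element data hard-coded
% for a linear triangle with a one-point integration rule.
nnodel = 3; ndim = 2; nip = 1;
IP_w   = 0.5;                              % weight of the one-point rule (area of reference triangle)
dNdu   = [-1 -1; 1 0; 0 1];                % dN/dxi, dN/deta for N = [1-xi-eta, xi, eta]
K_all  = zeros(nnodel*(nnodel+1)/2, nel);  % preallocated storage for element lower triangles
for iel = 1:nel                                           % i.)  element loop
    ECOORD_X = GCOORD(:, ELEM2NODE(:,iel));               % ii.) nodal coordinates [ndim x nnodel]
    ED       = Cond(iel);                                 %      element conductivity (array name assumed)
    K_elem   = zeros(nnodel);                             % iii.) clear element matrix
    for ip = 1:nip                                        %      integration point loop
        dNdui  = dNdu;                                    % iv.) local derivatives at this point
        J      = ECOORD_X*dNdui;                          % v.)  Jacobian, equation (13)
        detJ   = det(J);
        invJ   = inv(J);
        dNdX   = dNdui*invJ;                              % vi.) global derivatives, equation (14)
        K_elem = K_elem + IP_w(ip)*detJ*ED*(dNdX*dNdX');  % vii.) equation (16)
    end
    indx = 1;                                             % ix.) store lower triangle of K_elem
    for i = 1:nnodel
        for j = i:nnodel
            K_all(indx, iel) = K_elem(j, i);
            indx = indx + 1;
        end
    end
end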

MATLAB provides a framework for scientific computing that is freed from the burden of conventional programming languages, which require detailed variable declarations and do not provide native access to solvers, visualization, file I/O, etc. However, the ease of code development in MATLAB comes with a loss of some performance, especially when certain recommended strategies are not followed. The more obvious performance considerations have already gone into the above standard implementation, and we would like to point these out:

1. Memory allocation and explicit variable declaration have been performed (see the toy example after this list). Although not formally required, it is advisable to explicitly declare variables, including their size and type. If variables are not declared with their final size, but are instead successively extended (filled in) during loop evaluation, a large penalty has to be paid for the continuous and unnecessary memory management. Hence, all variables that could potentially grow in size during loop execution are preallocated, e.g., K_all. Variables such as ELEM2NODE that only have to store integer numbers should be declared accordingly, int32 in the case of ELEM2NODE instead of MATLAB's default variable type double. This reduces both the amount of memory required to store this large array and the time required to access it, since less data must be transferred.

2. Data layout has been optimized to facilitate memory access by the CPU. For example, the indices of the nodes of each element must be stored in neighboring memory locations, and similarly the x-y-z coordinates of every node. The actual numbering of nodes and elements also has a visible effect on cache reuse inside the element loop, similarly to the sparse matrix-vector multiplication problem [Toledo, 1997].

3. Multiple data transfers and computations have been avoided. Generally, statements should appear in the outermost possible loop to avoid repeated transfer and computation of identical data. This is why the shape function derivatives with respect to local coordinates, evaluated at the integration points, are precomputed outside the element loop (as opposed to inside the integration loop), and the nodal coordinates are extracted before the integration loop.
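A toy illustration of point 1 (the array size and timings are arbitrary and machine dependent, and the last line assumes ELEM2NODE already exists; this is not MILAMIN code):

n = 1e6;
tic; a = [];         for i = 1:n, a(i) = i; end; t_grow = toc;   % array grows every iteration
tic; b = zeros(n,1); for i = 1:n, b(i) = i; end; t_pre  = toc;   % preallocated once
fprintf('growing: %.2f s   preallocated: %.2f s\n', t_grow, t_pre);
ELEM2NODE = int32(ELEM2NODE);   % integer connectivity: 4 instead of 8 bytes per index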

4.2.2. Performance Analysis

In order to analyze the performance of the standard matrix computation algorithm we run corresponding tests on an AMD Opteron system with 64 bit Red Hat Enterprise Linux 4 and MATLAB 2007a using GoTo BLAS. This system has a peak performance of 4.4 gigaflops per core, i.e., it is theoretically capable of performing 4.4 billion double precision floating point operations per second (flops). The specific element types used are 6-node triangles (quadratic shape functions) with 6 integration points for the thermal problem and 7-node triangles with 12 integration points for the mechanical problem.

In the thermal problem, results are obtained for an unstructured mesh consisting of approximately 1 million nodes and 0.5 million elements. For this model the previously described matrix computation took 65 s, during which 324 floating point operations per integration point per element were calculated. This corresponds to 15 Megaflops (Mflops) or approximately 0.4% of the peak performance. Analysis of the code with MATLAB's built-in profiler revealed that a significant amount of time was spent on the calculation of the determinant and inverse of the Jacobian. Therefore, in further tests these calls were replaced by explicit calculations of detJ and invJ. The final performance achieved by this algorithm was 30 Mflops, which is still less than one percent of the peak performance and equivalent to a peak CPU performance that was reached by commodity computers more than a decade ago.

Profiling the improved standard algorithm revealed that most of the computational time was spent on matrix multiplications. This means that the efficiency of the analyzed implementation depends mainly on the efficiency of dense matrix by matrix multiplications inside the integration point loop. In order to perform these calculations MATLAB uses hardware-tuned, high-performance BLAS libraries (Basic Linear Algebra Subprograms; see Dongarra et al. [1990]), which reach up to 90% of the CPU peak performance, a value from which the analyzed code is far away.

The cause of this bad performance is that the matrix by matrix multiplications inside the integration point loop operate on very small matrices, for which BLAS libraries are known not to work well due to the introduced per-call overhead. The same observation can be made when writing the standard algorithm in a compiler language such as C and relying on BLAS for the matrix multiplications, although the actual performance in this case is higher than in MATLAB. In C a possible solution is to explicitly write out the small matrix by matrix multiplications, which results in a more efficient code. In MATLAB, however, this is not a practical alternative, as explicitly writing out matrix multiplications leads to unreadable code without substantial performance gains. The above performance considerations apply equally to the mechanical code.
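A toy benchmark of this per-call overhead (the matrix sizes are arbitrary and the timings are machine dependent; this is not MILAMIN code):

nel = 1e5;  A = rand(3,7);  C = rand(7,nel);
tic;
for i = 1:nel                 % one BLAS call per tiny multiplication
    s = A*C(:,i);
end
t_small = toc;
tic; S = A*C; t_large = toc;  % a single call on one large matrix
fprintf('many small calls: %.3f s   one large call: %.3f s\n', t_small, t_large);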

In conclusion, the standard algorithm is a viable option when writing compiler code. However, the achievable performance in MATLAB is unsatisfactory so we developed a more efficient approach, which is presented in the following section.

Remark 1: Measuring code performance

Since no flops measure exists in MATLAB, the number of operations must be manually calculated on the basis of code inspection and divided by the computational time. To provide more meaningful performance measures only the number of necessary floating point operations may be considered, e.g., the redundant computations of the upper triangular entries in the standard matrix contribute to the flop count, which artificially increases the measured performance. However, it is not necessarily the case that the algorithm with the lowest operation count is the fastest in terms of execution time. We restrain from adjusting the actual flop counts in this paper.
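A minimal sketch of such a manual measurement (nel and nip are assumed to be defined; the operation count per integration point is the one quoted in section 4.2.2 for the thermal element):

ops_per_ip = 324;                        % counted by code inspection
t = tic;
% ... run the matrix computation here ...
elapsed = toc(t);
Mflops  = 1e-6*nel*nip*ops_per_ip/elapsed;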

4.3. Matrix Computation: Optimized Algorithm

In this section we explain how to efficiently compute the local stiffness matrices. This optimization strategy is common to both problems considered (thermal and mechanical). For simplicity, we present it for the thermal problem. Overall performance benchmarks and application examples are provided for both types of problems in subsequent sections.

The small matrix by small matrix multiplications in the integration loop nested inside the loop over elements are the bottleneck of the standard algorithm. Written out in terms of loops, these matrix multiplications represent another three loops, for a total of five. Since the element loop carries no data dependencies, it can be interchanged with the three innermost loops, so that each small multiplication effectively becomes part of a small matrix by large matrix multiplication.

This loop reordering does not change the total number of operations. However, the number of BLAS calls is greatly reduced (ndim*nip versus nel*nip in the standard approach), and the amount of computation done per function call is drastically increased. Consequently, the overhead problem vanishes, leading to a substantial performance improvement. Unfortunately, the performance decreases once a certain number of elements is exceeded. The reason for this is that the data required for the operation no longer fits into the CPU's cache, which inhibits cache reuse within the integration point loop. The remedy is to operate on blocks of elements of the size for which the observed performance is best. Once a block is processed, the results are written to the main memory and the data required by the next block is copied into the cache. Data required for every block should fit (reside) in the cache at all times. The ideal block size depends on the cache structure of a CPU and must be determined system and problem specifically. This computing strategy is called “blocking” and is implemented as a part of the optimized algorithm. Incidentally, this entire approach to optimizing the FEM matrix computation is similar to vector computer implementations [e.g., Ferencz and Hughes, 1998; Hughes et al., 1987; Silvester, 1988].

4.3.1. Algorithm Description

Code Fragment 2 shows the implementation of the optimized matrix computation algorithm (see Figure 2). The key operations are explained and compared to the standard algorithm in the following.

Figure 2.

Code Fragment 2 shows the optimized finite element global matrix computation.

i.) The outermost loop of the optimized matrix computation is the block loop. Before this loop is entered, required arrays (IP_X, IP_w, dNdu) are assigned and necessary variables are allocated.

ii.) Inside the block loop the code begins with reading element specific information. Since we simultaneously operate on nelblo elements, all the corresponding global data blocks are copied into local arrays ECOORD_x, ECOORD_y, and ED, and are used repeatedly inside the integration loop.

iii.) For the entire block of elements, the loop over integration points performs numerical integration of the element matrices K_block[nelblo, nnodel*(nnodel+1)/2].

iv.) As in the standard algorithm, every iteration of the integration point loop begins by reading precomputed dNdu arrays.

v.) The Jacobian of the standard algorithm, J[ndim,ndim], is replaced by ndim matrices, Jx[nelblo,ndim] and Jy[nelblo,ndim], containing the individual rows of the Jacobian evaluated at the current integration point for all elements of the current block. Jx and Jy are calculated by multiplying the nodal coordinates by the shape function derivatives, e.g., Jx[nelblo,ndim] = ECOORD_x[nnodel,nelblo]'*dNdui[nnodel,ndim]. Thus, instead of nelblo*nip matrix multiplications of ECOORD_X[ndim,nnodel] by dNdui[nnodel,ndim], ndim*nip multiplications involving the larger matrices ECOORD_x and ECOORD_y are performed, i.e., the same work is done with fewer multiplications of larger matrices. Once the Jacobian is obtained, its determinant, detJ, and inverse, split into invJx and invJy, are explicitly computed using simple operations on vectors.

vi.) The derivatives with respect to the global coordinates (x, y), dNdx[nelblo,nnodel] and dNdy[nelblo,nnodel], are obtained by multiplying invJx and invJy by the transpose of dNdui. Again, fewer multiplication calls involving larger matrices are performed.

vii.) The local stiffness matrix contribution for all the elements in the block, K_block[nelblo,nnodel*(nnodel+1)/2], is computed according to equation (16). Note that exploiting symmetry allows for calculation of only the lower triangle of stiffness matrices, which substantially reduces the operation count.

viii.) After the numerical integration of K_block is completed, the results are written into the global storage K_all, again exploiting symmetry by storing only the lower triangular part.

ix.) The number of elements remaining in the final block might be smaller than nelblo. Consequently, nelblo and the sizes of several arrays must be adjusted for this last block.
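The following is a minimal, self-contained MATLAB sketch of the blocked algorithm for the thermal problem, not the authors' Code Fragment 2: as before, a 3-node linear triangle with a one-point rule is hard-coded, and GCOORD, ELEM2NODE, Cond, and nel are assumed to exist; nelblo is the block size.

% Sketch of the blocked (optimized) matrix computation for the thermal problem.
nnodel = 3; ndim = 2; nip = 1; nelblo = 1000;        % block size: tune to the CPU cache
IP_w   = 0.5;
dNdu   = [-1 -1; 1 0; 0 1];
K_all  = zeros(nnodel*(nnodel+1)/2, nel);
nelblo = min(nel, nelblo);
il     = 1;                                          % first element of the current block
for ib = 1:ceil(nel/nelblo)                          % i.)   block loop
    iu       = min(il+nelblo-1, nel);                % ix.)  the last block may be smaller
    els      = il:iu;  nelb = numel(els);
    ECOORD_x = reshape(GCOORD(1, ELEM2NODE(:,els)), nnodel, nelb);   % ii.) block data
    ECOORD_y = reshape(GCOORD(2, ELEM2NODE(:,els)), nnodel, nelb);
    ED       = Cond(els);  ED = ED(:);               %      element conductivities [nelb x 1]
    K_block  = zeros(nelb, nnodel*(nnodel+1)/2);     % iii.)
    for ip = 1:nip                                   %      integration point loop
        dNdui  = dNdu;                               % iv.)
        Jx     = ECOORD_x'*dNdui;                    % v.)  first rows of all Jacobians [nelb x ndim]
        Jy     = ECOORD_y'*dNdui;                    %      second rows of all Jacobians
        detJ   = Jx(:,1).*Jy(:,2) - Jx(:,2).*Jy(:,1);%      vectorized determinants
        invdet = 1./detJ;
        invJx  = [ Jy(:,2).*invdet, -Jy(:,1).*invdet];  %   first columns of the inverse Jacobians
        invJy  = [-Jx(:,2).*invdet,  Jx(:,1).*invdet];  %   second columns of the inverse Jacobians
        dNdx   = invJx*dNdui';                       % vi.) global derivatives [nelb x nnodel]
        dNdy   = invJy*dNdui';
        weight = IP_w(ip)*detJ.*ED;
        indx   = 1;                                  % vii.) lower triangles of all element matrices
        for i = 1:nnodel
            for j = i:nnodel
                K_block(:,indx) = K_block(:,indx) + weight.*(dNdx(:,i).*dNdx(:,j) + dNdy(:,i).*dNdy(:,j));
                indx = indx + 1;
            end
        end
    end
    K_all(:, els) = K_block';                        % viii.) write block results to global storage
    il = iu + 1;
end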

4.3.2. Performance Analysis

To illustrate the performance of the optimized matrix computation, systematic tests were run with the same 1 million node problem that was used for the performance analysis of the standard algorithm. Since larger matrices resulting from larger block sizes should yield better BLAS efficiency, the performance in Mflops is plotted versus the number of elements in a block; see Figure 3. This plot confirms the arguments for the introduction of the blocking algorithm. Starting from approximately the performance of the standard algorithm, a steady increase can be observed up to ∼350 Mflops, which on the test system is reached for a block of ∼1000 elements for the thermal problem. A further increase of the block size leads to a performance decrease toward a stable level of ∼120 Mflops due to the lack of cache reuse in the integration point loop. Compared to the standard version, the optimized matrix computation achieves a 20-fold speedup in terms of flops performance. Since the optimized algorithm performs fewer operations (computation of only the lower triangular part of the symmetric element matrix), its execution time is actually more than 30 times shorter.

Figure 3.

Performance of optimized matrix computation versus block size.

The achieved 350 Mflops corresponds to only ∼8% of the peak CPU performance. Profiling the code revealed that for the test problem approximately half of the time was spent on reading and writing variables from and to RAM (e.g., nodal coordinates and element matrices). This is constrained by the memory bandwidth of the hardware, which on current computer architectures is often a bigger bottleneck than the CPU performance. Compared to C implementations, the optimized matrix computation performance is better than that of the straightforward standard algorithm using BLAS, but more than a factor of 3 slower than what can be achieved by explicitly writing out the matrix multiplications.

In the mechanical code, the peak flops performance is similar. Note that in this case the optimal block size is smaller due to the larger workspace of the method; see Figure 3.

4.4. Matrix Assembly: Triplet to Sparse Format Conversion

The element stiffness matrices stored in K_all must be assembled into the global stiffness matrix K. The row and column indices (K_i and K_j) that specify where the individual entries of K_all have to be stored in the global system are commonly known as the triplet sparse matrix format [e.g., Davis, 2006]. Since we only use lower triangular entries, special care must be taken so that the indices referring to the upper triangle are not created; see Code Fragment 3. Note that K_i and K_j hold duplicate entries, and the purpose of the MATLAB sparse function is to sum and eliminate them.
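A two-element toy example of this duplicate summation (not MILAMIN code):

K = sparse([1 2 1], [1 2 1], [3 5 4]);   % the (1,1) entry appears twice in the triplets
full(K)                                   % ans = [7 0; 0 5]: duplicates have been summed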

While the creation of the triplet format is fast, the call to sparse gives some concerns. MATLAB's sparse implementation requires that K_i and K_j are of type double, which is memory- and performance-wise inefficient. In addition, sparse itself is rather slow, especially if compared to the time spent on the entire matrix computation. The equivalent function sparse2, provided by T. A. Davis within the CHOLMOD package, is substantially faster and does not require the index arrays K_i and K_j to be converted to double precision. Code Fragment 3 presents in detail how to create the global system matrix.

Code Fragment 3 shows the global sparse matrix assembly.

% CREATE TRIPLET FORMAT INDICES
indx_j = repmat(1:nnodel,nnodel,1); indx_i = indx_j';
indx_i = tril(indx_i); indx_i = indx_i(:); indx_i = indx_i(indx_i>0);
indx_j = tril(indx_j); indx_j = indx_j(:); indx_j = indx_j(indx_j>0);

K_i = ELEM2NODE(indx_i,:);
K_j = ELEM2NODE(indx_j,:);
K_i = K_i(:);
K_j = K_j(:);

% SWAP INDICES REFERRING TO UPPER TRIANGLE
indx      = K_i < K_j;
tmp       = K_j(indx);
K_j(indx) = K_i(indx);
K_i(indx) = tmp;

K_all = K_all(:);

% CONVERT TRIPLET DATA TO SPARSE MATRIX
K = sparse2(K_i, K_j, K_all);
clear K_i K_j K_all;

The triplet format is converted into the sparse matrix K with one single call to sparse2. Assembling smaller sparse matrices for blocks of elements and calling sparse consecutively would reduce the workspace for the auxiliary arrays; however, it would also slow down the code. Therefore, as long as the K_i, K_j, and K_all arrays are not the memory bottleneck, it is beneficial to perform the global conversion. Once K is created, the triplet data is cleared in order to free as much memory as possible for the solution stage. In the mechanical code the Q and M^{-1} matrices are stored in sparse format for later reuse in the Powell and Hestenes iterations.

Remark 2: Symbolic approach to sparse matrix assembly.

In general the auxiliary arrays can be altogether avoided with a symbolic approach to sparse matrices. While the idea of sparse storage is the elimination of zero entries, in a symbolic approach all possible nonzero entries are stored and initialized to zero. During the computation of element stiffness matrices, global locations of their entries can be found at a small computational cost, and corresponding values are incrementally updated. Also, this symbolic storage pattern can be reused between subsequent time steps, as long as the mesh topology is not changed. Unfortunately, this improvement cannot be implemented in MATLAB as zero entries are automatically deleted.

4.5. Boundary Conditions

The implemented models have two types of boundary conditions: vanishing fluxes and Dirichlet. While the former automatically results from the FEM discretization, the latter must be specified separately, which usually leads to a modification of the global stiffness matrix. These modifications may, depending on the implementation, cause loss of symmetry, changes in the sparsity pattern and row addressing of K, all of which can lead to a badly performing code.

An elegant and sufficiently fast approach is to separate the degrees of freedom of the model into Free (indices of unconstrained degrees of freedom) and Bc_ind, where Dirichlet boundary conditions with corresponding values Bc_val are applied. Since the solution values in Bc_ind are known, the corresponding equations can be eliminated from the system of equations by modifying the right-hand side of the remaining degrees of freedom accordingly. This is implemented as shown in Code Fragment 4.

Code Fragment 4 shows the boundary condition implementation for the thermal problem.

Free         = 1:nnod;
Free(Bc_ind) = [];
TMP          = K(:,Bc_ind) + cs_transpose(K(Bc_ind,:));
Rhs          = Rhs - TMP*Bc_val';
K            = K(Free,Free);
T            = zeros(nnod,1);
T(Bc_ind)    = Bc_val;

Since only the lower part of the global matrix is stored, we need to restore the remaining parts of the affected columns by transposing the corresponding rows.
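For completeness, a minimal sketch of how the reduced system can subsequently be solved and scattered back into T (MILAMIN's actual solver, based on SuiteSparse, is described in section 4.6; the symmetrization line is only needed because K stores the lower triangle):

Ks      = K + tril(K,-1)';      % restore the full symmetric matrix from the lower triangle
T(Free) = Ks\Rhs(Free);         % solve for the free degrees of freedom only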

4.6. System Solution

We have ensured that the global system of linear equations under consideration is symmetric, positive-definite, and sparse. It has the form

K\, T = Rhs \tag{17}

where K is the stiffness matrix, T the unknown temperature vector, and Rhs the right-hand side. One of the fastest and most memory-efficient direct solvers for this type of system is CHOLMOD, a sparse supernodal Cholesky factorization package developed by T. Davis [Davis and Hager, 2005; Y. Chen et al., Algorithm 8xx: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate, submitted to ACM Transactions on Mathematical Software, 2007; T. A. Davis and W. W. Hager, Dynamic supernodes in sparse Cholesky update/downdate and triangular solves, submitted to ACM Transactions on Mathematical Software, 2007]; see the report by Gould et al. [2007]. Newer versions of MATLAB (2006a and later) use this solver, which is substantially faster than the previous implementation. When symmetric storage is not exploited, CHOLMOD can be invoked through the backslash operator: T = K\Rhs (make sure that the matrix K is numerically symmetric, otherwise MATLAB will invoke a different, slower solver).

However, it is best to use CHOLMOD and the related parts by installing the entire package from the developer's SuiteSparse Web site. This provides access to cholmod2, which is capable of dealing with only upper triangular input data and precomputed permutation (reordering) vectors. SuiteSparse also contains lchol, a Cholesky factorization operating only on lower triangular matrices, which is faster and more memory efficient than MATLAB's chol equivalent. Reusing the Cholesky factor L during the Powell and Hestenes iterations in the mechanical problem greatly reduces the computational cost of achieving a divergence free flow solution.

The mentioned reuse of reordering data is possible as long as the mesh topology remains identical, which even in our large strain flow calculations is the case for many time steps. The reordering step decreases factorization fill-in and consequently improves memory and CPU efficiency [Davis, 2006], but is a rather costly operation compared to the rest of the Cholesky algorithm. Different reordering schemes can be used, and we compare two of them in Figure 4: AMD (Approximate Minimum Degree) and METIS. While AMD is faster during the reordering step, it results in slower Cholesky factorization and forward and back substitution. If the reordering can be reused for a large number of steps, it is recommended to rely on METIS, which is accessible in MATLAB through the SuiteSparse package.
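A minimal sketch of the reorder-once, factorize-once, substitute-many pattern using only built-in MATLAB functions (MILAMIN itself uses lchol, cs_lsolve, and cs_ltsolve from SuiteSparse and stores only the lower triangle; here K is assumed to be the reduced, fully symmetric matrix and Rhs the reduced right-hand side):

perm = amd(K);                        % fill-reducing ordering; reusable while the mesh topology is fixed
L    = chol(K(perm,perm), 'lower');   % Cholesky factorization, computed once
x(perm,1) = L'\(L\Rhs(perm));         % forward and back substitution; repeat for every new Rhs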

Figure 4.

Performance analysis of the different steps of the Cholesky algorithm with different reorderings for our one million degrees of freedom thermal test problem.

4.7. Powell and Hestenes Iterations

In the thermal code, the solution vector is obtained by calling the forward and back substitution routines with the Cholesky factor and the adequately permuted right-hand-side vector. During the second substitution phase the upper Cholesky factor is required. However, instead of explicitly forming it through transposition of the stored lower factor, it is advantageous to call cs_ltsolve, which operates directly on the lower factor and performs the required back substitution.

In the MILAMIN flow solver the incompressibility constraint is achieved through an iterative penalty method, i.e., the bulk part of the deformation is suppressed with a large bulk modulus (penalty parameter) κ. In a single-step penalty method there is a trade-off between the incompressibility of the flow solution and the condition number of the global equation system. This can be avoided by using a relatively small κ, which ensures a good condition number, and then iteratively improving the incompressibility of the flow. Note that for the chosen Crouzeix-Raviart element, pressure is discontinuous between elements and the corresponding degrees of freedom can be eliminated element-wise (no global system solution required). Pressure increments can be computed from the velocity solution vector and the stored Q and M^{-1} matrices. These pressure increments are sent to the right-hand side of the system and accumulated in the total pressure. The code fragment for these so-called Powell and Hestenes iterations is given in Code Fragment 5.

Code Fragment 5 shows the Powell and Hestenes iterations.

while (div_max>div_max_uz && uz_iter<uz_iter_max)
    uz_iter = uz_iter + 1;
    %FORWARD AND BACK SUBSTITUTION
    Vel(Free(perm)) = cs_ltsolve(L,cs_lsolve(L,Rhs(Free(perm))));
    %COMPUTE QUASI-DIVERGENCE
    Div = invM*(Q*Vel);
    %UPDATE RHS
    Rhs = Rhs - PF*(Q'*Div);
    %UPDATE TOTAL PRESSURE
    Pressure = Pressure + PF*Div;
    %CHECK INCOMPRESSIBILITY
    div_max = max(abs(Div(:)));
end

5. Postprocessor

The results of a numerical model are only useful if fast and precise analysis and visualization is possible. One of the main aspects to achieve this is to avoid loops. For triangular meshes trisurf is the natural choice for two and three dimensional data visualization as it employs the usual FEM structures: connectivity (ELEM2NODE), coordinates (GCOORD), and data (T). This allows for visualization of FEM models with more than one million elements in less than one second.

A problem that often arises is the visualization of discontinuous data, such as pressure in mixed formulations of deformation problems. The remedy is to abandon the nodal connectivity and to create a new arrangement, where physical nodes are listed separately for every element that accesses them. The same can also be done for meshes other than triangular ones by creating the corresponding connectivity (ELEM2NODE) and calling:

Code Fragment 6 shows the postprocessor.

patch('faces', ELEM2NODE, 'vertices', GCOORD', 'facevertexcdata', T);
shading interp;
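A minimal sketch of such a “broken” connectivity for element-wise discontinuous pressure (the array layouts are assumptions: ELEM2NODE is nnodel x nel with the corner nodes in rows 1:3, GCOORD is ndim x nnod, and Pressure holds three values per element, ordered element by element):

nel         = size(ELEM2NODE, 2);
GCOORD_b    = GCOORD(:, ELEM2NODE(1:3,:));   % duplicate the corner nodes for every element
ELEM2NODE_b = reshape(1:3*nel, 3, nel);      % each element now owns its three vertices
patch('faces', ELEM2NODE_b', 'vertices', GCOORD_b', 'facevertexcdata', Pressure(:));
shading interp;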

6. MILAMIN Performance Analysis

6.1. Overall Performance

The overall performance of MILAMIN versus the number of nodes is analyzed in Figure 5. The goal of MILAMIN to perform a complete FEM analysis for one million unknowns in one minute is reached for the thermal as well as the mechanical problem. All components of MILAMIN scale linearly with the number of nodes; the only exception is the direct solver, which shows super-linear scaling. The performance details are discussed in the following sections.

Figure 5.

Overall performance results for MILAMIN, given as the total time spent on the problem and the direct solver contribution.

6.2. Component Performance

Figure 6 shows the total amount of time for the one million degrees of freedom (DOFs) test problems split into the individual components of MILAMIN. The contributions of the boundary conditions and the postprocessor are minor. The time taken by the preprocessor is also not relevant, especially if the same (Lagrangian) mesh is used for many time steps. A major achievement of MILAMIN is the performance of the optimized matrix computation, which is 15 to 30 times better than the standard algorithm. The matrix assembly done by sparse2 is one of the major contributors to the total time, but cannot be optimized without a major change in the way MATLAB operates on sparse matrices; see Remark 2. Finally, the three components of the Cholesky solver take substantial time.

Figure 6.

Overall performance of MILAMIN split up into the individual components for thermal and mechanical test problems with one million degrees of freedom. The timing for the matrix computation is given for the standard (S) and the optimized (O) algorithm. Note that the forward and backward (F&B) substitution timing also contains three Powell and Hestenes iterations in the case of the mechanical problem.

The time taken by the first part of the Cholesky solver, the reordering, can often be neglected for practical applications. During nonlinear material and time step iterations the mesh topology remains the same as long as no remeshing is performed, and the permutation vector can be reused if the SuiteSparse package is employed.

The second component of the Cholesky solver is the factorization. This step takes most of the total MILAMIN execution time. However, the efficiency achieved by CHOLMOD is close to the optimal CPU performance. For further optimization one could consider other types of solvers, such as iterative ones. Yet, preconditioned iterative methods or algebraic multigrid are less robust (especially for large material contrasts as targeted here) and perform better only for large systems; see section 6.4. These methods are the only option in the case of most three-dimensional problems, because the scaling of factorization time and memory requirements for direct solvers is much worse than in two dimensions. However, for two-dimensional problems direct solvers are the best choice for resolutions on the order of one million degrees of freedom, especially for positive definite systems that can be solved with Cholesky factorizations. Moreover, it is in problems of this size where our optimizations greatly reduce the total solution time. Such numerical resolutions are often sufficient in two dimensions to solve challenging problems, and the achieved performance allows for studies with a large number of time steps.

The third part of the Cholesky solver is the forward and backward substitution, which does not contribute substantially in the case of thermal problems. For mechanical problems several Powell and Hestenes iterations are required to enforce incompressibility, each issuing a forward and back substitution call plus other computations. The time spent on the Powell and Hestenes iterations is not negligible, but the strategy chosen to deal with incompressibility is clearly advantageous compared to strategies that would not allow the use of Cholesky solvers; see, for example, the results for FEMLAB using UMFPACK in section 6.4.

A final analysis of the overall speedup achieved by MILAMIN is shown in Figure 7, where we depict the ratio of total times t_standard/t_optimized for the thermal and mechanical codes. In this speedup analysis we define the total time as the sum of the time needed to compute and assemble the global matrix, apply boundary conditions, factorize and solve the system of equations, and perform the Powell and Hestenes iterations (incompressible Stokes flow). Thus mesh generation, postprocessing, and reordering, which do not need to be performed for every time step, do not enter this analysis. For our target system sizes the achieved speedups reach approximately 3 and 4 for the mechanical and thermal codes, respectively. Hence the performance gains due to the developed MILAMIN package are substantial. The scaling with respect to system size shows that the speedup decreases with an increasing number of nodes. This is due to the super-linear scaling of the direct solver, which starts to dominate the total execution time for very large systems.

Figure 7.

Achieved MILAMIN speedup for all operations that need to be performed for every time step; see text for details.

6.3. Memory Requirements

Besides CPU performance, the available memory (RAM) is the other parameter that determines the problem size that can be solved on a specific machine. The memory requirements of MILAMIN are presented in Figure 8. Within the studied range of system sizes, all data allocated during the matrix computation and assembly requires substantially less memory than the solution stage. Thus the auxiliary arrays such as K_i, K_j, and K_all are not a memory bottleneck, and it is indeed beneficial to perform the conversion to sparse format globally. Note that the amount of memory required during the factorization stage depends strongly on the reordering used. This analysis is only approximate, as the workspace of the external routines (lchol, sparse2, etc.) is not taken into account. On computers with 2 GB of RAM we are able to solve systems consisting of 1.65 and 0.65 million nodes for the thermal and mechanical problems, respectively.

Figure 8.

Memory requirements of the thermal and mechanical versions of MILAMIN.

6.4. Comparison to Other Software

In this section we compare MILAMIN to different available commercial and free software solving similar test problems. Table 1 presents run times for a thermal problem with ∼1 million degrees of freedom. The model setup consists of a box with a circular hole (zero flux) and a circular inclusion of ten times higher conductivity than the matrix. The outer boundaries are set to Dirichlet conditions representing a linearly varying temperature field.

Table 1. Performance Results for Different Software Packages for the Thermal Problem^a

Software               Matrix Computation and Assembly (s)   Solve (s)   Solver Type
ABAQUS, T2             80                                    260         proprietary
FEAPpv, Fortran, T2    7                                     712         PCG
OOFEM, C++, T1         36                                    400         ICCG
TOCHNOG, C/C++, T2     15                                    1711        BiCG
AFEM@matlab, T1        25                                    19          MATLAB \

^a T1 and T2 stand for linear and quadratic triangles, and Q1 and Q2 stand for linear and quadratic quadrilateral elements, respectively.

The software packages that entered the test are the commercial finite element packages ABAQUS (SIMULIA, version 6.6-1) and FEMLAB (COMSOL 3.3), the open source packages FEAPpv (O. C. Zienkiewicz and R. L. Taylor, version 2.0), OOFEM (B. Patzak, version 1.7), and TOCHNOG (D. Roddeman, version of 11 February 2001) for compiler languages, and AFEM@matlab (L. Chen and C. Zhang) and IFISS (D. J. Silvester et al., version 2.2) for MATLAB. For the solution stage we used a wide range of direct solvers, including UMFPACK (T. A. Davis), TAUCS (S. Toledo et al.), PARDISO (O. Schenk and K. Gärtner), SPOOLES (C. Ashcraft et al.), CHOLMOD (T. A. Davis), and the MATLAB backslash operator (\). We also compared different implementations of iterative solvers, such as Conjugate Gradients preconditioned with Jacobi (PCG), Symmetric Successive Over-Relaxation (SSOR-CG), Incomplete Cholesky (ICCG), and Algebraic Multigrid (AMG-CG), and a Biconjugate Gradients solver preconditioned with Jacobi (BiCG).

A number of other MATLAB-based packages are available, which, however, could not enter our table because they are simply incapable of solving the test problem within a reasonable amount of time and the available RAM. Among the MATLAB packages that entered the performance comparison, AFEM@matlab stands out with high performance. However, AFEM@matlab is specifically developed to operate with linear triangles solving the Poisson problem. This allows it to employ only one integration point, and the amount of work performed is substantially less than for isoparametric quadratic elements, although the actual number of elements is higher for the test problem with a fixed number of nodes. IFISS is another MATLAB-based package capable of solving Poisson and incompressible Navier-Stokes problems on the basis of linear and quadratic quadrilateral meshes. Despite its aim of being a vectorized code, the performance of IFISS is not optimal. This is partly due to a badly performing boundary condition implementation. The matrix computation and assembly performance of the compiler language and commercial codes is quite reasonable, with FEAPpv being the clear leader. However, none of the tested packages is as fast for the matrix computation and assembly as the optimized version of MILAMIN, and even the standard version of MILAMIN performs quite reasonably in comparison.

The analysis of the solver times confirms our previous statement that for the studied 2-D problems direct solvers (CHOLMOD, UMFPACK, TAUCS, PARDISO, SPOOLES) are the best choice, with CHOLMOD being the best in the group. Iterative solvers, even if equipped with good preconditioners like incomplete Cholesky or AMG, are not competitive with the direct solvers for the targeted problem size.

A performance comparison of MILAMIN for a mechanical test problem is given in Table 2. The domain is again a box containing a circular hole (free surface) and a circular inclusion with a ten times higher viscosity than the matrix. The outer boundaries are set to Dirichlet conditions representing pure shear deformation. The number of available packages that can solve incompressible Stokes problems with heterogeneous material is greatly reduced compared to the thermal problem. In fact, the IFISS package is not capable of dealing with heterogeneous materials, and we used here an iso-viscous model. In the case of FEMLAB we had to employ the special MEMS module, which provides an incompressible Stokes application mode. However, even with this specialized module we were unable to fit the test problem into 2 GB of RAM, and therefore the results are provided for a five times smaller problem size. MILAMIN outperforms IFISS as well as FEMLAB both in terms of matrix computation and assembly and in terms of solution time. The latter demonstrates that the iterative penalty approach chosen in MILAMIN, and the resulting possibility to use a Cholesky solver (symmetric positive definite system), is superior to the other approaches.

Table 2. Performance Results for Different Software Packages for the Mechanical Problem^a

Software                            Matrix Computation and Assembly (s)   Solve (s)   Solver Type
IFISS, Q2-P1 (5e5 DOFs)             340                                   298         MATLAB \
FEMLAB 3.3, T2+P-1 (2e5 DOFs)       7                                     66          UMFPACK
MILAMIN (opt), T2+P-1 (1e6 DOFs)    15                                    34          CHOLMOD (AMD)

^a Note the different system sizes for this test.

6.5. Applications

The power of MILAMIN to perform high-resolution calculations for heterogeneous problems is illustrated with a thermal and a mechanical application example. Figure 9 shows the heat flux through a heterogeneous rock requiring approximately one million nodes to resolve it. Figure 10 shows a mechanical application of MILAMIN. Gravity-driven incompressible Stokes flow is used to study the interaction of circular inclusions with different densities, leading to a stratification of the material; see Animation S1.

Figure 9.

Illustration of a one million node application problem modeled with MILAMIN. Steady state diffusion is solved in a heterogeneous rock with channels of high conductivity. Heat flow is imposed by a horizontal thermal gradient; i.e., T(left boundary) = 0, T(right boundary) = 1. Top and bottom boundary conditions are zero flux. (a) Conductivity distribution. (b) Flux visualized by cones and colored by magnitude, normalized by the flux in a homogeneous medium with the conductivity of the channels. The background color represents the conductivity. The triangular grid is the finite element mesh used for the computation. Note that this picture corresponds to only a small subdomain of Figure 9a (see square outline).

Figure 10.

Mechanical application example. Circular inclusions in a box subjected to a vertical gravity field. Black (heavy) and white (light) inclusions have the same density contrast with respect to the matrix. They are a hundred times more viscous than the matrix. Figures 10a and 10b show (unsmoothed) pressure perturbations, Figures 10c and 10d show the maximum shear strain rate, and Figures 10e and 10f show the magnitude of the velocity field with superposed velocity arrows (random positions). All values are normalized by the corresponding maximum value generated by a single inclusion of the same size centered in the same box. Figures 10a, 10c, and 10e show the entire domain; Figures 10b, 10d, and 10f show a zoom-in with the superposed finite element mesh according to the white square.

MILAMIN not only allows us to study the overall response of the system, but also resolves the details of the flow pattern around the heterogeneities. Note that we see none of the pressure oscillation problems that are often caused by the incompressibility constraint [e.g., Pelletier et al., 1989].

The MILAMIN strategies and package are applicable to a much broader class of problems than illustrated here. For example, transient thermal problems require only minor modifications to the thermal solver. As already mentioned, the mechanical solver is devised such that compressible and incompressible elastic problems can also be treated, simply by variable substitution. Coupled thermomechanical problems, arising for example in mantle convection, only require that the developed thermal and mechanical models are combined in the same time loop, as sketched below. This results in an unstructured, Lagrangian mantle convection solver capable of efficiently dealing with hundreds of thousands of nodes [cf. Davies et al., 2007].
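
The following schematic sketch indicates how such a coupled, Lagrangian time loop can be organized. The names thermal_solver and mechanical_solver are hypothetical stand-ins for the corresponding MILAMIN routines; trivial placeholders are used here so that the sketch is executable, and the mesh and time step are toy values.

% Schematic coupled thermomechanical time loop (hypothetical function
% names; trivial placeholders make the sketch executable).
thermal_solver    = @(GCOORD, T, dt) T;                 % placeholder for the thermal solve
mechanical_solver = @(GCOORD, T) zeros(size(GCOORD));   % placeholder for the Stokes solve
GCOORD = rand(2, 10);  T = zeros(1, 10);  dt = 1e-2;    % toy mesh coordinates and temperature
for it = 1:10
    T      = thermal_solver(GCOORD, T, dt);    % update temperature (diffusion step)
    Vel    = mechanical_solver(GCOORD, T);     % velocity from buoyancy (T-dependent density)
    GCOORD = GCOORD + dt*Vel;                  % advect the Lagrangian mesh with the flow
    % remeshing / mesh quality control would be inserted here once elements distort
end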

7. Conclusions

We have demonstrated that it is possible to write an efficient, native MATLAB implementation of the finite element method, and we have achieved the goal of setting up, processing, and postprocessing thermal and mechanical problems with one million degrees of freedom in one minute on a desktop computer.

In our standard implementation we have combined all the state-of-the-art components required in a finite element implementation. These include efficient preprocessing, fast matrix assembly, exploiting matrix symmetry for storage, and employing the best available direct solver and reordering packages. MATLAB-specific optimizations include proper memory management (preallocation of arrays) and data structures, explicit type declaration for integer arrays, and an efficient implementation of boundary conditions. In the case of the mechanical application, the chosen penalty method together with the particular element type allows us to use the efficient Cholesky factorization to solve the incompressible flow problem. The clear structure of the code serves educational purposes well. The results of our software comparison show that our standard version performs surprisingly efficiently, even compared to packages implemented in compiler languages.
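
Two of these points, explicit integer typing of index arrays and triplet-based sparse assembly, are illustrated by the following self-contained toy sketch (two triangles with random placeholder element matrices, not the MILAMIN listing). The variable names K_i, K_j, and K_all follow Table A1; the int32 indices are converted to double before the call because MATLAB's built-in sparse expects double index vectors.

% Toy sketch: int32 connectivity and triplet assembly of a global matrix.
ELEM2NODE = int32([1 2 3; 2 4 3]');          % [nnodel x nel] connectivity, stored as int32
nnod      = 4;                               % number of nodes
K_all     = rand(9, 2);                      % flattened 3x3 element matrices (placeholders)
indx_j    = repmat(1:3, 3, 1);  indx_i = indx_j';               % local (row, column) pattern
K_i       = ELEM2NODE(indx_i(:), :);         % global row indices of all element entries
K_j       = ELEM2NODE(indx_j(:), :);         % global column indices of all element entries
K = sparse(double(K_i(:)), double(K_j(:)), K_all(:), nnod, nnod);   % assemble; duplicates are summed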

Furthermore, in our optimized version we have improved the efficiency of the stiffness matrix calculations, which results in an overall execution speedup of approximately a factor of four with respect to the standard version. This has been achieved by minimizing the ratio of overhead (BLAS and MATLAB) to computation. Another priority was to avoid unnecessary data transfers and to promote cache reuse, as memory speed is a major bottleneck on current computer architectures. Particular optimizations of the matrix computation algorithm include (1) increasing the performance of the BLAS operations by interchanging loops and operating on large matrices, (2) reducing the total operation count by exploiting the symmetry of the system, and (3) facilitating cache reuse through the introduction of blocking.
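
The blocking idea can be illustrated with the following self-contained toy sketch, which computes triangle areas rather than stiffness matrices and does not reproduce the MILAMIN kernel (where the blocked arrays are [nelblo x nnodel] and the outer loop runs over integration points): the element loop is split into blocks of nelblo elements, and all elements of a block are processed at once through vectorized operations, so that the working set of a block stays cache resident.

% Toy sketch of blocking: elements are processed in blocks of nelblo,
% with all elements of a block handled simultaneously (vectorized).
nel    = 1e5;  nelblo = 1000;                       % number of elements and block size
ELEM2NODE = reshape(1:3*nel, 3, nel);               % toy connectivity (disjoint triangles)
GCOORD = rand(2, 3*nel);                            % toy nodal coordinates
Area   = zeros(nel, 1);
il     = 1;
for ib = 1:ceil(nel/nelblo)
    iu   = min(il + nelblo - 1, nel);               % last block may be smaller
    elem = ELEM2NODE(:, il:iu);
    x    = reshape(GCOORD(1, elem), 3, []);         % x coordinates, all elements in block
    y    = reshape(GCOORD(2, elem), 3, []);         % y coordinates, all elements in block
    Area(il:iu) = 0.5*abs((x(2,:)-x(1,:)).*(y(3,:)-y(1,:)) ...
                        - (x(3,:)-x(1,:)).*(y(2,:)-y(1,:)));   % vectorized over the block
    il   = iu + 1;
end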

Our implementation of the matrix computation achieves a sustained performance of 350 Mflops for any system size. Further performance improvements to this part of the code are of little relevance, since even for the smallest systems the matrix computation now takes only a fraction of the total solution time, with the solver being the bottleneck.

By paying attention to the strategies outlined in this article, the MATLAB-based MILAMIN can be used not only as a development and prototyping tool, but also as a production tool for the analysis of two-dimensional problems with millions of unknowns within minutes. The complete MILAMIN source code is available from the authors and can be downloaded as auxiliary material (see Software S1).

Appendix A.

Table A1 lists the variables used throughout the paper and in the code to facilitate its understanding. Variable names, their sizes, and short descriptions are given.

Table A1. MILAMIN Variables^a

^a Note: "aeib" stands for "all elements in block."

Variable Group | Variable | Size | Description
Variable size | ndim | 1 | number of dimensions
 | nel | 1 | number of elements
 | nnod | 1 | number of nodes
 | nnodel | 1 | number of nodes per element
 | nedof | 1 | number of thermal or velocity degrees of freedom per element
 | np | 1 | number of pressure degrees of freedom per element
 | nip | 1 | number of integration points per element
 | nelblo | 1 | number of elements per block
 | nblo | 1 | number of blocks
 | npha | 1 | number of material phases
 | nbc | 1 | number of constrained degrees of freedom
 | nfree | 1 | number of unconstrained degrees of freedom
Mesh | ELEM2NODE | [nnodel, nel] | connectivity
 | Phases | [1, nel] | phase of elements
 | GCOORD | [ndim, nnod] | global coordinates of nodes
Integration points, shape functions and their derivatives | IP_X | [ndim, nip] | local coordinates of integration points
 | IP_w | [1, nip] | weights of integration points
 | N | {nip*[nnodel, 1]} | cell array of nip entries of shape functions Ni evaluated at integration points
 | dNdu | {nip*[nnodel, ndim]} | cell array of nip entries of shape function derivatives wrt local coordinates dNdui evaluated at integration points
Geometry | ECOORD_X | [ndim, nnodel] | global coordinates of nodes in element
 | J | [ndim, ndim] | Jacobian in integration point
 | invJ | [ndim, ndim] | inverse of Jacobian
 | detJ | 1 or [nelblo, 1] | determinant of Jacobian (or aeib)
 | dNdX | [nnodel, ndim] | shape function derivatives wrt global coordinates in integration point
 | ECOORD_x, ECOORD_y | [nnodel, nelblo] | global x and y coordinates of nodes (aeib)
 | Jx, Jy | [nelblo, ndim] | first (x) and second (y) row of Jacobian in integration point (aeib)
 | invJx, invJy | [nelblo, ndim] | first (x) and second (y) column of inverse of Jacobian (aeib)
 | dNdx, dNdy | [nelblo, nnodel] | shape function derivatives wrt global x and y coordinates (aeib)
Auxiliary arrays | indx_l | [nedof*(nedof+1)/2, 1] | indices extracting lower part of element matrix
Boundary conditions | Free | [1, nfree] | unconstrained degrees of freedom
 | Bc_ind | [1, nbc] | constrained degrees of freedom
 | Bc_val | [1, nbc] | constrained boundary values
Solution | perm | [1, nfree] | permutation vector reducing factorization fill-in
 | L | [nfree, nfree] | sparse lower Cholesky factor of global stiffness matrix
 | Rhs | [nfree, 1] | global right-hand-side vector
Materials | D | [npha, 1] | conductivities for different phases
 | ED | 1 or [nelblo, 1] | conductivity of element (or aeib)
Matrix calculations | K_elem | [nnodel, nnodel] | element stiffness matrix
 | K_block | [nelblo, nnodel*(nnodel+1)/2] | flattened element stiffness matrices (aeib)
Triplet storage | K_i | [nnodel*(nnodel+1)/2, nel] | row indices of triplet sparse format for K_all
 | K_j | [nnodel*(nnodel+1)/2, nel] | column indices of triplet sparse format for K_all
 | K_all | [nnodel*(nnodel+1)/2, nel] | flattened element stiffness matrices for all elements
Solution stage | K | [nfree, nfree] | sparse global stiffness matrix (only lower part)
 | T | [nnod, 1] | unknown temperature vector
Materials | Mu, Rho | [npha, 1] | viscosity and density for different phases
 | EMu, ERho | 1 or [nelblo, 1] | viscosity and density of element (or aeib)
Matrix calculations | Pi | [np, 1] | pressure shape functions in integration point
 | P | [np, np] | auxiliary matrix containing global coordinates of the corner nodes
 | Pb | [np, 1] | auxiliary vector containing global coordinates of integration point
 | B | [nedof, ndim*(ndim+1)/2] | kinematic matrix
 | A_elem | [nedof, nedof] | element stiffness matrix (velocity part)
 | Q_elem | [np, nedof] | element divergence matrix
 | M_elem | [np, np] | element pressure mass matrix
 | invM_elem | [np, np] | inverse of element pressure mass matrix
 | Rhs_elem | [ndim, nedof] | element right-hand-side vector
 | PF | 1 | penalty factor
 | GIP_x, GIP_y | [1, nelblo] | global x and y coordinates of integration point (aeib)
 | Pi_block | [nelblo, np] | pressure shape functions in integration point (aeib)
 | A_block | [nelblo, nedof*(nedof+1)/2] | flattened element stiffness matrices (aeib)
 | Q_block | [nelblo, nedof*np] | flattened element divergence matrices (aeib)
 | M_block | [nelblo, np*(np+1)/2] | flattened element pressure mass matrices (aeib)
 | invM_block | [nelblo, np*np] | flattened inverses of element pressure mass matrices (aeib)
 | Rhs_block | [nelblo, nedof] | element right-hand-side vectors (aeib)
Triplet storage | Rhs_all | [nedof, nel] | element right-hand-side vectors for all elements
 | A_i | [nedof*(nedof+1)/2, nel] | row indices of triplet sparse format for A_all
 | A_j | [nedof*(nedof+1)/2, nel] | column indices of triplet sparse format for A_all
 | A_all | [nedof*(nedof+1)/2, nel] | flattened element stiffness matrices for all elements
 | Q_i | [nedof*np, nel] | row indices of triplet sparse format for Q_all
 | Q_j | [nedof*np, nel] | column indices of triplet sparse format for Q_all
 | Q_all | [nedof*np, nel] | flattened element divergence matrices for all elements
 | invM_i | [np*np, nel] | row indices of triplet sparse format for invM_all
 | invM_j | [np*np, nel] | column indices of triplet sparse format for invM_all
 | invM_all | [np*np, nel] | flattened inverses of element pressure mass matrices for all elements
Solution stage | A | [nfree, nfree] | sparse global stiffness matrix (only lower part)
 | Q | [np*nel, ndim*nnod] | sparse divergence matrix
 | invM | [np*nel, np*nel] | sparse pressure mass matrix
 | Div | [nel*np, 1] | quasi-divergence vector
 | Vel | [ndim*nnod, 1] | unknown velocity vector
 | Pressure | [nel*np, 1] | unknown pressure vector


Acknowledgments

This work was supported by the Norwegian Research Council through a Centre of Excellence grant to PGP. We would like to thank Tim Davis, the author of the SuiteSparse package, for making this large suite of tools available and giving us helpful comments. We would also like to thank J. R. Shewchuk for making the mesh generator Triangle freely available. We are grateful to Antje Keller for her help regarding code benchmarking. We thank Galen Gisler for improving the English. The manuscript benefited from the reviews by Boris Kaus and Eh. Tan and the editorial work of Peter van Keken. Finally, we would like to thank Yuri Podladchikov for his never-ending enthusiasm and stimulation.