Mixed‐Precision for Linear Solvers in Global Geophysical Flows

Abstract Semi‐implicit (SI) time‐stepping schemes for atmosphere and ocean models require elliptic solvers that work efficiently on modern supercomputers. This paper reports our study of the potential computational savings when using mixed precision arithmetic in the elliptic solvers. Precision levels as low as half (16 bits) are used and a detailed evaluation of the impact of reduced precision on the solver convergence and the solution quality is performed. This study is conducted in the context of a novel SI shallow‐water model on the sphere, purposely designed to mimic numerical intricacies of modern all‐scale weather and climate (W&C) models. The governing algorithm of the shallow‐water model is based on the non‐oscillatory MPDATA methods for geophysical flows, whereas the resulting elliptic problem employs a strongly preconditioned non‐symmetric Krylov‐subspace Generalized Conjugated‐Residual (GCR) solver, proven in advanced atmospheric applications. The classical longitude/latitude grid is deliberately chosen to retain the stiffness of global W&C models. The analysis of the precision reduction is done on a software level, using an emulator, whereas the performance is measured on actual reduced precision hardware. The reduced‐precision experiments are conducted for established dynamical‐core test‐cases, like the Rossby‐Haurwitz wavenumber 4 and a zonal orographic flow. The study shows that selected key components of the elliptic solver, most prominently the preconditioning and the application of the linear operator, can be performed at the level of half precision. For these components, the use of half precision is found to yield a speed‐up of a factor 4 compared to double precision for a wide range of problem sizes.


Introduction
In the simulation of complex multi-scale flows arising in weather and climate (W&C), one of the biggest challenges is to satisfy strict service requirements in terms of time-to-solution and to satisfy budgetary constraints in terms of energy-to-solution, without compromising the accuracy and stability of the application [1].One way to tackle this challenge is to use reduced-precision arithmetic in W&C models.The interest with reduced or mixed precision has already some history in scientific computation [2,3,4,5].However, it is relatively new in W&C modelling.This effort started with [6] and [7] where reduced precision was motivated by the large degree of uncertainty that is present in W&C models due to their nonlinear dynamics and high level of complexity which makes it difficult to justify high numerical precision.The study of mixed precision in numerical modelling is timely, since the recent boom of machine learning methods is pushing developments of supercomputing hardware towards very efficient dense linear algebra performed at low precision.This hardware is specialised for the use in deep learning where half precision arithmetic-or even less-is often sufficient; see for example the Tensor Processing Unit (TPU) by Google.
A large part of W&C models is used to represent subgrid scale processese.g., turbulence and cloud physics-that carry a large degree of inherent uncertainty [8].In principle one can thus argue that representation of such processes is the prime candidate for a precision reduction.The other dominant part of W&C models is the dynamical core (a.k.a dycore) which provides discrete solutions to the Navier-Stokes equations.Although there is little uncertainty about the equations per se, their numerical solution procedures introduce model error that add uncertainty to the dynamics.Early work on thorough analysis to identify the minimal level of numerical precision that can be used in different parts of dycores has focused on spectral model formulations [9,10].Similar work on grid-point model formulations has only just begun with the use of single precision arithmetic [11,12,13] with 32 bits per variable.
In efficient W&C dycores the semi-implicit (SI) timestepping is typically used for the atmospheric [14] as well as the oceanic [15,16,17] component.Its key virtue is extended computational stability with respect to acoustic, buoyant and rotational modes of motion allowing for the use of relatively long time steps.On the other hand, the approach results in an intricate linear problem and complicated elliptic boundary value problem for pressure (viz.Schur complement).In particular, the SI approach is at the heart of the newly-developed Finite-Volume Module (FVM) of the Integrated Forecasting System (IFS) at the European Centre for Medium-Range Weather Forecasts (ECMWF) [18,19,20,21].The FVM provides an alternative dynamical core to the spectral transform based IFS.The two essential ingredients of the FVM timestepping is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA) approach [22,23,24] and the bespoke preconditioned nonsymmetric Krylov-subspace elliptic solver [21] built on the Generalized Conjugated-Residual (GCR) approach of [25].While MPDATA controls the explicit part of the timestepping, GCR inverts the Schur complement of the linear problem.
Undeniably, elliptic solvers are computationally demanding and substantial development has been invested into making them as efficient as possible for W&C models on modern supercomputers [26,27,21].To efficiently solve an elliptic problem posed in a thin spherical shell (such as the global atmosphere) ultimately requires matrix inversion-if not in the main solver than at least in its preconditioner.While there is the common opinion that high precision is required for this purpose, there are theoretical studies [2,3,28,29,30] suggesting that mixed-precision approaches could be exploited, and there are already some applications of mixed-precision elliptic solvers in CFD [31,32,4] and also in W&C [13].Although these approaches may differ in choice of algorithms and applications, a common theme emerges that the preconditioning step may be a good choice for reducing precision, where some work goes as low as half precision arithmetic (16 bits per variable) representing each real number with only 16 bits (as a floating point number) and decimal precision reduced to three digits.
The aim of this paper is to establish the limits of precision for the specific combination of characteristics that make SI W&C models unique, while its objective is to perform an in-depth analysis of the behavior of mixed-precision elliptic solvers in such an environment.The elliptic problems encountered in SI W&C models are non-symmetric, and are typically solved using iterative solvers due to the enormous problem size.We solve a fluid dynamics problem on the sphere, which comes with grid irregularities, giving rise to a specific structure of the involved linear operator.In effect, a standard classification of the problem based solely on the operator condition number does not give full justice to the complexity of the problem; cf.[20,21].Moreover, the elliptic problem is part of an entire SI timestepping scheme and as such is in a feedback loop with the non-linear dynamics of the model.Consequently, we solve much more at each time step than a sole elliptic problem.Different solver solutions, even obtained to the same accuracy in terms of the residual errors, can lead to vastly different behavior and model error growth throughout a simulation.For these reasons, our mixed-precision elliptic solver needs to be tailored to the specific application at hand and thoroughly tested in this specific W&C environment.
Herein, we study the impact of mixed-precision in the elliptic solver for W&C models by investigating a representative SI shallow-water model on the sphere and using the MPDATA approach.With these choices, we stay conceptionally close to the ideas employed in the IFS-FVM.The latter is formulated in the latitude-longitude (lat-lon) coordinate framework, but circumvents polar grid singularity via a quasi-uniform unstructured mesh discretisation, whereby the essential problem stiffness comes from the thin vertical dimension.The shallow-water model of this paper uses a regular lat-lon grid subject to polar singularity [33], but retains the aspect of the problem stiffness which is coming from the longitudinal dimension.Given the stiffness and grid singularity that are already prone to creating model errors in double precision arithmetic (64 bits per variable), we carefully study what happens to solutions in polar regions under reduced precision.We test our mixed-precision solver on test-cases from the well-known test-suite for shallow-water dycores [34].The first testcase is the standard problem of the Rossby-Haurwitz Wave with wave number 4. The second test-case paraphrases the standard zonal flow over an isolated mountain by replacing the idealised smooth hill with the natural orography of the Earth.
In this paper, whenever we refer to mixed-precision, we mean the use of different precision levels within the same simulation.This can either refer to the use of single and double precision as available on modern computers, or the use of half, single and double precision.While half precision is available on machine learning accelerators on modern supercomputing hardware, it is currently not yet availble for use in most high-level computer languages.An important milestone in this regard is that Fugaku, the current number one of the TOP500 list of supercomputers, allows for the use of half precision arithmetic from Fortran.For this paper, half precision will be emulated in software only.This enables to easily explore different precision levels without having to purchase and use specialised hardware that would also require a time-consuming rewrite of the model code.
The paper is structured as follows.In Section 2, we describe the continuous as well as discretised model and introduce our elliptic solver and the preconditioner.Section 3 presents the test-cases, the model setup and a description of the reduced-precision experiment suite.In section 4 we show and discuss reduced-precision results.Section 5 concludes the paper.

The Governing Equations
With our choice of model, we aim to stay conceptionally as close as possible to the approach used by other MPDATA based fluid models [35,23,36] and consequently the newly developed IFS-FVM atmospheric model [18,20,21].Thus, we chose to use the shallow-water equations (SWE) on the sphere written in the form of generalized transport equations [37,23]: where Φ is the fluid thickness, ∇ = (∂/∂x, ∂/∂y), Q x = Φ ẋh x and Q y = Φ ẏh y denote the momenta in x (longitudinal) and y (latitudinal) directions, while G = h x h y is the Jacobian of the geospherical framework with h x , h y being the metric coefficients of the general orthogonal coordinates.Here, h x = a cos(φ), h y = a for a lat-lon grid with Earth's radius a, so longitude λ = x and latitude φ = y.The advective velocity v relates to the momentum cf. [37,23] for discussions.The right-hand sides of the equations ( 2) and (3) symbolise the forcing terms R x and R y for the momenta, such that with specific terms, from left to right, representing the pressure gradient, the Coriolis force, the metric forces (viz.Christoffel symbols of the geospherical framework) and the relaxation forcings attenuating the solution to some hypothetical "true" state Q with the inverse time scale α.H 0 is the topography, g the gravitational acceleration and f the Coriolis parameter.

The Discretised Model
The below-summarised procedure of formulating an implicit linear problem for dependent model variables, together with its associated Schur complement (i.e.elliptic Helmholtz equation) is a special case of a substantially more intricate procedure documented recently for all-scale atmospheric Euler equations in [20], whereto the interested reader is referred for further details.
The governing Equations ( 1)-( 3) are discretised in a semi-implicit fashion, using a collocated lat-lon grid with its well-known pole problem [33].In a nutshell, momentum equations are integrated with the trapezoidal rule where n and i index discrete points (t n , x i ) separated with uniform intervals ∆t and (∆λ, ∆φ), respectively, in time and the two spatial directions.The first term on the rhs of ( 7) denotes a second-order MPDATA operator-with v n+0.5 symbolising an O(∆t 2 ) explicit predictor of advective velocity at the intermediate t n + 0.5∆t time level; cf.§3.4 and §4.1 in [37]-and it forms the explicit part of the solution.The second term is unknown, as R n+1 depends (nonlinearly) both on Φ n+1 and Q n+1 .There are several alternative approaches to assure an easily invertable linear problem, involving outer iterations or explicit predictors, or both [18,19,20,21].As our goal is to assess the role of mixed precision on the elliptic solver performance, we here select an ad hock approach that complicates (rather than simplifies) the target elliptic solver.Specifically, the pressure gradient term is approximated as where Φ denotes the explicit second-order-accurate predictor of Φ n+1 , generated by integrating the mass-continuity equation (1) with MPDATA.The Coriolis and relaxation terms are implicit at t n+1 , whereas the nonlinear metric forces-generally acting on a much longer time scale than the scale associated with the propagation of external gravity waves-are predicted explicitly at n + 1 by linear extrapolation from n and n − 1 time levels. 1 Thanks to the collocated grid, the resulting linear problem is inverted analytically, forming the closed-form expressions where Q, A and B denote, respectively, the modified explicit part of the problem, and the 2 × 2 matrix and vector of known coefficient fields.Implementing the expressions (9) in the trapezoidal integral of (1), leads to the elliptic Helmholtz problem for Φ n+1 , which can be symbolically written as hereinafter, spatial and temporal grid indices are dropped as there is no ambiguity.The linear operator L is negative definite but not self-adjoint.It takes the form of a generalized Laplacian where the coefficient fields are straightforward modifications of those appearing in (9), and M = 2.The solution of (10) provides the updated Φ-upon which Q n+1 is recovered from (9) and then v n+1 from (4)-thus allowing to complete (7) for calculations with large ∆t, limited only by the advective CFL condition.

The Elliptic Solver
The elliptic problem (10) is solved using the preconditioned generalized conjugate residual (GCR) approach [38,25].GCR is a Krylov sub-space method that assures monotone convergence of the solver iterations for a non-self-adjont operator L (i.e., "non-symmetric" in matrix representation) by minimising the L 2 norm rr (where .. marks the domain integral) of the residual error here ν numbers the iterations.In essence, GCR cleverly re-optimises at each iteration all coefficients in the series of the current and past successive iterations up to some arbitrary specified number, say k − 1; cf.[39] for a discussion.When after some number of iterations N , L 2 (r) reaches a desired small error tolerance, the solution of this iterative process Φ N is set to be Φ n+1 := Φ N .Figure 1 describes our implementation of the preconditioned GCR algorithm.
Following the step 2 of the algorithm-the update of the depth Φ and the minimised residual error-there is an exit condition.While a variety of exit conditions can be considered [40], we found that demanding the uniform reduction of the initial residual error, assures the stability of the semi-implicit model integrator while maintaining  the solution quality throughout a range of spatial resolutions and physical scenarios.In practice, the choice of is verified experimentally.Based on our overall experience with W&C models, in this paper = 10 −5 is assumed for all experiments.We also enforce at least one full GCR(3) cycle regardless of (13).Further discussion and justification for this choice of is found in §4.1.
The preconditioner P in step 3 approximates L indirectly, by estimating the solution error q = P −1 (r) ≈ e = L −1 (r) by means of a stationary iterationindexed with µ, such that q µ=0 = 0-best characterised as a semi-implicit Richardson scheme [38].Specifically, the operator P is split in two parts.The first part combines the second-order zonal derivative term and the Helmholtz term, P Z − P H ; whereas the second part, marked as P M for its meridional predominance, is the reminder of L and the first part.In the semi-implicit Richardson scheme, the first part is taken at the µ + 1 iteration while P M is lagged behind.This results in a tridiagonal problem where I denotes the identity operator, and η can be interpreted as a pseudotimestep, determined from linear stability theory for the P M operator.For further reference and illustration, we implemented four preconditioning options in the semi-implicit shallow-water model.Options 0 to 2 are elementary, in that: option 0 corresponds to P = I, viz.no preconditioning; option 1 is for P = D, where D symbolises the diagonal part of L in (10); option 2 is a diagonally preconditioned explicit Richardson iteration for full L; and option 3 is the semi-implicit Richardson scheme in (14).Taking the number of GCR iterations with option 0 as a reference, the subsequent options reduce the solver's iteration count by about 10, 40 and 98 percent, respectively.In the experiments reported in this paper, we use solely option 3 with two Richardson iterations that comprise two tridiagonal inversions and one evaluation of P M .We found that this combination yields the best overall performance.
3 Selected test-cases and experimental setup

Preamble
To assess the impact of mixed precision on the performance of semi-implicit W&C models, we exploit two well-established benchmarks for testing and evaluating shallow-water surrogates of dynamical cores for global weather and climate, namely the Rossby-Haurwitz wave with wavenumber four (RHW4) and a zonal flow (Q x /Φ ∝ cos φ, Q y = 0) past the Earth orography.The Rossby-Haurwitz waves are analytic solutions of the nondivergent nonlinear barotropic vorticity equation on the sphere, and as such propagate zonally without change of shape [41].However, Rossby-Haurwitz waves are incompatible [42] with shallow-water equations (SWE), and their SWE simulations lead to unstable solutions, given sufficiently short zonal wavelength and sufficient amplitude [43].In particular when simulated with SWE, the form of RHW4 changes a little over 24 days, while twice shorter Rossby-Haurwitz waves break completely within a week [43].Because of this marginal stability [44], RHW4 is a convenient vehicle to test capability of numerical methods for maintaining subtle nonlinear balance of wave form solutions over an extended integration time.The RHW4 simulation with SWE was included in the suite of tests proposed by Williamson et al. [34], and has become a prominent benchmark in the field.
As the RHW4 problem is particularly smooth, with the analytic solution con-fiened to a few spherical harmonics [44], we complement it with a multi-scale variant of the classical problem of a geostrophically balanced zonal flow past a hill centered in mid latitudes.The original problem has been studied by Grose and Hoskins [45].It is characterised by small Rossby and Froude numbershere R o = U/Lf and F r = U/ √ gH, with U , L and H denoting the characteristic flow speed, horizontal scale of the hill and the mean height of the shallow-water surface, respectively-whereby it is well explained by the linear theory [45].This problem, also proposed by Williamson et al. [34] for evaluating the efficacy of numerical methods for global-scale dynamics, has become a benchmark in the field-see e.g.[46] for its 3D nonhydrostatic extension.
Here we retain the benchmark setup as proposed in [34], except for replacing the smooth localised hill with Earth's orography.This effects in numerical solutions with broad spectra of scales and, effectively, a stiffer elliptic problem.
To quantify the impact of the reduced precision on the solver convergence and solutions quality, we adapt the L 2 and L ∞ error norms defined in [34] as normalised deviations from a "genuine" solution.For the RHW4 test-case, the genuine solution is the analytic solution Ψ 0 (x − Vt, y), where Ψ 0 = Ψ(x, y) represents the initial condition for the velocity and depth fields-given in eqs.( 143), ( 144) and ( 146) of [34]-and V is the propagation speed of the analytic Rossby-Haurwitz wave (eq.142 in [34]).As the analytic wave does not satisfy the shallow-water equations ( 1)-( 3), the deviations of numerical solutions from Ψ 0 (x − Vt, y) are expected to grow in time.However, the comparison provides a meaningful gauge for assessing the impact of a reduction in precision.In the orographic flow test-case the genuine solution is specified as a geostrophically balanced zonal flow, steady in the absence of the orography.For both benchmarks, the calculations addressing reduced precisions are conducted on the three regular longitude-latitude grids, with respective resolutions N X × N Y = 128 × 64, 256 × 128 and 512 × 256.

Highlights of RHW4 simulations
Figure 2 shows the analytic initial condition and the numerical results on the 512 × 256 grid after one and two periods of the analytic RHW4 when the analytic solution exactly reproduces the initial condition.The departures from the analytic result are clearly seen.After one period, the numerical RHW4 is steeper and somewhat retarded compared to the analytic result.After the two periods, the numerical results is less steep but the phase retardation is more pronounced.The lower resolution results (not shown) look similar, which is not surprising as the RHW4 is well resolved (with 32 grid intervals per wavelength) even on the coarsest grid; cf. Figure 2 in [37] that shows a coarse result after 5 days.As expected (see figure 3 in [44]), the current solution is visibly stable, which however should not be taken for granted, as irregularities  To appreciate the strenuosity of the illustrated calculations, consider that the celerity of the external mode ( √ gH o = 280 m/s) is comparable to the speed of sound, while the zonal length interval of the physical grid in the rings of grid points adjacent to the poles is δx = 7675, 1919 and 480 m for the three considered regular grids.The explicit solutions, like those discussed in [37,23], require then temporal intervals ∆t ∼ 40, 10 and 2.5 s, respectively, whereas calculations reported here were conducted with ∆t = 800, 400, and 200 s, respectively.The attained linear rather than quadratic reduction of ∆t with the reciprocal of N Y is due to the specificity of the flow decaying towards the poles.Although the calculations employed weak polar absorbers-with the inverse time scale α increasing linearly from zero at the angular distance φ ≥ 3π/64 away from the poles to 1/(200∆t) at the poles-essentially the same solution can be obtained without any absorbers.However, even weak absorbers can improve the solver convergence by improving diagonal dominance of the linear operator (11).For example, the presented results using weak polar absorbers α = 1/(200∆t) have a 30% reduced Euclidean norm of the residual r N 2 on average when exiting the GCR algorithm (on average 4 iterations) compared to using no polar absorbers (α = 0 everywhere).The calculations illustrated in Figs. 2 and 3, evinced the maximal Courant numbers 118 and 0.31, correspondingly with respect to the celerity of the external mode and the flow speed.In analogy to the RHW4 case, figure 4 shows the numerical results on the 512 × 256 grid after 14.76 days, likewise simulated with ∆t = 200 s.Patternwise, the departures from the initial zonal flows are dramatic, especially in terms of energy and direction of local flows.Quantitatively, the free-surface height perturbations (viz.pressure perturbations) are about 3% and 12% in terms of L 2 and L ∞ norms; i.e., consistent with real weather [20].For reference, the same measures for the RHW4 simulation are of comparable size, with 3% and 7% respectively.In terms of the kinetic energy of the velocity perturbations, the corresponding L 2 and L ∞ norms are 153% and 762% (sic), with the latter reflecting both the direction (say easterly or northerly) and the high speed of local flows (∼ 50 m/s) emerging in the high latitudes of the northern hemisphere.The respective values for the RHW4 case are much smaller in comparison, with 37% and 35%.The maximal Courant numbers with respect to the celerity of the external mode is 116, which is about the same as in the RHW4 case.The maximal Courant numbers with respect to flow velocity is 0.71, which is more than doubled compared to the RHW4 case.Nonetheless, the overall performance of the elliptic solver was comparable at 4 GCR iterations per time step.The essential difference between the orographic flow simulation and the RHW4 setups is in the importance of the absorbers, its inverse time scale here taken as α = 1/(2∆t) consistently with the theoretical considerations [33].

Reduced-precision Experiments
The precision reduction is achieved in three distinct steps.In the first step, precision is reduced from double to single precision (23 significand bits) for the entire shallow-water model.The results of these experiments are presented in §4.1.Based on the experience from the tests with the single precision shallowwater model, a mixed-precision shallow-water model is described in §4.2, that behaves similarly to the double precision reference model with respect to the three criteria outlined later in this section.As single precision arithmetic is easily available on modern CPUs, these reduced-precision experiments can be performed on standard hardware.In the third step, a mixed-precision model is presented in §4.3 that is additionally using half precision for most parts of the elliptic solver.Half precision is emulated with the reduced-precision emulator (rpe v5 library [47]).Using the emulator enables us to explore a precision reduction beyond single precision with model code that was developed for standard computer hardware.
While we emulate the 10 significand bits as in half precision, we keep the number of exponent bits at 11-the double precision standard-which means that the dynamical range of representable numbers remains high.The results should still be representative for the use of half precision as problems with the dynamical range can often be mitigated via rescaling of variables, see for instance [48], especially since the elliptic problem is linear by design.
The precision levels for the mixed-precision elliptic solver are identified as the minimum precision required for each step of the algorithm shown in figure 1.
Because the elliptic solver is part of the timestepping scheme of a non-linear model, it is not obvious how this could be achieved via analytical means.Thus, we adopt an experimental approach with the precision being reduced to levels where the elliptic solver still yields acceptable results.The elliptic solver's performance is studied for each of the two test-cases defined earlier.This covers a wide range of different dynamics, from linear responses up to turbulent regimes with a multiplicity of scales.
We consider three criteria to measure the performance and viability of the elliptic solver when precision is reduced: i) the solver's convergence rate; ii) whether the required solver accuracy was met; and iii) the change in normalised deviations from the respective "genuine" solution in the L 2 and L ∞ error norms.Criterion (i) aims at keeping the solver's convergence rate unchanged compared to the reference solver.A decrease in the solver's convergence rate, and thus an increase in the expected solver iterations, would be acceptable if each iteration is sufficiently cheap when using reduced precision.
Unfortunately, since we do not use real hardware for half precision simulations, we cannot make reliable estimates of the savings.It is thus reasonable to aim for the same number of solver iterations in the mixed-precision solver.Criterion (ii) is straightforward.As soon as the precision reduction is introduced into parts of the solver, there is a risk that the residual errors become corrupted with rounding errors.It is therefore important to show that the elliptic solver is actually capable of converging to the required accuracy.Criterion (iii) is more subtle.The solver accuracy is chosen based on the additional error in the model solution of a double precision reference run, defined as the relative error between the accuracy measures with respect to genuine solutions specified in the last paragraph of §3.1.Given that the residual-error fields of a double precision and a mixed-precision elliptic solver will be different-e.g., due to a different spectral composition-even if the required solver accuracy was reached, the pattern and magnitude of the additional discretisation error might change as well.

Results
In this paper, the focus is primarily on the 512 × 256 resolution, the highest of the three resolutions, as it shows the strongest grid convergence at the poles and is thus expected to be the most challenging test-case for reduced precision.
To evaluate criterion (i) and criterion (ii), the discussed residual error fields are calculated by using definition (12), in double precision arithmetic, applied to the updated fluid thickness field Φ ν .We do not use the minimized residual error field r ν from step 2 of the GCR algorithm in figure 1 as the values of r ν may already be corrupted by rounding errors.
Given these choices, the convergence plots that are an integral part of the following discussion are here introduced in detail to avoid ambiguity.In the convergence plots, the development of the residual error field in the Euclidean norm is shown over successive solver iterations.We number the solver iterations as follows: iteration 0 denotes the initial residual error field L (Φ 0 ) − R, iteration 1 to 3 denote the residual error fields L (Φ ν+1 ) − R, ν = 0, 1, 2 of the first cycle of the GCR(3) algorithm.Following this convention, iteration 4 to 6 denote the fields L (Φ ν+1 ) − R, ν = 0, 1, 2 of the second cycle of the GCR(3) algorithm.To highlight the residual error field's evolution over successive solver iterations, values stemming from different iterations of the same call to the elliptic solver are connected by lines.

Single Precision Shallow-Water Model
In double precision arithmetic, the solver is converging steadyly throughout all solver iterations, see figures 5 a) and b).The initial residual is within the range of 10 3 to 10 4 for both test-cases.The solver requires 4 solver iterations for both test-cases which is also the minimal iteration number since we enforce for at least one full cycle of GCR(3).After these 4 iterations, the initial residual is reduced by almost 8 orders of magnitude for the RHW4 and about 6 orders of magnitude for the orographic flow.
To get a better intuition for the actual size of the corresponding impact to fluid thickness, and how these increments vary over successive solver iterations, statistics of fluid thickness increments are shown in table 1.For both test-cases, fluid thickness increments drastically reduce in magnitude over successive solver iterations by 5-6 orders of magnitudes.For the RHW4, this process happens more quickly, consistent with the slightly higher convergence rate for this test-case.
Given these results, the significance of using = 10 −5 should be discussed.For the orographic flow test-case, the fluid thickness increments can still be up to several centimeters in magnitude until the fourth solver iteration.Errors of such magnitude within a single time-step are unacceptable.For the RHW4, the strongly drecreased fluid thickness increments after the first solver iteration suggest that it would be sufficient to only perform one single solver iteration, or at most two.However, experiments with an elliptic solver restricted to only one or two solver iterations respectively lead to model crashes due to exponentially growing instabilities, in the case of a single solver iteration even within the first twelve hours of simulation time.
In comparison to the double precision convergence behaviour, single precision solver convergence rates change significantly, see figures 5 c) and d).Concerning criterion (i) and (ii), for both test-cases the initial residual is two orders of magnitudes larger and only reduced by three orders of magnitude by the elliptic solver.The solver converges more slowly, seemingly plateauing after the first solver iteration.The single precision solver is typically performing five instead of four solver iterations for double precision for RHW4.
While there are differences in the convergence, concerning criterion (iii) the solution quality is still good.The differences in the normalised deviations from the respective "genuine" model solution in the L 2 and L ∞ norms are insignificant compared to the double precision solution.The largest difference is found for the L ∞ -norm of kinetic energy perturbations for the orographic flow with a value of 758% for the single precision shallow-water model, a relative change of 5 • 10 −3 compared to the double precision solution.
Further elaborating on this point, the above-discussed kinetic energy perturbations as well as their change when going from double to single precision are shown in figures 6 a)-f).The square-root of the deviations can be up to 50 m s for both test-cases.For RHW4 the departures are centered around the wave crests and differences are symmetric between both hemispheres.When going from double to single precision, the differences in the deviations are shown over a range of 4 orders of magnitude from 10 −2 to 10 2 .The difference is at least two orders of magnitude smaller than the values of the deviations.These differences can be considered insignificant, especially when considering how sensitive the test-case is towards perturbations due to its inherent unstable character.
For the orographic flow, the deviations from the genuine solution show the locations and magnitude of synoptic-scale eddies, as well as regions with high mountains, such as Himalaya, where fluid thickness is small compared to orography and orography gradients are large making the associated flow exhibit strongly non-linear behaviour.The most significant differences between single and double precision are centered around the Himalaya and its downstream area.The differences are about one order of magnitude smaller than the deviations to the genuine solution.
To further put the change in kinetic energy deviations into perspective, we perform an additional simulation in double precision that is run from randomly perturbed initial conditions for fluid thickness (adding white noise with an amplitude of five centimeters).These perturbations represent the inherent uncertainty of the system and inevitable errors in the initial conditions for simulations of W&C.It is found that the changes in kinetic energy depar-tures between the perturbed and unperturbed double precision simulations are comparable to the differences between the simulations in single and double precision.Thus, the change in departures when going to single precision will probably be insignificant when compared to the uncertainty range typically encountered in W&C prediction and can therefore be considered as negligible.
In summary, although the solver of the single precision model variant behaves differently, the model still produces valid solutions for our test-cases.The fact that the shallow-water model is crashing if only a single solver iteration is performed in the elliptic solver seems to disagree with the result that the single precision model is producing valid solutions.This is because, although the single precision model uses more than one solver iteration, the convergence results in figures 5 c) and d) are showing no significant reduction of the residual error after the first solver iteration.Still, the additional iterations seem to improve the solution sufficiently to prevent model crashes.Therefore, the popular and widespread use of the L 2 -norm of the residual error field to measure the progress of the elliptic solver seems to have reached its limits for the reduced precision simulation.Further tests with the L ∞ -norm as exit condition yield the same result.This clearly indicates that, when using reduced precision in the elliptic solver, extra care needs to be taken when analyzing elliptic solver convergence and also when selecting a suitable-physically-relevant for the problem at hand-exit condition.
To understand why the single and double precision solvers show a different convergence behaviours, we first take a look at the fluid thickness variable Φ.The fluid thickness variable is stored in single precision arithmetic and its magnitude is on the order of 10 4 to 10 5 meters.In comparison, a typical fluid thickness increment at each time-step is on the order of (see again table 1) 3 • 10 −5 m to 5 • 10 −7 m for the last solver iteration.However, the relative error of single precision truncation is given by the machine epsilon ≈ 1.19 • 10 −7 , indicating that thickness increments smaller than 10 −2 m to 10 −3 m cannot reliably be represented in the fluid thickness field Φ.This renders most of the later solver increment field unused, even if the single precision solver were to perform as good as the double precision elliptic solver.
This gives an indication that the single precision solver variant might actually be performing better than figures 5 c) and d) indicate.This can be investigated by additionally introducing a double precision copy of the initial fluid thickness Φ 0 into the single precision elliptic solver and calculating the residual error fields from this double precision variable.The double precision copy of Φ 0 is continuously kept updated by the single precision solver's fluid thickness increments after each solver iteration.The convergence rates derived in this way, see figure 7, look drastically different and much closer to the convergence rates of the double precision variant of the elliptic solver from figures 5 a) and b).The initial residual is still larger when compared to the double precision simulations, but throughout the solver iterations the initial residual error is now reduced by 5 to 6 orders of magnitude.This shows that the single precision solver variant actually converges well even after the first solver iteration, the solver increments are just not reliably being added to the fluid thickness field Φ.
To understand the remaining differences, i.e. the much larger magnitude of initial residual errors and the still reduced convergence rate, the solver's convergence is investigated for each latitude separately.Given the insights from the last paragraph, from now on residual error fields are always calculated from a double precision copy of Φ.The reason being that for the mixed- Fig. 7. Convergence rates of the RHW4 (left) and the orographic flow test-case (right) for the shallow-water model using single precision.In contrast to figures 5 c) and d), the residual error fields are however calculated from a double precision copy of the initial fluid thickness Φ 0 that is updated by the solver increments after each solver iteration.
precision model, which we ultimately aim to design, to behave similarly to the double precision elliptic solver variant, an increase in precision appears to be unavoidable for the fluid thickness variable Φ.
Figures 8 a)-d) show the convergence rates for each latitude.Here only results for the orographic flow are shown as the RHW4 test-case behaves qualitatively similar.The differences in the L 2 -norm of the residual values are only visible near the poles, where the single precision model's values are three to four orders of magnitude larger than for double precision for all solver iterations throughout the simulation.For the other latitudes, the L 2 -norms of the residual error fields are however indistinguishable by eye for single and double precision.When the solver reaches convergence, see figure 8 b) and d), the double precision elliptic solver has managed to consistently reduce the L 2 -norm of the initial residual values by five orders of magnitude near the poles and six orders of magnitude in the rest of the computational domain.In comparison, in the single precision model variant the initial values are decreased by about six orders of magnitude everywhere.
The accumulation of numerical errors near the poles of the single precision model can be replicated in the double precision model variant when the elliptic solver's initial residual r 0 is calculated with a fluid thickness Φ 0 that was truncated to single precision.Even if all model calculations are performed in double precision arithmetic, the truncation leads to numerical errors in the L 2 -norm of the initial residual error field r 0 on the order of 10 −2 for the RHW4 and 10 −3 for the orographic flow test-case when compared to a double precision solver without precision truncation.The spatial distribution of these numerical errors is illustrated in figure 9.The relative error in the L 2 -norm is around 10 −5 for the majority of latitudes for both test-cases, while the relative numerical errors in the L 2 -norm at the poles can easily be as large as 100%.This shows how badly conditioned the linear operator L is near the poles.The resulting solver convergence of the truncated elliptic solver is illustrated in figures 10, showing a clear degradation when compared to the double precision elliptic solver from figures 5 a) and b).The convergence behaviour is instead similar to the single precision elliptic solver variant that was observed in figure 7, especially for the first two solver iterations.The precision of the fluid thickness variable Φ 0 is thus of major importance for the calculation of the initial residual error field r 0 and the subsequent convergence behaviour of the elliptic solver.For substantiation, additional experiments are run using stronger polar absorbers.For the RHW4, the single precision model variant is run with α = 1/(2∆t) but it is found that, although the numerical errors in the L 2 -norm of the initial residual is reduced by half, solver convergence is not improved.Fig. 10.Convergence rates of the RHW4 (left) and the orographic flow test-case (right) for the double precision shallow-water model variant where, in contrast to figures 5 a) and b), the elliptic solver's initial residual r 0 is calculated with a fluid thickness variable Φ 0 that was truncated to single precision.

Mixed-precision Shallow-Water Model
Here, a mixed-precision model is described that shows a very similar convergence behaviour when compared to the double precision variant while performing as many calculations as possible in single precision.
First, based on our previous analysis, the fluid thickness field Φ used for the initial residual r 0 calculation in algorithmic step 1 of the elliptic solver and the update of the model variable Φ in step 2 should be kept in double precision.This motivates an approach to keep the prognostic model variables Φ, Q x , and Q y in double precision.The variables are kept and updated in double precision while the costly part, the calculation of their respective tendencies within a time-step, are calculated in single precision arithmetic.Updating the fluid thickness field Φ ν directly in algorithmic step 2 is replaced by accumulating the solver increments to fluid thickness in a single precision variable ∆Φ ν which is then added to the fluid thickness field Φ n when the elliptic solver is exited.This ensures that summing up the solver increments can still be performed in single precision as the value range of ∆Φ ν is on the order of 10 1 m which means that adding solver increments of as low as 10 −7 m is possible without excessive rounding error.Second, the initial residual r 0 calculation is performed in double precision to ensure that the relative error in r 0 is smaller than .And third, we find that the forces R i n+1 -the last term in equation ( 7)-need to be calculated in double precision to avoid numerical errors at the poles.Again, the fluid thickness field Φ plays a crucial role as, similarly to the calculation of r 0 , the pressure gradient term from R i n+1 is responsible for a large fraction of the numerical errors at the poles.
The majority of the model is still performed in single precision.This includes the entire second-order MPDATA operator (first term on the rhs of ( 7)), and the calculation of the advective velocities v n+0.5 .These precision choices already define most of the mixed-precision shallowwater model.Only the precision choice for the elliptic solver cycle-algorithmic steps 2 to 4-still needs to be discussed.Here, three options are explored: • For option one, double precision is employed in the entire elliptic solver.As a result solver convergence is indistinguishable from the double precision shallow-water model (figures 5 a), b) and figures 8 a)-b)).• In the second option, single precision is used throughout the entire elliptic solver cycle.The consequence of this choice is that the latitude-wise L 2norms of the final residual error fields show a reduction of the convergence rate near the poles.The error is smaller (about an order of magnitude) when compared to results from figure 8 d), but still clearly visible.However, the reduced solver convergence at the poles is not causing an increase of the initial residual error fields near the poles.Thus one can conclude that the rest of the mixed-precision shallow-water model time-step is not affected, possibly also due to the presence of the polar absorber.For the orographic flow test-case, overall solver convergence is reduced by up to an order of magnitude.For the RHW4, using single precision reduces the overall solver convergence by about 2 orders of magnitude, which is mainly due to a reduced solver convergence rate near the poles.• The third option is a blend of the first two options.For this option, the elliptic solver cycle employs single precision everywhere except for a three zonal band wide range closest to the North-and Southpole respectively where all calculatuions are performed in double precision.With this choice, solver convergence is indistinguishable from double precision for the orographic flow test-case.For the RHW4, there are small differences found for the fourth solver iteration only.As a result, overall loss in solver convergence is at most 1 order of magnitude when compared to double precision.
These three options are all viable and it is ultimately the modeller's choice which of the options to choose.Here, we decide to proceed with the single precision elliptic solver cycle (option two).There are strong arguments for this choice.The disadvantage of a slightly reduced overall convergence rate in the L 2 -norm seems insignificant as it is clearly understood where this reduction stems from, and there is no negative impact on the shallow-water model's or the elliptic solver's behaviour in any way.Especially since the poles are regions where the polar absorbers are acting as a regularizing element anyhow.
The elliptic solver convergence of the resulting mixed-precision shallow-water model is shown in figure 5 e)-f) and figure 8 e)-f).The global, as well as latitude-wise, representation of the solver convergence show that much of the behaviour of the double precision model variant is restored.Figure 8 e) shows that there is no accumulation of numerical errors at the poles anymore for the orographic wave test-case (RHW4 not shown but qualitatively the same).As discussed earlier, the convergence rate at the poles is slightly reduced, see figure 8 f).The solver's overall convergence, figure 5 e)-f), is then completely restored up to the second and third solver iteration for the RHW4 and the orographic flow test-case respectively while differences remain for subsequent iterations.Concerning criterion (iii), the difference in deviations from the genuine solution is one order of magnitude smaller when compared to the difference found between single precision and double precision from figures 6 e)-f) (not shown here).

Reformulation of the Initial Residual r 0 Calculation
Based on figure 9, a straightforward single precision calculation of r 0 was shown insufficient.In the previous section, double precision was instead used for this caclulation.However, since the calculation of r 0 is a computationally expensive operation, there is a strong incentive to explore further precision reduction via other means.
Rather than using higher precision, for some Krylov subspace methods a popular option to manage the accumulation of round-off errors in the residual error field is a complete recalculation of the residual after a set number of solver iterations ( [49]).However, each recalculation adds another application of the linear operator L to the total cost of the solver.In our setup we only encounter up to 4 solver iterations until solver convergence, which renders a complete recalculation of the residual values to be computationally inefficient.Additionally, if the numerical errors in r 0 become as large as the value itself near the poles, recalculation of r 0 becomes meaningless.
Here, we explore a different direction.The numerical errors in r 0 ultimately result from the truncation of the fluid thickness field Φ to single precision.We present how splitting parts of the residual calculation into the form of a base part plus a correction term can significantly reduce the numerical error in r 0 .Since the operator L is linear by design reformulating the calculation of r 0 can be done with minimal effort.
As a preparation for splitting the operator calculation, first the fluid thickness variable Φ is split into two parts.Instead of a double precision variable Φ, the fluid thickness field is now chosen to be described by two single precision fields: Φ ≈ Φ + Φ, where Φ denotes a correction to some to-be-defined base fluid thickness field Φ.
At the beginning of the simulation, Φ and Φ are set via the following procedure: (1) Φ is set to the initial fluid thickness truncated to single precision (2) Φ is then defined as the difference calculated in double precision arithmetic between Φ and Φ 0 truncated to single precision The average value of Φ is then on the order or 10 4 m to 10 5 m, while the average value of Φ is 10 −3 m-a difference of 7 to 8 orders of magnitude which is consistent with the machine epsilon of single precision.The relative error of the sum Φ + Φ when calculated in double precision arithmetic is on the order of 10 −15 which is consistent with the double precision machine epsilon.
During the simulation, the solver increments to fluid thickness are added to Φ each time-step.Thus, the average values of Φ increase in time which increases the relative error in Φ and by definition also of the sum Φ+ Φ.After about 100 time-steps, the average values of Φ are on the order of 10 1 m, and the relative error of the sum Φ + Φ performed in double precision arithmetic increases to about 10 −11 .If this continued further, eventually the relative errors of the sum Φ + Φ would reach the single precision machine epsilon.
As a result, in our approach we decide to update Φ after every 100 time-steps.This is a good compromise between precision and computational overhead.The procedure to update Φ and Φ works as follows: (1) Temporarily store the sum in a double precision variable Φ DP (2) Φ is set to the initial fluid thickness truncated to single precision Φ = trunc(Φ DP ). ( (3) Φ is then defined as the difference calculated in double precision arithmetic between Φ and Φ 0 truncated to single precision Since this update happens so rarely throughout the model simulation, the computational overhead is negligible.
So far, the split of a double precision variable into two single precision variables has not resulted in any compuational gains.The trick to make this approach computationally efficient lies in rewriting the gradient calculation of Φ, i.e. the partial derivatives ∂Φ ∂x J , in the linear operator L of definition (11).The rewritten linear operator reads as following: In this, Φ SP = Φ + Φ is the fluid thickness in single precision.The ∂ Φ ∂x J only change every 100 time-steps and can thus be precomputed and stored.Compared to the original definition (11), the operator (20) consists of two more floating point operations per grid-point-the addition of the partial derivatives-but each of these is now performed in single precision arithmetic.
Comparing figures 9 and the latitude-wise relative numerical errors when r 0 is calculated using the single precision operator (20), see figures 11, the relative numerical error is strongly reduced.Near the poles, the reduction in relative errors is 4 orders of magnitude, while for the rest of the computational domain it is about 2 orders of magnitude, reaching the theoretical limit given by the single precision machine epsilon for most of the domain.
Using the new operator (20) in conjunction with the mixed-precision model of §4.2 reveals that using the new operator throughout the entire computational domain is not desirable.This is because, although the relative error in each calculation of the initial residual error field r 0 is greatly reduced near the poles, the remaining errors are exacerbated strongly due to the use of single precision in algorithmic steps 2-4.This leads to a vastly reduced convergence rate near the poles, with the latitude-wise L 2 −norm of the final residual errors being up to 10 −1 for both test-cases, looking very similarly to figure 8 d) that was found for the full single precision shallow-water model variant.As a consequence, the latitude-wise L 2 -norm of the initial residual error field r 0 near the poles can be as much as one order of magnitude larger than the values found for the mixed-precision shallow-water model of §4.2, see again figure 8 e).
To avoid this accumulation of numerical errors near the poles, from hereon the standard operator (11) is instead used for a three zonal band wide range closest to the North-and Southpole respectively, where all calculatuions are performed in double precision.Model behaviour is then completely recovered to that of the mixed-precision shallow-water model of §4.2.

Mixed-precision with Half Precision for the Elliptic Solver
In this section the aim is to take down precision to half precision for as many parts of the elliptic solver as possible while still satisfying a reasonably high solver accuracy.
The initial residual error r 0 calculation in algorithmic step 1 is not considered for half precision arithmetic as the machine epsilon of half precision is 9.77•10 −4 which means it is far larger than our targeted solver accuracy.The discussion in §4.2.1 thus indicates single precision to be the limit within this setting.
The focus is here primarily on the preconditioning and the application of the linear operator (11) in algorithmic steps 1 and 3.This is in part motivated by the fact that these steps are by far the most computationally intensive parts of the elliptic solver, see the normalized runtimes for GCR(3) next to the flow chart in figure 1.It is found that these parts of the solver can indeed be taken down to half precision for the most part.Two exceptions from this rule however need to be made.The first exception concerns the last operation of the linear operator (11) where CΦ is subtracted from the previous terms.This calculation is done in single precision because CΦ and the remaining terms of the linear operator (11) are found to be of similar size and the subtraction of both terms introduces large numerical errors if done in half precision.The second exception is the area close to the poles.Since the operator ( 11) is ill-conditioned near the poles, half precision is found to be crititcal in these regions.Thus, for grid-points within a three zonal-band wide range closest to the North-and Southpole all calculations of the preconditioner and the linear operator are performed in single precision.
In comparison to the preconditioning and the application of the linear operator, algorithmic steps 2 and 4 are computationally cheap in terms of the normalized runtimes, consisting only of global sums and variable updates.The large range in the normalized cost of step 4 is due to the size of the sum changing with ν.Concerning the updates of the residual error field r ν+1 and the accumulation of solver increments to fluid thickness in the variable ∆Φ ν+1 , both are kept in single precision.Both require precision higher than the machine epsilon of half precision to satisfy the solver accuracy.Concerning L(p ν+1 ), experiments where the sum is calculated in half precision arithmetic immediately reduces the attainable solver accuracy to 4 orders of magnitude for both test-cases.Additionally all the summands are only available in single precision, which makes single precision the preferred option.However, the sum to obtain p ν+1 can be performed in half precision.Since all the summands are in half precision as well this also makes sense from a computational point-of-view.
Concerning the global sums for β and α l in algorithmic steps 2 and 4, single precision is the straightforward choice for these caclulations.This is simply because all of the involved input variables are single precision and truncating them to half precision first would likely result in a computational overhead.The elliptic solver convergence of the resulting mixed-precision shallow-water model is illustrated in figure 12 a)-d).The global, as well as latitude-wise, representation of the solver convergence show that much of the behaviour of the double precision model variant is preserved.Even though precision is reduced to such a low level, the overall accuracy reached within four solver iterations is still about 1 • 10 −5 for the RHW4 and 5 • 10 −5 for the orographic flow test-case, see figures 12 a)-b).For both test-cases, there is a clear degradation of solver convergence for the last solver iteration as there is no further reduction in the L2-norm of the residual error fields.For the RHW4, solver convergence is already slightly reduced for the third iteration similarly to what could be found for the mixed-precision shallow-water model in figure 5 e).The plateuing for the last solver iteration indicates that the accuracy limit has been reached for this mixed-precision elliptic solver variant.
Figure 12 c) shows that there is no accumulation of numerical errors at the poles for the orographic wave test-case.In Figure 12 d), the impact of enforcing higher precision in the vicinity of the poles becomes visible.The final values for the latitude-wise residual error L2-norm near the poles are smaller in comparison to the rest of the computational domain, where the effect of low precision manifests in overall larger values when compared to previous results, see again figures 8 b), d), f).This however means that the convergence rate is now very consistent throughout the computational domain, with similar residual error reductions at all latitudes.The corresponding figures for the RHW4 are not shown but look qualitatively the same.
Although the final residual error values are generally larger, the difference in deviations from the genuine solution-criterion (iii)-is comparable to the mixed-precision shallow-water model introduced in §4.2 for both test-cases, i.e. the difference is one order of magnitude smaller than the difference found between single precision and double precision from figures 6 e)-f).
For completeness, the two coarser model resolutions 256×128 and 128×64 were run using the mixed-precision shallow-water model described in this section.As expected, it is found that the results are qualitatively the same (not shown here) and the analysis that has been done for the 512 × 256 applies as well to the coarser model resolutions.The vicinity near the poles for which higher precision levels are used is chosen to be two and one for the 256 × 128 and the 128 × 64 model resolution respectively.

Expected Computational Performance of the Mixed-Precision Solver
Ultimately, we are interested in increasing the solver's, and thus the entire model's, efficiency.For the simulations done in §4.1 and §4.2, actual time measurements can be performed to estimate the computational savings.The single precision shallow-water model of §4.1 has a shorter time-to-solution than the double precision shallow-water model, the model runtime is reduced by 35 %.The mixed-precision shallow-water model variant of §4.2 was 30 % faster than the double precision variant.For larger problem sizes, these differences between double and single precision performance are generally expected to become even larger.For instance, for state-of-the-art W&C models single precision simulations are typically reporting a decrease of computing time by 40% in comparison to simulations in double precision [50,13].
For the mixed-precision elliptic solver with half-precision arithmetic ( §4.3), it is, however, non-trivial to perform a reliable and generic estimate of potential savings from the mixed-precision approach for the elliptic solver.This is due to the lack of robust and precise performance models for solvers.Furthermore, the various possible combinations of software and hardware in compute systems, such as different types of computing chips (CPU, GPU, TPUs, FP-GAs...), different compilers, the memory layout (cache sizes and hierarchy, shared and/or distributed memory access...), connectivity between comput-ing nodes (network topology, bandwidth...) are simply too abundant to make general statements.
To provide a rough estimate of potential savings, we here evaluate the number of floating point operations and their respective precision level.In accordance with the reduction in computational bits from double precision, we assume a speed-up of factor 2 for a reduction to single precision level and respective factor of 4 for half precision.These assumptions are in accordance with the peak performance of modern processors and is also justified by the increase of cache, memory and bandwidth efficiency.Some hardware accelerators that are optimised for machine learning, such as TensorCores on NVIDIA Volta GPUs, can even get a much higher relative increase in peak performance for half precision arithmetic.However, since modern weather and climate models are typically not able to utilise peak performance, these numbers may turn out to be too optimistic if a model simulation is limited by latency of communication or if the operations are not vectorized.
To estimate the total reduction of runtime we multiply the number of floating point operations that has been performed with the three different precision levels (double, single and half) with their respective estimated cost (1.0,0.5, 0.25) for the different algorithmic steps.Including the exceptions for precision levels that we made in §4.3, the overall runtime can be expected to reduce by a factor of 3.3, when comparing the mixed-precision elliptic solver with half-precision to the double precision variant.To further put these numbers into perspective, completing the first solver iteration with the double precision elliptic solver would already be as computationally expensive as running the entire four solver iterations with the mixed-precision solver.The result by the UK MetOffice [13] that a single precision solver is yielding a speed-up close to a factor of two in comparison to a double precision solver indicates that our cost estimate may be achievable on real hardware.

Conclusion
Mixed-precision in a representative semi-implicit shallow-water model in general and its preconditioned elliptic solver in particular is found to be possible and overall advantagous when compared to the default double precision option.The described mixed-precision shallow-water model using double and single precision clearly outperforms the double precision model in terms of time-to-solution while maintaining the same solution quality and similar solver convergence behaviour.Concerning the elliptic solver, most of the computationally expensive components of the solver can be performed almost entirely in half precision arithmetic.This being said, it is found that the special structure of our problem does not permit a naive reduction of precision.We should expect the exacerbation of artificial modes in the prognostic variable fields of the model when going to too low precision levels in the elliptic solver.In our configuration, precision reduction close to the poles-the two grid-singularities that were presentcaused spurious behavior even for the single precision shallow-water model.Overall, as the solver precision is further and further reduced to its limits throughout this paper from double precision to precision levels as low as half precision, a very clear progression is seen in terms of an increase in numerical errors, reduced convergence rates, and the increasing need for assigning higher precision levels for parts of the computational domain.
On the other hand, it is shown that given some knowledge about the specific structure of the application, issues with reduced precision can in fact be mitigated by using higher precision arithmetic in parts of the computational domain.The path towards a computationally efficient mixed-precision elliptic solver can be a challenging one though, as it requires a holistic way of thinking about the interplay between numerical errors from a local precision reduction, to the resulting global model errors, and the solver's overall convergence rate.Understanding this interplay may sometimes require in-depth analysis and in some cases even the reformulation of parts of the existing solver algorithm.The approach we found might also be beneficial in other applications.
All in all, results are similar to [3,28,29,30,2,31,32,4] in that the preconditioning step is very robust against reducing precision and seems to be the part of the elliptic solver where performance gains can be expected while many of the other parts can be quite sensitive to reducing precision.It is an important finding that this general structure can also be found in our specific problem from geophysical fluid dynamics, using a conjugated residual solver with a preconditioner based on tridiagonal inversion, a key component in 3D atmospheric semi-implicit grid-point models.
A reccurent theme in previous publications on reduced precision in W&C models is the motivation that there are irreducible uncertainties (such as initial condition uncertainties, stochastic representations of unresolved subgrid-scale processes) present in these models which gives a justification for accepting a certain amount of additional model errors, in the form of reducing precision, in order to increase the overall model's computational efficiency [9,10,8,7,6,51].It should be noted that in the elliptic solver this idea of accepting a certain amount of additional model error to increase computational efficiency is at the very core of the approach and is inherently represented and used in the process of setting the solver accuracy.
In summary, given our results, we do not see any major obstacle that would prevent us from using reduced precision arithmetic for an elliptic solver ap-plied to a full 3-dimensional atmospheric model.There is much experience within the atmospheric modeling community concerning the behavior of the height/pressure solve via iterative methods on various computational grids which can be used to circumvent possible pitfalls when reducing precision and instead guide the path towards efficient mixed-precision elliptic solvers for W&C models.

Fig. 1 .
Fig.1.From left to right: a flow chart of our preconditioned GCR(k) implementation; normalized runtimes of the four distinct algorithmic steps for GCR(3) (All performance measurements for this paper have been performend on a single core (CPU: Intel i5, Fortran compiler: GCC 9.3.0.) and they agree with the ratio of the number of occurring floating point operations); and the choice of base precision for each algorithmic step (The choices are described in more detail in §4.3).

Fig. 2 .
Fig. 2. RHW4, instantaneous normalised free-surface perturbations, (Φ(x, y, t) − H o )/H o (H o = 8 km), with imposed corresponding flow vectors.From the top to bottom, respectively, are the initial condition and the numerical results after 7.38 and 14.76 days, corresponding to one and two periods after which the analytic RHW4 propagates by one and two full wavelengths.The perturbation isolines are plotted with the contour interval 0.025 (zero contour lines are not shown), and the reference 100 m/s flow arrow is shown in the lowest right corner.

Fig. 3 .
Fig. 3. RHW4, the normalised free-surface departure from the analytic result, (Φ(x, y, t) − Φ 0 (x − Vt, y))/H o , with imposed corresponding departures of the flow vectors.The top and bottom panels show the results after t = 7.38 and t = 14.76 days, respectively.The contour interval and the reference flow arrow are the same as in figure 2; zero contour lines are not shown and the dashed contours correspond to negative values.

Fig. 4 .
Fig. 4. Normalised free-surface height perturbations (H(x, y, t) − H(x, y, 0))/H o (here H = Φ + H 0 and H o = 8 km), with imposed corresponding flow vectors at t = 14.76 days.The perturbation isolines are plotted with the contour interval 0.00625 (dashed lines are for negative values and zero contour lines are not shown), and the reference 20 m/s flow arrow is shown in the lowest right corner; the contours of the orography H 0 (x, y) span the range 3 to 7003 m with the 1200 m interval.

4 • 10 −2 m 3 • 10 −5 m 8 • 10 −4 m Table 1
Size of the increments for fluid thickness Φ 0 for the RHW4 and the orographic flow test-case at different iterations of the elliptic solver.The upper values provide the average magnitude and the lower values the maximum value throughout the model run.

Fig. 5 .
Fig. 5. Convergence rates of the RHW4 (left) and the orographic flow test-case (right) for three variants of the shallow-water model, a) and b) use double precision arithmetic, c) and d) single precision, e) and f) a mixture of double and single precision.Convergence is shown for 2-day time interval snapshots, and darker colours indicate later simulation time.

Fig. 6 .
Fig. 6.Square-root of kinetic energy deviations in m s from the respective "genuine" model solution at t = 14.76 days for the RHW4 in a) double precision, c) single precision, and the orographic flow test-case in b) double precision, d) single precision.Figures e) and f) show the respective difference in kinetic energy deviations between the double precision and single precision shallow-water model solution in logarithmic scale.

Fig. 8 .
Fig. 8. Latitude-wise inital and final residual in the L 2 -norm for the orographic flow test-case for three variants of the shallow-water model, a) and b) use double precision arithmetic, c) and d) single precision, e) and f) a mixture of double and single precision.

Fig. 9 .
Fig. 9. Latitude-wise relative numerical error in the L 2 -norm of the inital residual r 0 when the elliptic solver's initial residual is calculated with a fluid thickness variable Φ 0 that was truncated to single precision.Results are shown for the RHW4 (left) and the orographic flow test-case (right) in 2-day time interval snapshots; darker colours indicate later simulation time.

Fig. 11 .
Fig.11.Latitude-wise relative numerical error of the inital residual r 0 calculated with the operator(20) with respect to the original operator(11) in double precision arithmetic in the L 2 -norm for the RHW4 (left) and the orographic flow test-case (right) in 2-day time interval snapshots; darker colours indicate later simulation time.

Fig. 12 .
Fig. 12. Convergence rates of the RHW4 a) and the orographic flow test-case b) for the mixed-precision elliptic solver variant with half precision, and the corresponding latitude-wise inital c) and final d) residual error in the L 2 -norm for the orographic flow test-case.